[GitHub] [hudi] Liulietong commented on a change in pull request #2584: [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read
Liulietong commented on a change in pull request #2584: URL: https://github.com/apache/hudi/pull/2584#discussion_r583434679 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java ## @@ -104,7 +104,7 @@ public boolean hasNext() { throw new HoodieIOException("unable to initialize read with log file ", io); } LOG.info("Moving to the next reader for logfile " + currentReader.getLogFile()); - return this.currentReader.hasNext(); + return this.currentReader.hasNext() || hasNext(); Review comment: 1. When 'spark.speculation' is enabled, two tasks may try to append to the same log file; the second one creates a new log file because it cannot obtain the file lease, and it leaves a zero-size log file behind when the first task succeeds. 2. Good idea This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
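The behavior under discussion can be sketched outside of Hudi: when the current reader is exhausted (for instance, a zero-size log file left behind by a speculative task), `hasNext()` should fall through to the next reader instead of returning false. The `MultiReader` class below is an illustrative stand-in for this pattern, not Hudi's actual `HoodieLogFormatReader`:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Illustrative stand-in for a chain of log-file readers, some of which
// may be empty (e.g. a zero-size file left behind by a speculative task).
class MultiReader<T> implements Iterator<T> {
  private final Deque<Iterator<T>> readers;
  private Iterator<T> current;

  MultiReader(List<Iterator<T>> parts) {
    this.readers = new ArrayDeque<>(parts);
    this.current = readers.poll(); // null when there are no readers at all
  }

  @Override
  public boolean hasNext() {
    if (current == null) {
      return false;
    }
    if (current.hasNext()) {
      return true;
    }
    // Current reader is exhausted (possibly empty from the start):
    // advance to the next one and recurse instead of returning false,
    // so an empty reader does not hide the remaining ones.
    current = readers.poll();
    return hasNext();
  }

  @Override
  public T next() {
    return current.next();
  }
}
```

With an empty iterator in the middle of the list, the elements after it are still returned, which is the behavior the one-line `|| hasNext()` change restores.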
[GitHub] [hudi] codecov-io edited a comment on pull request #2607: [HUDI-1643] Hudi observability - framework to report stats from execu…
codecov-io edited a comment on pull request #2607: URL: https://github.com/apache/hudi/pull/2607#issuecomment-786454326 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=h1) Report > Merging [#2607](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=desc) (ab93c26) into [master](https://codecov.io/gh/apache/hudi/commit/022df0d1b134422f7b6f305cd7ec04b25caa23f0?el=desc) (022df0d) will **increase** coverage by `18.28%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2607/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=tree)
```diff
@@             Coverage Diff              @@
##             master    #2607       +/-   ##
=============================================
+ Coverage     51.26%   69.54%    +18.28%
+ Complexity     3241      363      -2878
=============================================
  Files           438       53       -385
  Lines         20126     1944     -18182
  Branches       2079      235      -1844
=============================================
- Hits          10318     1352      -8966
+ Misses         8955      458      -8497
+ Partials        853      134       -719
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.54% <ø> (+0.10%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...nal/HoodieBulkInsertDataInternalWriterFactory.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmsyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2ludGVybmFsL0hvb2RpZUJ1bGtJbnNlcnREYXRhSW50ZXJuYWxXcml0ZXJGYWN0b3J5LmphdmE=) | | | | | [...in/java/org/apache/hudi/common/model/BaseFile.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0Jhc2VGaWxlLmphdmE=) | | | | | [...di/common/table/timeline/HoodieActiveTimeline.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZUFjdGl2ZVRpbWVsaW5lLmphdmE=) | | | | | [...oop/realtime/HoodieParquetRealtimeInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL0hvb2RpZVBhcnF1ZXRSZWFsdGltZUlucHV0Rm9ybWF0LmphdmE=) | | | | | [...mmon/table/log/HoodieUnMergedLogRecordScanner.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVVbk1lcmdlZExvZ1JlY29yZFNjYW5uZXIuamF2YQ==) | | | | | [...e/timeline/versioning/clean/CleanPlanMigrator.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5QbGFuTWlncmF0b3IuamF2YQ==) | | | | | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh) | | | | | 
[...common/table/view/AbstractTableFileSystemView.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvQWJzdHJhY3RUYWJsZUZpbGVTeXN0ZW1WaWV3LmphdmE=) | | | | | [.../hadoop/utils/HoodieRealtimeRecordReaderUtils.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZVJlYWx0aW1lUmVjb3JkUmVhZGVyVXRpbHMuamF2YQ==) | | | | | [...util/jvm/OpenJ9MemoryLayoutSpecification32bit.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvanZtL09wZW5KOU1lbW9yeUxheW91dFNwZWNpZmljYXRpb24zMmJpdC5qYXZh) | | | | | ... and [376 more](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree-more) | |
[GitHub] [hudi] codecov-io commented on pull request #2607: [HUDI-1643] Hudi observability - framework to report stats from execu…
codecov-io commented on pull request #2607: URL: https://github.com/apache/hudi/pull/2607#issuecomment-786454326 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=h1) Report > Merging [#2607](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=desc) (ab93c26) into [master](https://codecov.io/gh/apache/hudi/commit/022df0d1b134422f7b6f305cd7ec04b25caa23f0?el=desc) (022df0d) will **decrease** coverage by `41.64%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2607/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=tree)
```diff
@@             Coverage Diff              @@
##             master    #2607       +/-   ##
=============================================
- Coverage     51.26%    9.61%    -41.65%
+ Complexity     3241       48      -3193
=============================================
  Files           438       53       -385
  Lines         20126     1944     -18182
  Branches       2079      235      -1844
=============================================
- Hits          10318      187     -10131
+ Misses         8955     1744      -7211
+ Partials        853       13       -840
```
| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.61% <ø> (-59.83%)` | `0.00 <ø> (ø)` | |
Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2607?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2607/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>
[GitHub] [hudi] t0il3ts0ap edited a comment on issue #2589: [SUPPORT] Issue with adding column while running deltastreamer with kafka source.
t0il3ts0ap edited a comment on issue #2589: URL: https://github.com/apache/hudi/issues/2589#issuecomment-786029402 @satishkotha Ran again on a fresh table; still the same issue.

SparkSubmit:
```
spark-submit --master yarn \
  --packages org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-utilities-bundle_2.12:0.7.0 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.task.maxFailures=5 \
  --conf spark.rdd.compress=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.executor.heartbeatInterval=120s \
  --conf spark.network.timeout=600s \
  --conf spark.yarn.max.executor.failures=5 \
  --conf spark.sql.catalogImplementation=hive \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=/home/hadoop/log4j-spark.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=/home/hadoop/log4j-spark.properties" \
  --deploy-mode client \
  s3://navi-emr-poc/delta-streamer-test/jars/deltastreamer-addons-1.0-SNAPSHOT.jar \
  --enable-sync \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor \
  --hoodie-conf hoodie.parquet.compression.codec=snappy \
  --hoodie-conf partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable=false \
  --hoodie-conf auto.offset.reset=latest \
  --hoodie-conf hoodie.avro.schema.validate=true \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
  --schemaprovider-class org.apache.hudi.utilities.schema.NullTargetSchemaRegistryProvider \
  --props s3://navi-emr-poc/delta-streamer-test/config/kafka-source-nonprod.properties \
  --transformer-class com.navi.transform.DebeziumTransformer \
  --continuous \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=https://dev-dataplatform-schema-registry.np.navi-tech.in/subjects/to_be_deleted_service.public.accounts-value/versions/latest \
  --hoodie-conf hoodie.datasource.hive_sync.database=to_be_deleted_service \
  --hoodie-conf hoodie.datasource.hive_sync.table=accounts \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id \
  --hoodie-conf hoodie.datasource.write.precombine.field=__lsn \
  --hoodie-conf hoodie.datasource.write.partitionpath.field='' \
  --hoodie-conf hoodie.deltastreamer.source.kafka.topic=to_be_deleted_service.public.accounts \
  --hoodie-conf group.id=delta-streamer-to_be_deleted_service-accounts \
  --source-ordering-field __lsn \
  --target-base-path s3://navi-emr-poc/raw-data/to_be_deleted_service/accounts \
  --target-table accounts
```
Transformer Code:
```
public class DebeziumTransformer implements Transformer {

  public Dataset apply(JavaSparkContext javaSparkContext, SparkSession sparkSession,
      Dataset dataset, TypedProperties typedProperties) {
    Dataset transformedDataset = dataset
        .withColumn("__deleted", dataset.col("__deleted").cast(DataTypes.BooleanType))
        .withColumnRenamed("__deleted", "_hoodie_is_deleted")
        .drop("__op", "__source_ts_ms");
    log.info("TRANSFORMER SCHEMA STARTS");
    transformedDataset.printSchema();
    transformedDataset.show();
    log.info("TRANSFORMER SCHEMA ENDS");
    return transformedDataset;
  }
}
```
When I add the column, Debezium updates the schema registry instantaneously and new records start flowing. It's possible that DeltaStreamer receives records with the new schema before it even hits the schema registry.
```
Caused by: org.apache.avro.AvroTypeException: Found hoodie.source.hoodie_source, expecting hoodie.source.hoodie_source, missing required field test
```
Attaching logs: [alog.txt](https://github.com/apache/hudi/files/6044367/alog.txt)
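The `AvroTypeException` above is characteristic of decoding records written with the old schema against a new reader schema whose added field has no default. One common mitigation (an assumption about this setup, not a confirmed fix for the issue) is to declare newly added columns as nullable with a default in the Avro schema, so records produced before the schema change remain decodable. The record name and namespace below follow the error message (`hoodie.source.hoodie_source`) and the field names follow the job config; the overall schema shape is hypothetical:

```json
{
  "type": "record",
  "name": "hoodie_source",
  "namespace": "hoodie.source",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "__lsn", "type": ["null", "long"], "default": null},
    {"name": "test", "type": ["null", "string"], "default": null}
  ]
}
```

With `"default": null` on `test`, Avro schema resolution fills the field for old records instead of failing with "missing required field test".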
[GitHub] [hudi] liujinhui1994 commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
liujinhui1994 commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-786452099 I will add the unit tests; please review after that.
[jira] [Updated] (HUDI-1643) [Hudi Observability] Framework for reporting stats from executors
[ https://issues.apache.org/jira/browse/HUDI-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-1643:
---------------------------------
    Labels: pull-request-available  (was: )

> [Hudi Observability] Framework for reporting stats from executors
> -----------------------------------------------------------------
>
>                 Key: HUDI-1643
>                 URL: https://issues.apache.org/jira/browse/HUDI-1643
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Common Core
>            Reporter: Balajee Nagasubramaniam
>            Priority: Major
>              Labels: pull-request-available
>
> Hudi Observability framework to report stats from executors, using the distributed registry.
>    - Hudi Write Stage Performance stats.
>    - Bloom Index stage stats.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[GitHub] [hudi] yanghua commented on a change in pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For Hoo…
yanghua commented on a change in pull request #2596: URL: https://github.com/apache/hudi/pull/2596#discussion_r583407438 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -258,4 +260,142 @@ public String getArchivelogFolder() { public Properties getProperties() { return props; } + + public static PropertyBuilder propertyBuilder() { +return new PropertyBuilder(); + } + + public static class PropertyBuilder { + +private HoodieTableType tableType; + Review comment: If we do not add comments, let us remove the empty line to make the code more compact. wdyt? ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -258,4 +260,142 @@ public String getArchivelogFolder() { public Properties getProperties() { return props; } + + public static PropertyBuilder propertyBuilder() { +return new PropertyBuilder(); + } + + public static class PropertyBuilder { + +private HoodieTableType tableType; + +private String tableName; + +private String archiveLogFolder; + +private String payloadClassName; + +private Integer timelineLayoutVersion; + +private String baseFileFormat; + +private String preCombineField; + +private String bootstrapIndexClass; + +private String bootstrapBasePath; + +private PropertyBuilder() { + +} + +public PropertyBuilder setTableType(HoodieTableType tableType) { + this.tableType = tableType; + return this; +} + +public PropertyBuilder setTableType(String tableType) { + return setTableType(HoodieTableType.valueOf(tableType)); +} + +public PropertyBuilder setTableName(String tableName) { + this.tableName = tableName; + return this; +} + +public PropertyBuilder setArchiveLogFolder(String archiveLogFolder) { + this.archiveLogFolder = archiveLogFolder; + return this; +} + +public PropertyBuilder setPayloadClassName(String payloadClassName) { + this.payloadClassName = payloadClassName; + return this; +} + +public PropertyBuilder setPayloadClass(Class payloadClass) { + return 
setPayloadClassName(payloadClass.getName()); +} + +public PropertyBuilder setTimelineLayoutVersion(Integer timelineLayoutVersion) { + this.timelineLayoutVersion = timelineLayoutVersion; + return this; +} + +public PropertyBuilder setBaseFileFormat(String baseFileFormat) { + this.baseFileFormat = baseFileFormat; + return this; +} + +public PropertyBuilder setPreCombineField(String preCombineField) { + this.preCombineField = preCombineField; + return this; +} + +public PropertyBuilder setBootstrapIndexClass(String bootstrapIndexClass) { + this.bootstrapIndexClass = bootstrapIndexClass; + return this; +} + +public PropertyBuilder setBootstrapBasePath(String bootstrapBasePath) { + this.bootstrapBasePath = bootstrapBasePath; + return this; +} + +public PropertyBuilder fromMetaClient(HoodieTableMetaClient metaClient) { + return setTableType(metaClient.getTableType()) +.setTableName(metaClient.getTableConfig().getTableName()) +.setArchiveLogFolder(metaClient.getArchivePath()) +.setPayloadClassName(metaClient.getTableConfig().getPayloadClass()); +} + +public Properties build() { Review comment: It seems we do not call this method outside the class. Is the major purpose of the inner class to build a `HoodieTableMetaClient` object? If yes, renaming it to a table meta client builder sounds more reasonable.
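The fluent style under review can be exercised roughly as below. The minimal `PropertyBuilder` here is a self-contained sketch of the pattern: it models only a few of the fields from the diff, and the `hoodie.*` property keys are assumed to match Hudi's table config names, which should be double-checked against `HoodieTableConfig`:

```java
import java.util.Properties;

// Self-contained sketch of the builder pattern discussed in the review;
// only a subset of the fields from the diff is modeled.
class PropertyBuilder {
  private String tableType;
  private String tableName;
  private String payloadClassName;

  PropertyBuilder setTableType(String tableType) {
    this.tableType = tableType;
    return this;
  }

  PropertyBuilder setTableName(String tableName) {
    this.tableName = tableName;
    return this;
  }

  PropertyBuilder setPayloadClassName(String payloadClassName) {
    this.payloadClassName = payloadClassName;
    return this;
  }

  Properties build() {
    // A real implementation would validate required fields here.
    Properties props = new Properties();
    if (tableType != null) {
      props.setProperty("hoodie.table.type", tableType);
    }
    if (tableName != null) {
      props.setProperty("hoodie.table.name", tableName);
    }
    if (payloadClassName != null) {
      props.setProperty("hoodie.compaction.payload.class", payloadClassName);
    }
    return props;
  }
}
```

A call site then reads as one fluent chain, e.g. `new PropertyBuilder().setTableType("MERGE_ON_READ").setTableName("accounts").build()`, which is the readability gain over passing a long positional argument list.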
[GitHub] [hudi] nbalajee opened a new pull request #2607: [HUDI-1643] Hudi observability - framework to report stats from execu…
nbalajee opened a new pull request #2607: URL: https://github.com/apache/hudi/pull/2607 …tors ## What is the purpose of the pull request Framework for collecting Hudi Observability stats from the executors. ## Brief change log - Using the distributed registry, report stats from the executors to the driver, to be published using the Graphite reporter. - Report Hudi Write stage performance stats. - Report Hudi BloomIndex stage stats. ## Verify this pull request This change added tests and can be verified as follows: - Added a unit testcase testObservabilityMetricsOnCOW - Manually verified the change by running a job locally. ## Committer checklist - [x] Has a corresponding JIRA in PR title & commit - [x] Commit message is descriptive of the change - [x] CI is green - [x] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] xushiyan commented on a change in pull request #2541: [HUDI-1587] Add latency and freshness support
xushiyan commented on a change in pull request #2541: URL: https://github.com/apache/hudi/pull/2541#discussion_r583403910 ## File path: hudi-common/src/main/java/org/apache/hudi/common/util/DateTimeUtils.java ## @@ -0,0 +1,41 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.common.util; + +import java.time.Instant; +import java.time.format.DateTimeParseException; + +public class DateTimeUtils { + + /** + * Parse input String to a {@link java.time.Instant}. + * @param s Input String should be Epoch time in millisecond or ISO-8601 format. + */ + public static Instant parseDateTime(String s) throws DateTimeParseException { Review comment: @vinothchandar not quite getting it... did you mean putting it in another package, or moving the method to another util class?
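The diff hunk above elides the method body. A minimal sketch of what such a parser could look like, assuming only the contract stated in the Javadoc (epoch milliseconds or ISO-8601 input) — this is not necessarily the PR's actual implementation:

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

public class DateTimeUtils {

  /**
   * Parse the input String to an {@link Instant}.
   * Accepts either epoch time in milliseconds or an ISO-8601 timestamp.
   */
  public static Instant parseDateTime(String s) throws DateTimeParseException {
    try {
      // First try to interpret the input as epoch milliseconds, e.g. "1614297600000".
      return Instant.ofEpochMilli(Long.parseLong(s));
    } catch (NumberFormatException e) {
      // Fall back to ISO-8601, e.g. "2021-02-26T00:00:00Z";
      // Instant.parse throws DateTimeParseException on malformed input.
      return Instant.parse(s);
    }
  }
}
```

Both branches produce a `java.time.Instant`, so callers can compute latency/freshness with `Duration.between` regardless of which input form was supplied.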
[hudi] branch asf-site updated: Travis CI build asf-site
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new c5d50f0  Travis CI build asf-site
c5d50f0 is described below

commit c5d50f09884f2336dd4d512a87213dcc2c5e57b8
Author: CI
AuthorDate: Fri Feb 26 06:05:12 2021 +

    Travis CI build asf-site
---
 content/docs/powered_by.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/powered_by.html b/content/docs/powered_by.html
index 44b80df..a92db54 100644
--- a/content/docs/powered_by.html
+++ b/content/docs/powered_by.html
@@ -483,7 +483,7 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA
 https://www.meetup.com/UberEvents/events/274924537/;>“Meetup talk by Nishith Agarwal” - Uber Data Platforms Meetup, Dec 2020
-https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities;>“Apache Hudi learning series: Understanding Hudi internals By Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal” - Uber Meetup, Virtual, Feb 2021
+https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities;>“Apache Hudi learning series: Understanding Hudi internals” - By Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal, Feb 2021, Uber Meetup
[jira] [Created] (HUDI-1643) [Hudi Observability] Framework for reporting stats from executors
Balajee Nagasubramaniam created HUDI-1643:
---------------------------------------------

             Summary: [Hudi Observability] Framework for reporting stats from executors
                 Key: HUDI-1643
                 URL: https://issues.apache.org/jira/browse/HUDI-1643
             Project: Apache Hudi
          Issue Type: New Feature
          Components: Common Core
            Reporter: Balajee Nagasubramaniam

Hudi Observability framework to report stats from executors, using the distributed registry.
 - Hudi Write Stage Performance stats.
 - Bloom Index stage stats.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[hudi] branch asf-site updated: [MINOR] Fixing slideshare link (#2606)
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/asf-site by this push:
     new 1dd7a41  [MINOR] Fixing slideshare link (#2606)
1dd7a41 is described below

commit 1dd7a4164972a733bdb49e3e9f7fdbbddac571d3
Author: n3nash
AuthorDate: Thu Feb 25 22:02:59 2021 -0800

    [MINOR] Fixing slideshare link (#2606)
---
 docs/_docs/1_4_powered_by.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_docs/1_4_powered_by.md b/docs/_docs/1_4_powered_by.md
index ada616e..9f6694b 100644
--- a/docs/_docs/1_4_powered_by.md
+++ b/docs/_docs/1_4_powered_by.md
@@ -145,7 +145,7 @@ Meanwhile, we build a set of data access standards based on Hudi, which provides
 21. ["Meetup talk by Nishith Agarwal"](https://www.meetup.com/UberEvents/events/274924537/) - Uber Data Platforms Meetup, Dec 2020
-22. ["Apache Hudi learning series: Understanding Hudi internals By Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal"]("https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities;) - Uber Meetup, Virtual, Feb 2021
+22. ["Apache Hudi learning series: Understanding Hudi internals"](https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities) - By Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal, Feb 2021, Uber Meetup
 ## Articles
[GitHub] [hudi] vinothchandar merged pull request #2606: [MINOR] Fixing slideshare link
vinothchandar merged pull request #2606: URL: https://github.com/apache/hudi/pull/2606
[GitHub] [hudi] satishkotha commented on a change in pull request #2584: [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read
satishkotha commented on a change in pull request #2584: URL: https://github.com/apache/hudi/pull/2584#discussion_r583398027 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatReader.java ## @@ -104,7 +104,7 @@ public boolean hasNext() { throw new HoodieIOException("unable to initialize read with log file ", io); } LOG.info("Moving to the next reader for logfile " + currentReader.getLogFile()); - return this.currentReader.hasNext(); + return this.currentReader.hasNext() || hasNext(); Review comment: can we just call hasNext() here to simplify?
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For Hoo…
pengzhiwei2018 commented on a change in pull request #2596: URL: https://github.com/apache/hudi/pull/2596#discussion_r583385132 ## File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java ## @@ -106,10 +106,13 @@ public String createTable( throw new IllegalStateException("Table already existing in path : " + path); } -final HoodieTableType tableType = HoodieTableType.valueOf(tableTypeStr); -HoodieTableMetaClient.initTableType(HoodieCLI.conf, path, tableType, name, archiveFolder, -payloadClass, layoutVersion); - +new HoodieTableConfig.PropertyBuilder() Review comment: Hi @yanghua, I have updated the PR. Please take a look again when you have time. Thanks~
[GitHub] [hudi] n3nash commented on pull request #2606: [MINOR] Fixing slideshare link
n3nash commented on pull request #2606: URL: https://github.com/apache/hudi/pull/2606#issuecomment-786408716 Verified locally that it works.
[GitHub] [hudi] n3nash opened a new pull request #2606: [MINOR] Fixing slideshare link
n3nash opened a new pull request #2606: URL: https://github.com/apache/hudi/pull/2606 Fixing broken link ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] danny0405 commented on a change in pull request #2600: [HUDI-1638] Some improvements to BucketAssignFunction
danny0405 commented on a change in pull request #2600: URL: https://github.com/apache/hudi/pull/2600#discussion_r583369329 ## File path: hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java ## @@ -136,15 +137,10 @@ public void open(Configuration parameters) throws Exception { new SerializableConfiguration(this.hadoopConf), new FlinkTaskContextSupplier(getRuntimeContext())); this.bucketAssigner = new BucketAssigner(context, writeConfig); -List allPartitionPaths = FSUtils.getAllPartitionPaths(this.context, -this.conf.getString(FlinkOptions.PATH), false, false, false); -final int parallelism = getRuntimeContext().getNumberOfParallelSubtasks(); -final int maxParallelism = getRuntimeContext().getMaxNumberOfParallelSubtasks(); -final int taskID = getRuntimeContext().getIndexOfThisSubtask(); -// reference: org.apache.flink.streaming.api.datastream.KeyedStream -this.initialPartitionsToLoad = allPartitionPaths.stream() -.filter(partition -> KeyGroupRangeAssignment.assignKeyToParallelOperator(partition, maxParallelism, parallelism) == taskID) -.collect(Collectors.toList()); + +// initialize and check the partitions load state +loadInitialPartitions(); +checkPartitionsLoaded(); Review comment: Sure, feel free to promote it, thanks ~ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hk-lrzy commented on a change in pull request #2600: [HUDI-1638] Some improvements to BucketAssignFunction
hk-lrzy commented on a change in pull request #2600: URL: https://github.com/apache/hudi/pull/2600#discussion_r583366344 ## File path: hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java ## @@ -136,15 +137,10 @@ public void open(Configuration parameters) throws Exception { new SerializableConfiguration(this.hadoopConf), new FlinkTaskContextSupplier(getRuntimeContext())); this.bucketAssigner = new BucketAssigner(context, writeConfig); -List allPartitionPaths = FSUtils.getAllPartitionPaths(this.context, -this.conf.getString(FlinkOptions.PATH), false, false, false); -final int parallelism = getRuntimeContext().getNumberOfParallelSubtasks(); -final int maxParallelism = getRuntimeContext().getMaxNumberOfParallelSubtasks(); -final int taskID = getRuntimeContext().getIndexOfThisSubtask(); -// reference: org.apache.flink.streaming.api.datastream.KeyedStream -this.initialPartitionsToLoad = allPartitionPaths.stream() -.filter(partition -> KeyGroupRangeAssignment.assignKeyToParallelOperator(partition, maxParallelism, parallelism) == taskID) -.collect(Collectors.toList()); + +// initialize and check the partitions load state +loadInitialPartitions(); +checkPartitionsLoaded(); Review comment: I have some doubts about this, because the current key of the keycontext has not been set, so the key state of flink cannot be accessed in the open method. Should we move this method to processElement? If possible, I can submit a patch. thanks. ``` private void checkKeyNamespacePreconditions(K key, N namespace) { Preconditions.checkNotNull(key, "No key set. This method should not be called outside of a keyed context."); Preconditions.checkNotNull(namespace, "Provided namespace is null."); } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
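hk-lrzy's point can be reproduced without Flink at all. The following is a minimal, stdlib-only Java sketch (class and method names are hypothetical, not Flink's API) of the contract that precondition enforces: keyed state is only readable while the runtime has set a current key, which it does around each `processElement()` call and never during `open()`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a keyed function: the "runtime" sets the current
// key per record, and any state access without a key must fail fast.
public class KeyedStateSketch {
    private String currentKey;                          // set by the runtime per record
    private final Map<String, Long> state = new HashMap<>();

    void setCurrentKey(String key) { this.currentKey = key; }

    long getStateValue() {
        if (currentKey == null) {
            // Mirrors Flink's Preconditions.checkNotNull(key, "No key set. ...")
            throw new IllegalStateException(
                "No key set. This method should not be called outside of a keyed context.");
        }
        return state.getOrDefault(currentKey, 0L);
    }

    // open(): no key has been assigned yet, so any state access fails here
    void open() {
        try {
            getStateValue();
        } catch (IllegalStateException e) {
            System.out.println("open(): " + e.getMessage());
        }
    }

    // processElement(): the runtime keys the element first, so access succeeds
    void processElement(String key) {
        setCurrentKey(key);
        state.merge(key, 1L, Long::sum);
        System.out.println("processElement(" + key + "): count=" + getStateValue());
    }

    public static void main(String[] args) {
        KeyedStateSketch fn = new KeyedStateSketch();
        fn.open();                 // prints the "No key set" message
        fn.processElement("p1");   // count=1
        fn.processElement("p1");   // count=2
    }
}
```

This is why moving the state access from `open()` into `processElement()`, as proposed above, resolves the precondition failure.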
[GitHub] [hudi] hk-lrzy closed pull request #2604: [hudi-1639][hudi-flink] fix BucketAssigner npe
hk-lrzy closed pull request #2604: URL: https://github.com/apache/hudi/pull/2604
[hudi] branch asf-site updated: Travis CI build asf-site
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 1f101fe Travis CI build asf-site 1f101fe is described below commit 1f101fe71756e5210bcffd63d5c3e243829ce1cd Author: CI AuthorDate: Fri Feb 26 03:14:49 2021 + Travis CI build asf-site --- content/docs/powered_by.html | 4 1 file changed, 4 insertions(+) diff --git a/content/docs/powered_by.html b/content/docs/powered_by.html index 3f88d68..44b80df 100644 --- a/content/docs/powered_by.html +++ b/content/docs/powered_by.html @@ -482,6 +482,9 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA https://www.meetup.com/UberEvents/events/274924537/;>“Meetup talk by Nishith Agarwal” - Uber Data Platforms Meetup, Dec 2020 + +https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities;>“Apache Hudi learning series: Understanding Hudi internals By Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal” - Uber Meetup, Virtual, Feb 2021 + Articles @@ -505,6 +508,7 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA https://www.analyticsinsight.net/can-big-data-solutions-be-affordable/;>“Can Big Data Solutions Be Affordable?” https://www.alluxio.io/blog/building-high-performance-data-lake-using-apache-hudi-and-alluxio-at-t3go/;>“Building High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go” https://towardsdatascience.com/data-lake-change-data-capture-cdc-using-apache-hudi-on-amazon-emr-part-2-process-65e4662d7b4b;>“Data Lake Change Capture using Apache Hudi Amazon AMS/EMR Part 2” + https://eng.uber.com/apache-hudi-graduation/;>“Building a large scale transactional data lake at Uber using Apache Hudi” - Engineering Blog By Nishith Agarwal Powered by
[GitHub] [hudi] codejoyan commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?
codejoyan commented on issue #2592: URL: https://github.com/apache/hudi/issues/2592#issuecomment-786379571 This is the Spark version of the cluster being used at work, so I will have to use Spark 2.3 until there is an upgrade. Since the documentation says **Hudi works with Spark-2.x**, I was wondering whether there is some workaround for this.
[hudi] branch asf-site updated: [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links (#2602)
This is an automated email from the ASF dual-hosted git repository. lamberken pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new eeb146a [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links (#2602) eeb146a is described below commit eeb146a8fa3a368cd329de175e8c803c46116826 Author: n3nash AuthorDate: Thu Feb 25 19:06:06 2021 -0800 [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links (#2602) --- docs/_docs/1_4_powered_by.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/_docs/1_4_powered_by.md b/docs/_docs/1_4_powered_by.md index 6bac956..ada616e 100644 --- a/docs/_docs/1_4_powered_by.md +++ b/docs/_docs/1_4_powered_by.md @@ -145,6 +145,8 @@ Meanwhile, we build a set of data access standards based on Hudi, which provides 21. ["Meetup talk by Nishith Agarwal"](https://www.meetup.com/UberEvents/events/274924537/) - Uber Data Platforms Meetup, Dec 2020 +22. ["Apache Hudi learning series: Understanding Hudi internals By Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal"]("https://www.slideshare.net/NishithAgarwal3/hudi-architecture-fundamentals-and-capabilities;) - Uber Meetup, Virtual, Feb 2021 + ## Articles You can check out [our blog pages](https://hudi.apache.org/blog.html) for content written by our committers/contributors. @@ -165,6 +167,7 @@ You can check out [our blog pages](https://hudi.apache.org/blog.html) for conten 14. ["Can Big Data Solutions Be Affordable?"](https://www.analyticsinsight.net/can-big-data-solutions-be-affordable/) 15. ["Building High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go"](https://www.alluxio.io/blog/building-high-performance-data-lake-using-apache-hudi-and-alluxio-at-t3go/) 16. 
["Data Lake Change Capture using Apache Hudi & Amazon AMS/EMR Part 2"](https://towardsdatascience.com/data-lake-change-data-capture-cdc-using-apache-hudi-on-amazon-emr-part-2-process-65e4662d7b4b) +17. ["Building a large scale transactional data lake at Uber using Apache Hudi"](https://eng.uber.com/apache-hudi-graduation/) - Engineering Blog By Nishith Agarwal ## Powered by
[GitHub] [hudi] lamber-ken merged pull request #2602: [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links
lamber-ken merged pull request #2602: URL: https://github.com/apache/hudi/pull/2602
[GitHub] [hudi] lamber-ken edited a comment on pull request #2602: [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links
lamber-ken edited a comment on pull request #2602: URL: https://github.com/apache/hudi/pull/2602#issuecomment-786378892 Thanks @n3nash LGTM
[GitHub] [hudi] lamber-ken commented on pull request #2602: [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links
lamber-ken commented on pull request #2602: URL: https://github.com/apache/hudi/pull/2602#issuecomment-786378892 Thanks @n3nash
[GitHub] [hudi] lamber-ken commented on a change in pull request #2602: [HUDI 1642] Adding Hudi Learning series presentation & Uber eng blog links
lamber-ken commented on a change in pull request #2602: URL: https://github.com/apache/hudi/pull/2602#discussion_r583352619 ## File path: content/docs/0.5.3-powered_by.html ## @@ -462,13 +462,17 @@ Talks Presentations https://drive.google.com/open?id=1Pk_WdFxfEZxMMfAOn0R8-m3ALkcN6G9e;>“Building a near real-time, high-performance data warehouse based on Apache Hudi and Apache Kylin” - By ShaoFeng Shi March 2020, Apache Hudi Apache Kylin Online Meetup, China + +https://drive.google.com/file/d/1K-WsQAQf-C96pLRTpM4Ha1v7wRf1pZ9S/view?usp=sharing;>“Apache Hudi learning series: Understanding Hudi internals” - By Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal Feb 2021, Uber Meetup, Virtual Review comment: This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For Hoo…
codecov-io edited a comment on pull request #2596: URL: https://github.com/apache/hudi/pull/2596#issuecomment-784717451 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2596?src=pr=h1) Report > Merging [#2596](https://codecov.io/gh/apache/hudi/pull/2596?src=pr=desc) (c71fe74) into [master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc) (43a0776) will **increase** coverage by `18.41%`. > The diff coverage is `69.56%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2596/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2596?src=pr=tree) ```diff @@ Coverage Diff @@ ## master#2596 +/- ## = + Coverage 51.14% 69.55% +18.41% + Complexity 3215 363 -2852 = Files 438 53 -385 Lines 20041 1961-18080 Branches 2064 235 -1829 = - Hits 10250 1364 -8886 + Misses 8946 463 -8483 + Partials845 134 -711 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `?` | `?` | | | hudiclient | `?` | `?` | | | hudicommon | `?` | `?` | | | hudiflink | `?` | `?` | | | hudihadoopmr | `?` | `?` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `69.55% <69.56%> (+0.09%)` | `0.00 <0.00> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. 
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2596?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.00% <50.00%> (ø)` | `52.00 <0.00> (+2.00)` | | | [...udi/utilities/deltastreamer/BootstrapExecutor.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvQm9vdHN0cmFwRXhlY3V0b3IuamF2YQ==) | `82.35% <100.00%> (+2.80%)` | `6.00 <0.00> (ø)` | | | [...hudi/utilities/sources/helpers/KafkaOffsetGen.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvaGVscGVycy9LYWZrYU9mZnNldEdlbi5qYXZh) | `85.84% <0.00%> (-2.94%)` | `20.00% <0.00%> (+4.00%)` | :arrow_down: | | [...apache/hudi/timeline/service/handlers/Handler.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvaGFuZGxlcnMvSGFuZGxlci5qYXZh) | | | | | [.../spark/sql/hudi/streaming/HoodieStreamSource.scala](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvaHVkaS9zdHJlYW1pbmcvSG9vZGllU3RyZWFtU291cmNlLnNjYWxh) | | | | | [.../java/org/apache/hudi/common/util/HoodieTimer.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvSG9vZGllVGltZXIuamF2YQ==) | | | | | 
[...i/common/util/collection/ExternalSpillableMap.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvY29sbGVjdGlvbi9FeHRlcm5hbFNwaWxsYWJsZU1hcC5qYXZh) | | | | | [...g/apache/hudi/common/table/log/LogReaderUtils.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Mb2dSZWFkZXJVdGlscy5qYXZh) | | | | | [...n/java/org/apache/hudi/common/HoodieCleanStat.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL0hvb2RpZUNsZWFuU3RhdC5qYXZh) | | | | | [...del/OverwriteNonDefaultsWithLatestAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZU5vbkRlZmF1bHRzV2l0aExhdGVzdEF2cm9QYXlsb2FkLmphdmE=) | | | | | ... and [374 more](https://codecov.io/gh/apache/hudi/pull/2596/diff?src=pr=tree-more) | | This is an automated message from the Apache Git
[GitHub] [hudi] garyli1019 commented on issue #2498: [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]
garyli1019 commented on issue #2498: URL: https://github.com/apache/hudi/issues/2498#issuecomment-786364265 I am seeing the same problem when the compiled Spark distribution differs from the runtime Spark distribution. Compiling the Hudi jar against the runtime Spark distribution should fix this problem. @green2k @andormarkus
[GitHub] [hudi] garyli1019 commented on pull request #2584: [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read.
garyli1019 commented on pull request #2584: URL: https://github.com/apache/hudi/pull/2584#issuecomment-786361769 Hi @satishkotha, this PR seems related to #2583, would you take a look?
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For Hoo…
pengzhiwei2018 commented on a change in pull request #2596: URL: https://github.com/apache/hudi/pull/2596#discussion_r583334728 ## File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java ## @@ -106,10 +106,13 @@ public String createTable( throw new IllegalStateException("Table already existing in path : " + path); } -final HoodieTableType tableType = HoodieTableType.valueOf(tableTypeStr); -HoodieTableMetaClient.initTableType(HoodieCLI.conf, path, tableType, name, archiveFolder, -payloadClass, layoutVersion); - +new HoodieTableConfig.PropertyBuilder() Review comment: Sounds good! I will have a try.
[GitHub] [hudi] yanghua commented on a change in pull request #2596: [HUDI-1636] Support Builder Pattern To Build Table Properties For Hoo…
yanghua commented on a change in pull request #2596: URL: https://github.com/apache/hudi/pull/2596#discussion_r583313615 ## File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/TableCommand.java ## @@ -106,10 +106,13 @@ public String createTable( throw new IllegalStateException("Table already existing in path : " + path); } -final HoodieTableType tableType = HoodieTableType.valueOf(tableTypeStr); -HoodieTableMetaClient.initTableType(HoodieCLI.conf, path, tableType, name, archiveFolder, -payloadClass, layoutVersion); - +new HoodieTableConfig.PropertyBuilder() Review comment: WDYT about adding a static method named e.g. `HoodieTableConfig.propertyBuilder()` so that we can make the code more readable? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
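The static-factory suggestion is easy to see in a small, self-contained sketch (simplified hypothetical names; the real builder lives in `HoodieTableConfig` in hudi-common, and the property keys shown are standard Hudi table properties):

```java
import java.util.Properties;

// Hypothetical simplification of the reviewed class: a static factory
// propertyBuilder() reads more fluently at call sites than
// `new HoodieTableConfig.PropertyBuilder()`.
public class TableConfigSketch {
    static class PropertyBuilder {
        private final Properties props = new Properties();

        PropertyBuilder setTableType(String type) {
            props.setProperty("hoodie.table.type", type);
            return this;
        }

        PropertyBuilder setTableName(String name) {
            props.setProperty("hoodie.table.name", name);
            return this;
        }

        Properties build() { return props; }
    }

    // The static factory method proposed in the review comment
    static PropertyBuilder propertyBuilder() { return new PropertyBuilder(); }

    public static void main(String[] args) {
        Properties p = propertyBuilder()
            .setTableType("COPY_ON_WRITE")
            .setTableName("test_table")
            .build();
        System.out.println(p.getProperty("hoodie.table.name")); // prints test_table
    }
}
```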
[GitHub] [hudi] n3nash merged pull request #2565: [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap.
n3nash merged pull request #2565: URL: https://github.com/apache/hudi/pull/2565
[hudi] branch master updated: [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (#2565)
This is an automated email from the ASF dual-hosted git repository. nagarwal pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 022df0d [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (#2565) 022df0d is described below commit 022df0d1b134422f7b6f305cd7ec04b25caa23f0 Author: Prashant Wason AuthorDate: Thu Feb 25 16:52:28 2021 -0800 [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (#2565) --- .../metadata/HoodieBackedTableMetadataWriter.java | 6 ++ .../hudi/metadata/TestHoodieBackedMetadata.java | 19 +-- .../hudi/common/config/HoodieMetadataConfig.java | 15 +++ 3 files changed, 38 insertions(+), 2 deletions(-) diff --git a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java index 003ec7d..5aae7b7 100644 --- a/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java +++ b/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java @@ -318,6 +318,7 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta Map> partitionToFileStatus = new HashMap<>(); final int fileListingParallelism = metadataWriteConfig.getFileListingParallelism(); SerializableConfiguration conf = new SerializableConfiguration(datasetMetaClient.getHadoopConf()); +final String dirFilterRegex = datasetWriteConfig.getMetadataConfig().getDirectoryFilterRegex(); while (!pathsToList.isEmpty()) { int listingParallelism = Math.min(fileListingParallelism, pathsToList.size()); @@ -331,6 +332,11 @@ public abstract class HoodieBackedTableMetadataWriter implements HoodieTableMeta // If 
the listing reveals a directory, add it to queue. If the listing reveals a hoodie partition, add it to // the results. dirToFileListing.forEach(p -> { +if (!dirFilterRegex.isEmpty() && p.getLeft().getName().matches(dirFilterRegex)) { + LOG.info("Ignoring directory " + p.getLeft() + " which matches the filter regex " + dirFilterRegex); + return; +} + List filesInDir = Arrays.stream(p.getRight()).parallel() .filter(fs -> !fs.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE)) .collect(Collectors.toList()); diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java index 3697ec1..4fa0bc8 100644 --- a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java +++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java @@ -148,14 +148,22 @@ public class TestHoodieBackedMetadata extends HoodieClientTestHarness { final String nonPartitionDirectory = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[0] + "-nonpartition"; Files.createDirectories(Paths.get(basePath, nonPartitionDirectory)); +// Three directories which are partitions but will be ignored due to filter +final String filterDirRegex = ".*-filterDir\\d|\\..*"; +final String filteredDirectoryOne = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[0] + "-filterDir1"; +final String filteredDirectoryTwo = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[0] + "-filterDir2"; +final String filteredDirectoryThree = ".backups"; + // Create some commits HoodieTestTable testTable = HoodieTestTable.of(metaClient); -testTable.withPartitionMetaFiles("p1", "p2") +testTable.withPartitionMetaFiles("p1", "p2", filteredDirectoryOne, filteredDirectoryTwo, filteredDirectoryThree) .addCommit("001").withBaseFilesInPartition("p1", 10).withBaseFilesInPartition("p2", 10, 10) 
.addCommit("002").withBaseFilesInPartition("p1", 10).withBaseFilesInPartition("p2", 10, 10, 10) .addInflightCommit("003").withBaseFilesInPartition("p1", 10).withBaseFilesInPartition("p2", 10); -try (SparkRDDWriteClient client = new SparkRDDWriteClient(engineContext, getWriteConfig(true, true))) { +final HoodieWriteConfig writeConfig = getWriteConfigBuilder(true, true, false) + .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(true).withDirectoryFilterRegex(filterDirRegex).build()).build(); +try (SparkRDDWriteClient client = new SparkRDDWriteClient(engineContext, writeConfig)) { client.startCommitWithTime("005"); List partitions =
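The filter in the commit above uses `String.matches(...)` on each directory name, which anchors the regex over the whole name. A quick stdlib-only check of how the test's regex classifies a few directory names (the names are illustrative, modeled on the test fixtures):

```java
public class DirFilterSketch {
    public static void main(String[] args) {
        // Same pattern as filterDirRegex in TestHoodieBackedMetadata above
        final String dirFilterRegex = ".*-filterDir\\d|\\..*";
        String[] dirs = {"p1-filterDir1", "p1-filterDir2", ".backups", "p1", "p2-nonpartition"};
        for (String d : dirs) {
            // matches() requires the entire name to match one alternative:
            // either "<anything>-filterDir<digit>" or a dot-prefixed name
            System.out.println(d + " filtered=" + d.matches(dirFilterRegex));
        }
    }
}
```

Only `p1-filterDir1`, `p1-filterDir2`, and `.backups` are filtered; `p1` and `p2-nonpartition` still reach the metadata bootstrap listing.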
[hudi] branch master updated: Fixing README for hudi test suite long running job (#2578)
This is an automated email from the ASF dual-hosted git repository. nagarwal pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 9f5e8cc Fixing README for hudi test suite long running job (#2578) 9f5e8cc is described below commit 9f5e8cc7c3789fd3658f52f960fb5cbc8e4efce9 Author: Sivabalan Narayanan AuthorDate: Thu Feb 25 19:50:18 2021 -0500 Fixing README for hudi test suite long running job (#2578) --- hudi-integ-test/README.md | 92 +-- 1 file changed, 58 insertions(+), 34 deletions(-) diff --git a/hudi-integ-test/README.md b/hudi-integ-test/README.md index ff64ed1..06de263 100644 --- a/hudi-integ-test/README.md +++ b/hudi-integ-test/README.md @@ -270,20 +270,31 @@ spark-submit \ --compact-scheduling-minshare 1 ``` -For long running test suite, validation has to be done differently. Idea is to run same dag in a repeated manner. -Hence "ValidateDatasetNode" is introduced which will read entire input data and compare it with hudi contents both via -spark datasource and hive table via spark sql engine. +## Running long running test suite in Local Docker environment + +For long running test suite, validation has to be done differently. Idea is to run same dag in a repeated manner for +N iterations. Hence "ValidateDatasetNode" is introduced which will read entire input data and compare it with hudi +contents both via spark datasource and hive table via spark sql engine. Hive validation is configurable. If you have "ValidateDatasetNode" in your dag, do not replace hive jars as instructed above. Spark sql engine does not -go well w/ hive2* jars. So, after running docker setup, just copy test.properties and your dag of interest and you are -good to go ahead. +go well w/ hive2* jars. So, after running docker setup, follow the below steps. 
+``` +docker cp packaging/hudi-integ-test-bundle/target/hudi-integ-test-bundle-0.8.0-SNAPSHOT.jar adhoc-2:/opt/ +docker cp demo/config/test-suite/test.properties adhoc-2:/opt/ +``` +Also copy your dag of interest to adhoc-2:/opt/ +``` +docker cp demo/config/test-suite/complex-dag-cow.yaml adhoc-2:/opt/ +``` For repeated runs, two additional configs need to be set. "dag_rounds" and "dag_intermittent_delay_mins". -This means that your dag will be repeated for N times w/ a delay of Y mins between each round. +This means that your dag will be repeated for N times w/ a delay of Y mins between each round. Note: complex-dag-cow.yaml +already has all these configs set. So no changes required just to try it out. + +Also, ValidateDatasetNode can be configured in two ways. Either with "delete_input_data" set to true or without +setting the config. When "delete_input_data" is set for ValidateDatasetNode, once validation is complete, entire input +data will be deleted. So, suggestion is to use this ValidateDatasetNode as the last node in the dag with "delete_input_data". -Also, ValidateDatasetNode can be configured in two ways. Either with "delete_input_data: true" set or not set. -When "delete_input_data" is set for ValidateDatasetNode, once validation is complete, entire input data will be deleted. -So, suggestion is to use this ValidateDatasetNode as the last node in the dag with "delete_input_data". Example dag: ``` Insert @@ -294,7 +305,7 @@ Example dag: If above dag is run with "dag_rounds" = 10 and "dag_intermittent_delay_mins" = 10, then this dag will run for 10 times with 10 mins delay between every run. At the end of every run, records written as part of this round will be validated. At the end of each validation, all contents of input are deleted. -For eg: incase of above dag, +To illustrate each round ``` Round1: insert => inputPath/batch1 @@ -323,6 +334,12 @@ every cycle. Lets see an example where you don't set "delete_input_data" as part of Validation. 
``` + Insert + Upsert + ValidateDatasetNode +``` +Here is the illustration of each round +``` Round1: insert => inputPath/batch1 upsert -> inputPath/batch2 @@ -382,27 +399,14 @@ Above dag was just an example for illustration purposes. But you can make it com Validate w/o deleting Upsert Validate w/ deletion -``` -With this dag, you can set the two additional configs "dag_rounds" and "dag_intermittent_delay_mins" and have a long -running test suite. - ``` -dag_rounds: 1 -dag_intermittent_delay_mins: 10 -dag_content: -Insert -Upsert -Delete -Validate w/o deleting -Insert -Rollback -Validate w/o deleting -Upsert -Validate w/ deletion +Once you have copied the jar, test.properties and your dag to adhoc-2:/opt/, you can run the following command to execute +the test suite job. ``` - -Sample COW command with repeated runs. +docker exec -it adhoc-2 /bin/bash +``` +Sample
[GitHub] [hudi] n3nash merged pull request #2578: [MINOR] Fixing Hudi Test suite readme for long running job
n3nash merged pull request #2578: URL: https://github.com/apache/hudi/pull/2578
[jira] [Created] (HUDI-1642) Add Links to Uber engineering blog and meet up slides
Nishith Agarwal created HUDI-1642: - Summary: Add Links to Uber engineering blog and meet up slides Key: HUDI-1642 URL: https://issues.apache.org/jira/browse/HUDI-1642 Project: Apache Hudi Issue Type: Task Components: Docs Reporter: Nishith Agarwal Assignee: Nishith Agarwal -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] n3nash commented on a change in pull request #2602: Adding Hudi Learning series presentation & Uber eng blog links
n3nash commented on a change in pull request #2602: URL: https://github.com/apache/hudi/pull/2602#discussion_r583286829 ## File path: content/docs/0.5.3-powered_by.html ## @@ -462,13 +462,17 @@ Talks Presentations https://drive.google.com/open?id=1Pk_WdFxfEZxMMfAOn0R8-m3ALkcN6G9e;>“Building a near real-time, high-performance data warehouse based on Apache Hudi and Apache Kylin” - By ShaoFeng Shi March 2020, Apache Hudi Apache Kylin Online Meetup, China + +https://drive.google.com/file/d/1K-WsQAQf-C96pLRTpM4Ha1v7wRf1pZ9S/view?usp=sharing;>“Apache Hudi learning series: Understanding Hudi internals” - By Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal Feb 2021, Uber Meetup, Virtual Review comment: Thanks @lamber-ken for pointing that out. I've made those changes, can you take a look ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] afeldman1 closed issue #2399: [SUPPORT] Hudi deletes not being properly committed
afeldman1 closed issue #2399: URL: https://github.com/apache/hudi/issues/2399
[GitHub] [hudi] afeldman1 commented on issue #2399: [SUPPORT] Hudi deletes not being properly committed
afeldman1 commented on issue #2399: URL: https://github.com/apache/hudi/issues/2399#issuecomment-786204481 Apologies for the delayed response, and thank you to @bvaradar for the initial hint. The issue turned out to be caused not by the keys but by another one of the configuration properties. When the table was originally created, a field was specified for the "hoodie.datasource.write.precombine.field" property. When doing the delete operation, it did not seem to make sense to have to specify this field, so I left it out of the config. It seems that if the table was created with this property set, all future operations on it are required to specify the same value for this property. I'm not sure this requirement necessarily makes sense? In terms of @nsivabalan's questions: this was not related to Hudi versions. This was found before releasing this piece to production, but it seems to always act the same way.
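A minimal config sketch of the resolution described above (the record key and precombine field names here are illustrative, not taken from the issue): the delete write must repeat the precombine field the table was created with.

```properties
# Illustrative Hudi write configs for a delete operation; field names are hypothetical.
hoodie.datasource.write.operation=delete
hoodie.datasource.write.recordkey.field=id
# Must match the value the table was originally created with,
# even though a delete has no obvious use for it:
hoodie.datasource.write.precombine.field=updated_at
```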
[GitHub] [hudi] kpurella commented on issue #2240: [SUPPORT] Performance Issue : HUDI MOR ,UPSERT Job running forever
kpurella commented on issue #2240: URL: https://github.com/apache/hudi/issues/2240#issuecomment-786183675 @vinothchandar Sure, I will!
[GitHub] [hudi] toninis commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer
toninis commented on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-786183618 @vinothchandar Sorry I took so long to respond. It had worked and compiled successfully; I had probably missed something at the time. Thanks for your response.
[jira] [Updated] (HUDI-1641) Issue for Integrating Hudi with Kafka using Avro Schema
[ https://issues.apache.org/jira/browse/HUDI-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PRASHANT BHOSALE updated HUDI-1641: --- Description: I am trying to integrate Hudi with Kafka topic. Steps followed: # Created Kafka topic in Confluent with schema defined in schema registry. # Using kafka-avro-console-producer, I am trying to produce data. # Running Hudi Delta Streamer in continuous mode to consume the data. Infrastructure: # AWS EMR # Spark 2.4.4 # Hudi Utility (tried with 0.6.0 and 0.7.0) # Avro (tried avro-1.8.2, avro-1.9.2 and avro-1.10.0) {code:java} 21/02/24 13:02:08 ERROR TaskResultGetter: Exception while getting task result org.apache.spark.SparkException: Error reading attempting to read avro data -- encountered an unknown fingerprint: 103427103938146401, not sure what schema to use. This could happen if you registered additional schemas after starting your spark context. at org.apache.spark.serializer.GenericAvroSerializer$$anonfun$4.apply(GenericAvroSerializer.scala:141) at org.apache.spark.serializer.GenericAvroSerializer$$anonfun$4.apply(GenericAvroSerializer.scala:138) at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:79) at org.apache.spark.serializer.GenericAvroSerializer.deserializeDatum(GenericAvroSerializer.scala:137) at org.apache.spark.serializer.GenericAvroSerializer.read(GenericAvroSerializer.scala:162) at org.apache.spark.serializer.GenericAvroSerializer.read(GenericAvroSerializer.scala:47) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813) at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:371) at 
org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:88) at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:72) at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945) at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 21/02/24 13:02:08 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool 21/02/24 13:02:08 INFO YarnScheduler: Cancelling stage 1 21/02/24 13:02:08 INFO YarnScheduler: Killing all running tasks in stage 1: Stage cancelled 21/02/24 13:02:08 INFO DAGScheduler: ResultStage 1 (isEmpty at DeltaSync.java:380) failed in 1.415 s due to Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error reading attempting to read avro data -- encountered an unknown fingerprint: 103427103938146401, not sure what schema to use. This could happen if you registered additional schemas after starting your spark context. 21/02/24 13:02:08 INFO DAGScheduler: Job 5 failed: isEmpty at DeltaSync.java:380, took 1.422265 s 21/02/24 13:02:08 ERROR HoodieDeltaStreamer: Shutting down delta-sync due to exception org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error reading attempting to read avro data -- encountered an unknown fingerprint: 103427103938146401, not sure what schema to use. This could happen if you registered additional schemas after starting your spark context. 
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) at scala.Option.foreach(Option.scala:257) at
[GitHub] [hudi] t0il3ts0ap edited a comment on issue #2589: [SUPPORT] Issue with adding column while running deltastreamer with kafka source.
t0il3ts0ap edited a comment on issue #2589: URL: https://github.com/apache/hudi/issues/2589#issuecomment-786029402 @satishkotha Ran again on a fresh table, still the same issue. Spark submit:
```
spark-submit --master yarn \
  --packages org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-utilities-bundle_2.12:0.7.0 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.task.maxFailures=5 \
  --conf spark.rdd.compress=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.executor.heartbeatInterval=120s \
  --conf spark.network.timeout=600s \
  --conf spark.yarn.max.executor.failures=5 \
  --conf spark.sql.catalogImplementation=hive \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=/home/hadoop/log4j-spark.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=/home/hadoop/log4j-spark.properties" \
  --deploy-mode client \
  s3://navi-emr-poc/delta-streamer-test/jars/deltastreamer-addons-1.0-SNAPSHOT.jar \
  --enable-sync \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor \
  --hoodie-conf hoodie.parquet.compression.codec=snappy \
  --hoodie-conf partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.spark_avro_post_processor.enable=false \
  --hoodie-conf auto.offset.reset=latest \
  --hoodie-conf hoodie.avro.schema.validate=true \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
  --schemaprovider-class org.apache.hudi.utilities.schema.NullTargetSchemaRegistryProvider \
  --props s3://navi-emr-poc/delta-streamer-test/config/kafka-source-nonprod.properties \
  --transformer-class com.navi.transform.DebeziumTransformer \
  --continuous \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=https://dev-dataplatform-schema-registry.np.navi-tech.in/subjects/to_be_deleted_service.public.accounts-value/versions/latest \
  --hoodie-conf hoodie.datasource.hive_sync.database=to_be_deleted_service \
  --hoodie-conf hoodie.datasource.hive_sync.table=accounts \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id \
  --hoodie-conf hoodie.datasource.write.precombine.field=__lsn \
  --hoodie-conf hoodie.datasource.write.partitionpath.field='' \
  --hoodie-conf hoodie.deltastreamer.source.kafka.topic=to_be_deleted_service.public.accounts \
  --hoodie-conf group.id=delta-streamer-to_be_deleted_service-accounts \
  --source-ordering-field __lsn \
  --target-base-path s3://navi-emr-poc/raw-data/to_be_deleted_service/accounts \
  --target-table accounts
```
Transformer code:
```
public class DebeziumTransformer implements Transformer {

  public Dataset apply(JavaSparkContext javaSparkContext, SparkSession sparkSession,
      Dataset dataset, TypedProperties typedProperties) {
    Dataset transformedDataset = dataset
        .withColumn("__deleted", dataset.col("__deleted").cast(DataTypes.BooleanType))
        .withColumnRenamed("__deleted", "_hoodie_is_deleted")
        .drop("__op", "__source_ts_ms");
    log.info("TRANSFORMER SCHEMA STARTS");
    transformedDataset.printSchema();
    transformedDataset.show();
    log.info("TRANSFORMER SCHEMA ENDS");
    return transformedDataset;
  }
}
```
When I add the column, Debezium updates the schema registry instantaneously and new records start flowing. It's possible that DeltaStreamer receives records with the new schema before it even hits the schema registry. Attaching logs: [alog.txt](https://github.com/apache/hudi/files/6044367/alog.txt)
[jira] [Resolved] (HUDI-1269) Make whether the failure of connect hive affects hudi ingest process configurable
[ https://issues.apache.org/jira/browse/HUDI-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-1269. --- Fix Version/s: 0.8.0 Resolution: Fixed > Make whether the failure of connect hive affects hudi ingest process > configurable > - > > Key: HUDI-1269 > URL: https://issues.apache.org/jira/browse/HUDI-1269 > Project: Apache Hudi > Issue Type: New Feature > Components: Hive Integration >Reporter: wangxianghu#1 >Assignee: liujinhui >Priority: Minor > Labels: pull-request-available, user-support-issues > Fix For: 0.8.0 > > > Currently, In an ETL pipeline(eg, kafka -> hudi -> hive), If the process of > hudi to hive failed, the job is still running. > I think we can add a switch to control the job behavior(fail or keep running) > when kafka to hudi is ok, while hudi to hive failed, leave the choice to > user. since ingesting data to hudi and sync to hive is a complete task in > some scenes.
[hudi] branch master updated: [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (#2443)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 8c2197a [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (#2443) 8c2197a is described below commit 8c2197ae5e9c139e488a33f5a507b79bfa2f6f27 Author: liujinhui <965147...@qq.com> AuthorDate: Thu Feb 25 23:09:32 2021 +0800 [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (#2443) Co-authored-by: Sivabalan Narayanan --- .../main/java/org/apache/hudi/DataSourceUtils.java | 2 + .../scala/org/apache/hudi/DataSourceOptions.scala | 2 + .../org/apache/hudi/HoodieSparkSqlWriter.scala | 1 + .../java/org/apache/hudi/hive/HiveSyncConfig.java | 4 ++ .../java/org/apache/hudi/hive/HiveSyncTool.java| 74 +- .../org/apache/hudi/hive/TestHiveSyncTool.java | 22 +++ 6 files changed, 76 insertions(+), 29 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java index 18c51e3..632a155 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java +++ b/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java @@ -293,6 +293,8 @@ public class DataSourceUtils { DataSourceWriteOptions.DEFAULT_HIVE_USE_JDBC_OPT_VAL())); hiveSyncConfig.autoCreateDatabase = Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_AUTO_CREATE_DATABASE_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_AUTO_CREATE_DATABASE_OPT_KEY())); +hiveSyncConfig.ignoreExceptions = Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_IGNORE_EXCEPTIONS_OPT_KEY(), +DataSourceWriteOptions.DEFAULT_HIVE_IGNORE_EXCEPTIONS_OPT_KEY())); hiveSyncConfig.skipROSuffix = 
Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_SKIP_RO_SUFFIX(), DataSourceWriteOptions.DEFAULT_HIVE_SKIP_RO_SUFFIX_VAL())); hiveSyncConfig.supportTimestamp = Boolean.valueOf(props.getString(DataSourceWriteOptions.HIVE_SUPPORT_TIMESTAMP(), diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala index 965b35c..4b8e97c 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala @@ -347,6 +347,7 @@ object DataSourceWriteOptions { val HIVE_USE_PRE_APACHE_INPUT_FORMAT_OPT_KEY = "hoodie.datasource.hive_sync.use_pre_apache_input_format" val HIVE_USE_JDBC_OPT_KEY = "hoodie.datasource.hive_sync.use_jdbc" val HIVE_AUTO_CREATE_DATABASE_OPT_KEY = "hoodie.datasource.hive_sync.auto_create_database" + val HIVE_IGNORE_EXCEPTIONS_OPT_KEY = "hoodie.datasource.hive_sync.ignore_exceptions" val HIVE_SKIP_RO_SUFFIX = "hoodie.datasource.hive_sync.skip_ro_suffix" val HIVE_SUPPORT_TIMESTAMP = "hoodie.datasource.hive_sync.support_timestamp" @@ -365,6 +366,7 @@ object DataSourceWriteOptions { val DEFAULT_USE_PRE_APACHE_INPUT_FORMAT_OPT_VAL = "false" val DEFAULT_HIVE_USE_JDBC_OPT_VAL = "true" val DEFAULT_HIVE_AUTO_CREATE_DATABASE_OPT_KEY = "true" + val DEFAULT_HIVE_IGNORE_EXCEPTIONS_OPT_KEY = "false" val DEFAULT_HIVE_SKIP_RO_SUFFIX_VAL = "false" val DEFAULT_HIVE_SUPPORT_TIMESTAMP = "false" diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala index f5ba6c8..ef28191 100644 --- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala +++ 
b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala @@ -374,6 +374,7 @@ private[hudi] object HoodieSparkSqlWriter { hiveSyncConfig.useJdbc = parameters(HIVE_USE_JDBC_OPT_KEY).toBoolean hiveSyncConfig.useFileListingFromMetadata = parameters(HoodieMetadataConfig.METADATA_ENABLE_PROP).toBoolean hiveSyncConfig.verifyMetadataFileListing = parameters(HoodieMetadataConfig.METADATA_VALIDATE_PROP).toBoolean +hiveSyncConfig.ignoreExceptions = parameters.get(HIVE_IGNORE_EXCEPTIONS_OPT_KEY).exists(r => r.toBoolean) hiveSyncConfig.supportTimestamp = parameters.get(HIVE_SUPPORT_TIMESTAMP).exists(r => r.toBoolean) hiveSyncConfig.autoCreateDatabase = parameters.get(HIVE_AUTO_CREATE_DATABASE_OPT_KEY).exists(r => r.toBoolean) hiveSyncConfig.decodePartition = parameters.getOrElse(URL_ENCODE_PARTITIONING_OPT_KEY, diff --git
[jira] [Updated] (HUDI-1269) Make whether the failure of connect hive affects hudi ingest process configurable
[ https://issues.apache.org/jira/browse/HUDI-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1269: -- Status: Open (was: New) > Make whether the failure of connect hive affects hudi ingest process > configurable > - > > Key: HUDI-1269 > URL: https://issues.apache.org/jira/browse/HUDI-1269 > Project: Apache Hudi > Issue Type: New Feature > Components: Hive Integration >Reporter: wangxianghu#1 >Assignee: liujinhui >Priority: Minor > Labels: pull-request-available, user-support-issues > > Currently, In an ETL pipeline(eg, kafka -> hudi -> hive), If the process of > hudi to hive failed, the job is still running. > I think we can add a switch to control the job behavior(fail or keep running) > when kafka to hudi is ok, while hudi to hive failed, leave the choice to > user. since ingesting data to hudi and sync to hive is a complete task in > some scenes.
[jira] [Updated] (HUDI-1269) Make whether the failure of connect hive affects hudi ingest process configurable
[ https://issues.apache.org/jira/browse/HUDI-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1269: -- Status: In Progress (was: Open) > Make whether the failure of connect hive affects hudi ingest process > configurable > - > > Key: HUDI-1269 > URL: https://issues.apache.org/jira/browse/HUDI-1269 > Project: Apache Hudi > Issue Type: New Feature > Components: Hive Integration >Reporter: wangxianghu#1 >Assignee: liujinhui >Priority: Minor > Labels: pull-request-available, user-support-issues > > Currently, In an ETL pipeline(eg, kafka -> hudi -> hive), If the process of > hudi to hive failed, the job is still running. > I think we can add a switch to control the job behavior(fail or keep running) > when kafka to hudi is ok, while hudi to hive failed, leave the choice to > user. since ingesting data to hudi and sync to hive is a complete task in > some scenes.
[GitHub] [hudi] nsivabalan merged pull request #2443: [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable
nsivabalan merged pull request #2443: URL: https://github.com/apache/hudi/pull/2443
[GitHub] [hudi] Rap70r commented on issue #2586: [SUPPORT] - How to guarantee snapshot isolation when reading Hudi tables in S3?
Rap70r commented on issue #2586: URL: https://github.com/apache/hudi/issues/2586#issuecomment-785941366 Hi nsivabalan, Thank you for your reply. * Incremental updates include both inserts and updates, mostly updates. * We can try increasing the retention version to a higher value to improve reader times. * We would prefer sticking with COPY_ON_WRITE for now. I was wondering if we should look into table caching in Spark: https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html As this would cache the entire table into disk/memory and we would work with that. The only downside I can think of is space. Are there any other disadvantages to using cache and persist? Also, we're looking into improving reader speed in combination with increasing the retention version value. When reading an S3 Hudi dataset, does the number of partitions affect the speed of readers? For example, if the table is partitioned into 200 folders or 1000 folders, by choosing different columns, would that affect the speed when reading the table with a snapshot query: https://hudi.apache.org/docs/querying_data.html#spark-snap-query Thank you
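For reference, the Spark SQL caching linked above can be applied to a Hudi table like any other registered table (the table name here is hypothetical; `storageLevel` controls the memory/disk trade-off mentioned):

```sql
-- Hypothetical Hudi table name; caches snapshot reads with spill to disk
CACHE TABLE hudi_trips OPTIONS ('storageLevel' 'MEMORY_AND_DISK');

-- Release the cached data when finished
UNCACHE TABLE hudi_trips;
```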
[GitHub] [hudi] rubenssoto commented on issue #2588: [SUPPORT] Cannot create hive connection
rubenssoto commented on issue #2588: URL: https://github.com/apache/hudi/issues/2588#issuecomment-785933427 @bvaradar I think it is a Hive issue; I'm trying to increase the Hive heap size, and I hope it helps. I process the tables in threads, so I have almost 20 Hive connections open. Do you have any experience with Hudi and Hive? Because Hudi probably executes simple queries to verify the table schema
[GitHub] [hudi] bvaradar commented on issue #2448: [SUPPORT] deltacommit for client 172.16.116.102 already exists
bvaradar commented on issue #2448: URL: https://github.com/apache/hudi/issues/2448#issuecomment-785932137 @root18039532923 : Please look at https://hudi.apache.org/blog/async-compaction-deployment-model/ for running async compactions
[GitHub] [hudi] bvaradar commented on issue #2555: [SUPPORT] Trying and Understanding Clustering
bvaradar commented on issue #2555: URL: https://github.com/apache/hudi/issues/2555#issuecomment-785930358 For bulk insert, you need to size the parallelism to control file sizing. Please see https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-Whatperformance/ingestlatencycanIexpectforHudiwriting
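As a hedged sketch of that sizing advice (values are illustrative, not recommendations), the shuffle parallelism knobs that control how many files a write produces look like this:

```properties
# Illustrative values -- tune to your data volume and target file size.
# bulk_insert writes roughly one file per shuffle partition:
hoodie.bulkinsert.shuffle.parallelism=200
hoodie.upsert.shuffle.parallelism=200
hoodie.insert.shuffle.parallelism=200
```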
[GitHub] [hudi] bvaradar commented on issue #2588: [SUPPORT] Cannot create hive connection
bvaradar commented on issue #2588: URL: https://github.com/apache/hudi/issues/2588#issuecomment-785928870 @rubenssoto : The stack trace does not contain Hudi in it, so I don't know how to help in this regard. Regarding high CPU load on the Hive server: are you also running Hive queries apart from the HMS integration? @n3nash @nsivabalan : Any other ideas?
[jira] [Created] (HUDI-1641) Issue for Integrating Hudi with Kafka using Avro Schema
PRASHANT BHOSALE created HUDI-1641: -- Summary: Issue for Integrating Hudi with Kafka using Avro Schema Key: HUDI-1641 URL: https://issues.apache.org/jira/browse/HUDI-1641 Project: Apache Hudi Issue Type: Bug Components: DeltaStreamer, Spark Integration, Utilities Reporter: PRASHANT BHOSALE Fix For: 0.7.0 I am trying to integrate Hudi with Kafka topic. Steps followed: # Created Kafka topic in Confluent with schema defined. # Using kafka-avro-console-producer, I am trying to produce data. # I am running Hudi Delta Streamer in continuous mode. I am getting the below error: {code:java} 21/02/25 13:48:14 ERROR TaskResultGetter: Exception while getting task result 21/02/25 13:48:14 ERROR TaskResultGetter: Exception while getting task result org.apache.spark.SparkException: Error reading attempting to read avro data -- encountered an unknown fingerprint: 103427103938146401, not sure what schema to use. This could happen if you registered additional schemas after starting your spark context. at org.apache.spark.serializer.GenericAvroSerializer$$anonfun$4.apply(GenericAvroSerializer.scala:141) at org.apache.spark.serializer.GenericAvroSerializer$$anonfun$4.apply(GenericAvroSerializer.scala:138) at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:79) at org.apache.spark.serializer.GenericAvroSerializer.deserializeDatum(GenericAvroSerializer.scala:137) at org.apache.spark.serializer.GenericAvroSerializer.read(GenericAvroSerializer.scala:162) at org.apache.spark.serializer.GenericAvroSerializer.read(GenericAvroSerializer.scala:47) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:391) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:302) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813) at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:371) at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:88) at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:72) at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945) at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)21/02/25 13:48:14 INFO YarnScheduler: Removed TaskSet 14.0, whose tasks have all completed, from pool 21/02/25 13:48:14 INFO YarnScheduler: Cancelling stage 1421/02/25 13:48:14 INFO YarnScheduler: Killing all running tasks in stage 14: Stage cancelled21/02/25 13:48:14 INFO DAGScheduler: ResultStage 14 (isEmpty at DeltaSync.java:380) failed in 0.696 s due to Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error reading attempting to read avro data -- encountered an unknown fingerprint: 103427103938146401, not sure what schema to use. This could happen if you registered additional schemas after starting your spark context.21/02/25 13:48:14 INFO DAGScheduler: Job 8 failed: isEmpty at DeltaSync.java:380, took 0.704193 s21/02/25 13:48:14 ERROR HoodieDeltaStreamer: Shutting down delta-sync due to exceptionorg.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.SparkException: Error reading attempting to read avro data -- encountered an unknown fingerprint: 103427103938146401, not sure what schema to use. 
This could happen if you registered additional schemas after starting your spark context. at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966) at
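The "unknown fingerprint" failure above comes from Spark's Kryo-based `GenericAvroSerializer`: schemas registered on the `SparkConf` (via `registerAvroSchemas`) before the context starts are keyed by a 64-bit fingerprint, and serialized records carry only that fingerprint; if the driver's cache has no entry for it, deserialization fails with exactly this message. A toy model of that lookup (class and method names are illustrative, not Spark's actual internals):

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of Spark's GenericAvroSerializer fingerprint cache (names are illustrative). */
public class FingerprintCache {
  private final Map<Long, String> registered = new HashMap<>();

  // Populated from schemas passed to SparkConf.registerAvroSchemas(...) at startup.
  public void register(long fingerprint, String schemaJson) {
    registered.put(fingerprint, schemaJson);
  }

  // Deserialization path: only the fingerprint travels with the data.
  public String resolve(long fingerprint) {
    String schema = registered.get(fingerprint);
    if (schema == null) {
      throw new IllegalStateException(
          "encountered an unknown fingerprint: " + fingerprint + ", not sure what schema to use");
    }
    return schema;
  }
}
```

The practical takeaway, as the error text hints, is to register every Avro schema (including ones fetched from a schema registry) on the `SparkConf` before the `SparkContext` is created.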
[GitHub] [hudi] bvaradar commented on a change in pull request #2520: [HUDI-1446] Support skip bootstrapIndex's init in abstract fs view init
bvaradar commented on a change in pull request #2520: URL: https://github.com/apache/hudi/pull/2520#discussion_r582853801 ## File path: hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/DefaultBootstrapIndex.java ## @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.bootstrap.index; + +import org.apache.hudi.common.model.BootstrapFileMapping; +import org.apache.hudi.common.table.HoodieTableMetaClient; + +import java.util.List; + +/** + * Default Bootstrap Index , which is a emtpy implement and not do anything. + */ +public class DefaultBootstrapIndex extends BootstrapIndex { Review comment: Rename to NoOpBootstrapIndex This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bvaradar commented on issue #2605: [SUPPORT] How to reload a writeConfig from a existed hudi path ?
bvaradar commented on issue #2605: URL: https://github.com/apache/hudi/issues/2605#issuecomment-785906565 Ideally, the Spark data source should provide that option (e.g. an `optionFromFile(...)` method). Not sure if there is anything like that. Created: https://issues.apache.org/jira/browse/HUDI-1640?filter=-2
[jira] [Commented] (HUDI-1640) Implement Spark Datasource option to read hudi configs from properties file
[ https://issues.apache.org/jira/browse/HUDI-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290922#comment-17290922 ] Balaji Varadarajan commented on HUDI-1640: -- [~shivnarayan]: Can you vet this and add to the work queue? > Implement Spark Datasource option to read hudi configs from properties file > --- > > Key: HUDI-1640 > URL: https://issues.apache.org/jira/browse/HUDI-1640 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: Balaji Varadarajan >Priority: Major > > Provide config option like "hoodie.datasource.props.file" to load all the > options from a file. > > GH: https://github.com/apache/hudi/issues/2605 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1640) Implement Spark Datasource option to read hudi configs from properties file
Balaji Varadarajan created HUDI-1640: Summary: Implement Spark Datasource option to read hudi configs from properties file Key: HUDI-1640 URL: https://issues.apache.org/jira/browse/HUDI-1640 Project: Apache Hudi Issue Type: Improvement Components: Spark Integration Reporter: Balaji Varadarajan Provide config option like "hoodie.datasource.props.file" to load all the options from a file. GH: https://github.com/apache/hudi/issues/2605
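The proposed option would simply load key/value pairs from a properties file and merge them into the datasource write options. A sketch of what that loading step could look like, assuming standard `java.util.Properties` format (note that "hoodie.datasource.props.file" is only a proposed name in this ticket, not an existing config, and the helper class below is hypothetical):

```java
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper: read Hudi write options from a properties file. */
public final class PropsFileOptions {
  private PropsFileOptions() {}

  public static Map<String, String> load(Reader source) throws IOException {
    Properties props = new Properties();
    props.load(source);
    // Copy into a plain String map, the shape DataFrameWriter.options(...) expects.
    Map<String, String> options = new HashMap<>();
    for (String name : props.stringPropertyNames()) {
      options.put(name, props.getProperty(name));
    }
    return options;
  }
}
```

The resulting map would be passed to `DataFrameWriter.options(...)`, with any options set explicitly in code presumably taking precedence over those from the file.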
[GitHub] [hudi] bvaradar commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?
bvaradar commented on issue #2592: URL: https://github.com/apache/hudi/issues/2592#issuecomment-785896778 I was unable to setup spark-2.3.0 in my setup. But,with spark-2.4.4, this works fine as below. Can you use spark-2.4.x version. spark-2.3 seems too old though ? `21/02/25 05:14:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context Web UI available at http://ip-192-168-1-81.ec2.internal:4040 Spark context available as 'sc' (master = local[*], app id = local-1614258873363). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.4 /_/ Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251) Type in expressions to have them evaluated. Type :help for more information. scala> import org.apache.hudi.DataSourceWriteOptions import org.apache.hudi.DataSourceWriteOptions scala> import org.apache.hudi.QuickstartUtils.getQuickstartWriteConfigs import org.apache.hudi.QuickstartUtils.getQuickstartWriteConfigs scala> import org.apache.hudi.config.{HoodieCompactionConfig, HoodieStorageConfig, HoodieWriteConfig} import org.apache.hudi.config.{HoodieCompactionConfig, HoodieStorageConfig, HoodieWriteConfig} scala> import org.apache.spark.sql.{SaveMode, SparkSession} import org.apache.spark.sql.{SaveMode, SparkSession} scala> import org.apache.spark.sql.functions.{col, concat, lit} import org.apache.spark.sql.functions.{col, concat, lit} scala> scala> val inputDF = spark.read.format("csv").option("header", "true").load("file:///tmp/input.csv") inputDF: org.apache.spark.sql.DataFrame = [col_1: string, col_2: string ... 
12 more fields] scala> val formattedDF = inputDF.selectExpr("col_1", "cast(col_2 as integer) col_2", | "cast(col_3 as short) col_3", "col_4", "col_5", "cast(col_6 as byte) col_6", "cast(col_7 as decimal(9,2)) col_7", | "cast(col_8 as decimal(9,2)) col_8", "cast(col_9 as timestamp) col_9", "col_10", "cast(col_11 as timestamp) col_11", | "col_12", "cntry_cd", "cast(bus_dt as date) bus_dt") formattedDF: org.apache.spark.sql.DataFrame = [col_1: string, col_2: int ... 12 more fields] scala> formattedDF.printSchema() root |-- col_1: string (nullable = true) |-- col_2: integer (nullable = true) |-- col_3: short (nullable = true) |-- col_4: string (nullable = true) |-- col_5: string (nullable = true) |-- col_6: byte (nullable = true) |-- col_7: decimal(9,2) (nullable = true) |-- col_8: decimal(9,2) (nullable = true) |-- col_9: timestamp (nullable = true) |-- col_10: string (nullable = true) |-- col_11: timestamp (nullable = true) |-- col_12: string (nullable = true) |-- cntry_cd: string (nullable = true) |-- bus_dt: date (nullable = true) scala> formattedDF.show ++-+-+-+-+-+-+-++---++---++--+ | col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8| col_9| col_10| col_11| col_12|cntry_cd|bus_dt| ++-+-+-+-+-+-+-++---++---++--+ |7IN00716079317820...| 716|3| AB| INR| null| 0.00| 1.00|2021-02-14 20:23:...|useridjsb91|2021-02-14 20:23:...|useridjsb91| IN|2021-02-01| |7IN00716079317820...| 716|2| AB| INR| null| 0.00| 1.00|2021-02-14 20:23:...|useridjsb91|2021-02-14 20:23:...|useridjsb91| IN|2021-02-01| |7IN00716079317820...| 716|1| AB| INR| null| 0.00| 1.00|2021-02-14 20:23:...|useridjsb91|2021-02-14 20:23:...|useridjsb91| IN|2021-02-01| |AU700716079381819...| 5700|5| AB| INR| null| 0.00| 1.00|2021-02-14 20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91| IN|2021-02-02| |AU700716079381819...| 5700|6| AB| INR| null| 4.00| 1.97|2021-02-14 20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91| IN|2021-02-02| |AU700716079381819...| 5700|4| AB| INR| null| 0.00| 1.00|2021-02-14 
20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91| AU|2021-02-01| |AU700716079381819...| 5700|3| AB| INR| null| 0.00| 1.00|2021-02-14 20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91| AU|2021-02-01| |AU700716079381819...| 5700|1| AB| INR| null| 0.00| 1.00|2021-02-14 20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|
[GitHub] [hudi] codecov-io edited a comment on pull request #2443: [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable
codecov-io edited a comment on pull request #2443: URL: https://github.com/apache/hudi/pull/2443#issuecomment-760147630 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2443?src=pr=h1) Report > Merging [#2443](https://codecov.io/gh/apache/hudi/pull/2443?src=pr=desc) (7baf5de) into [master](https://codecov.io/gh/apache/hudi/commit/b0010bf3b449a9d2e01955b0746b795e22e577db?el=desc) (b0010bf) will **increase** coverage by `0.11%`. > The diff coverage is `76.66%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2443/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2443?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master    #2443      +/-   ##
+ Coverage     51.15%   51.26%    +0.11%
- Complexity     3212     3234       +22
  Files           436      438        +2
  Lines         19987    20107      +120
  Branches       2057     2073       +16
+ Hits          10224    10308       +84
- Misses         8922     8947       +25
- Partials        841      852       +11
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `36.87% <ø> (-0.04%)` | `0.00 <ø> (ø)` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `51.35% <ø> (-0.04%)` | `0.00 <ø> (ø)` | |
| hudiflink | `46.85% <ø> (+1.41%)` | `0.00 <ø> (ø)` | |
| hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudisparkdatasource | `69.71% <0.00%> (-0.03%)` | `0.00 <0.00> (ø)` | |
| hudisync | `49.62% <79.31%> (+1.00%)` | `0.00 <8.00> (ø)` | |
| huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudiutilities | `69.46% <ø> (+0.07%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2443?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVNwYXJrU3FsV3JpdGVyLnNjYWxh) | `49.82% <0.00%> (+<0.01%)` | `0.00 <0.00> (ø)` | | | [...c/main/java/org/apache/hudi/hive/HiveSyncTool.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZVN5bmNUb29sLmphdmE=) | `70.37% <77.77%> (+1.29%)` | `17.00 <8.00> (+3.00)` | | | [...main/java/org/apache/hudi/hive/HiveSyncConfig.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZVN5bmNDb25maWcuamF2YQ==) | `97.72% <100.00%> (+0.10%)` | `2.00 <0.00> (ø)` | | | [...java/org/apache/hudi/common/util/CleanerUtils.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQ2xlYW5lclV0aWxzLmphdmE=) | `47.72% <0.00%> (-15.91%)` | `6.00% <0.00%> (ø%)` | | | [...udi/operator/partitioner/BucketAssignFunction.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9wYXJ0aXRpb25lci9CdWNrZXRBc3NpZ25GdW5jdGlvbi5qYXZh) | `84.78% <0.00%> (-7.53%)` | `21.00% <0.00%> (+13.00%)` | :arrow_down: | | [...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==) | `78.12% <0.00%> (-1.57%)` | `26.00% <0.00%> (ø%)` | | | 
[...c/main/java/org/apache/hudi/common/fs/FSUtils.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0ZTVXRpbHMuamF2YQ==) | `49.31% <0.00%> (-0.46%)` | `61.00% <0.00%> (ø%)` | | | [...rg/apache/hudi/cli/commands/SavepointsCommand.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1NhdmVwb2ludHNDb21tYW5kLmphdmE=) | `13.84% <0.00%> (-0.44%)` | `3.00% <0.00%> (ø%)` | | | [...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/hudi/pull/2443/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==) | `55.66% <0.00%> (-0.44%)` | `38.00% <0.00%> (ø%)` | | |
[GitHub] [hudi] codecov-io edited a comment on pull request #2378: [HUDI-1491] Support partition pruning for MOR snapshot query
codecov-io edited a comment on pull request #2378: URL: https://github.com/apache/hudi/pull/2378#issuecomment-751218636 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2378?src=pr=h1) Report > Merging [#2378](https://codecov.io/gh/apache/hudi/pull/2378?src=pr=desc) (ef2107f) into [master](https://codecov.io/gh/apache/hudi/commit/97864a48c1979ca2b3f0579cf26bba81fba7e46c?el=desc) (97864a4) will **decrease** coverage by `41.57%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2378/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2378?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master    #2378      +/-   ##
- Coverage     51.19%    9.61%   -41.58%
+ Complexity     3226       48     -3178
  Files           438       53      -385
  Lines         20089     1944    -18145
  Branches       2068      235     -1833
- Hits          10285      187    -10098
+ Misses         8958     1744     -7214
+ Partials        846       13      -833
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.61% <ø> (-59.85%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2378?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2378/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>
[jira] [Resolved] (HUDI-1367) Make delastreamer transition from dfsSouce to kafkasouce
[ https://issues.apache.org/jira/browse/HUDI-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-1367. --- Fix Version/s: 0.8.0 Resolution: Fixed > Make delastreamer transition from dfsSouce to kafkasouce > > > Key: HUDI-1367 > URL: https://issues.apache.org/jira/browse/HUDI-1367 > Project: Apache Hudi > Issue Type: Improvement >Reporter: liujinhui >Assignee: liujinhui >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0 > > > Currently, after using dfsSouce to write hudi, if you want to use kafkasouce > to continue writing hudi, you need to specify the kafka checkpoint value. I > will make the program automatically get the latest or earliest offect
[jira] [Reopened] (HUDI-1367) Make delastreamer transition from dfsSouce to kafkasouce
[ https://issues.apache.org/jira/browse/HUDI-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reopened HUDI-1367: --- > Make delastreamer transition from dfsSouce to kafkasouce > > > Key: HUDI-1367 > URL: https://issues.apache.org/jira/browse/HUDI-1367 > Project: Apache Hudi > Issue Type: Improvement >Reporter: liujinhui >Assignee: liujinhui >Priority: Major > Labels: pull-request-available > > Currently, after using dfsSouce to write hudi, if you want to use kafkasouce > to continue writing hudi, you need to specify the kafka checkpoint value. I > will make the program automatically get the latest or earliest offect
[jira] [Updated] (HUDI-1367) Make delastreamer transition from dfsSouce to kafkasouce
[ https://issues.apache.org/jira/browse/HUDI-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1367: -- Status: Patch Available (was: In Progress) > Make delastreamer transition from dfsSouce to kafkasouce > > > Key: HUDI-1367 > URL: https://issues.apache.org/jira/browse/HUDI-1367 > Project: Apache Hudi > Issue Type: Improvement >Reporter: liujinhui >Assignee: liujinhui >Priority: Major > Labels: pull-request-available > > Currently, after using dfsSouce to write hudi, if you want to use kafkasouce > to continue writing hudi, you need to specify the kafka checkpoint value. I > will make the program automatically get the latest or earliest offect
[jira] [Updated] (HUDI-1367) Make delastreamer transition from dfsSouce to kafkasouce
[ https://issues.apache.org/jira/browse/HUDI-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1367: -- Status: Closed (was: Patch Available) > Make delastreamer transition from dfsSouce to kafkasouce > > > Key: HUDI-1367 > URL: https://issues.apache.org/jira/browse/HUDI-1367 > Project: Apache Hudi > Issue Type: Improvement >Reporter: liujinhui >Assignee: liujinhui >Priority: Major > Labels: pull-request-available > > Currently, after using dfsSouce to write hudi, if you want to use kafkasouce > to continue writing hudi, you need to specify the kafka checkpoint value. I > will make the program automatically get the latest or earliest offect
[hudi] branch master updated: [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (#2227)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 617cc24 [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (#2227) 617cc24 is described below commit 617cc24ad1a28196b872df5663e9e0f48cd7f0fa Author: liujinhui <965147...@qq.com> AuthorDate: Thu Feb 25 20:08:13 2021 +0800 [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (#2227) Co-authored-by: Sivabalan Narayanan --- .../utilities/sources/helpers/KafkaOffsetGen.java | 52 ++- .../functional/TestHoodieDeltaStreamer.java| 151 +++-- .../TestHoodieMultiTableDeltaStreamer.java | 21 +-- .../hudi/utilities/sources/TestKafkaSource.java| 2 +- .../utilities/testutils/UtilitiesTestBase.java | 17 ++- 5 files changed, 213 insertions(+), 30 deletions(-) diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java index fc7ba79..e37ec0a 100644 --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java @@ -21,6 +21,7 @@ package org.apache.hudi.utilities.sources.helpers; import org.apache.hudi.DataSourceUtils; import org.apache.hudi.common.config.TypedProperties; import org.apache.hudi.common.util.Option; +import org.apache.hudi.exception.HoodieDeltaStreamerException; import org.apache.hudi.exception.HoodieException; import org.apache.hudi.exception.HoodieNotSupportedException; @@ -40,6 +41,8 @@ import java.util.HashSet; import java.util.List; import java.util.Map; import java.util.Set; +import java.util.regex.Matcher; +import java.util.regex.Pattern; import java.util.stream.Collectors; /** @@ -49,6 +52,12 @@ public class KafkaOffsetGen { 
private static final Logger LOG = LogManager.getLogger(KafkaOffsetGen.class); + /** + * kafka checkpoint Pattern. + * Format: topic_name,partition_num:offset,partition_num:offset, + */ + private final Pattern pattern = Pattern.compile(".*,.*:.*"); + public static class CheckpointUtils { /** @@ -148,7 +157,8 @@ public class KafkaOffsetGen { private static final String KAFKA_TOPIC_NAME = "hoodie.deltastreamer.source.kafka.topic"; private static final String MAX_EVENTS_FROM_KAFKA_SOURCE_PROP = "hoodie.deltastreamer.kafka.source.maxEvents"; -private static final KafkaResetOffsetStrategies DEFAULT_AUTO_RESET_OFFSET = KafkaResetOffsetStrategies.LATEST; +private static final String KAFKA_AUTO_RESET_OFFSETS = "hoodie.deltastreamer.source.kafka.auto.reset.offsets"; +private static final KafkaResetOffsetStrategies DEFAULT_KAFKA_AUTO_RESET_OFFSETS = KafkaResetOffsetStrategies.LATEST; public static final long DEFAULT_MAX_EVENTS_FROM_KAFKA_SOURCE = 500; public static long maxEventsFromKafkaSource = DEFAULT_MAX_EVENTS_FROM_KAFKA_SOURCE; } @@ -156,15 +166,29 @@ public class KafkaOffsetGen { private final HashMap kafkaParams; private final TypedProperties props; protected final String topicName; + private KafkaResetOffsetStrategies autoResetValue; public KafkaOffsetGen(TypedProperties props) { this.props = props; + kafkaParams = new HashMap<>(); for (Object prop : props.keySet()) { kafkaParams.put(prop.toString(), props.get(prop.toString())); } DataSourceUtils.checkRequiredProperties(props, Collections.singletonList(Config.KAFKA_TOPIC_NAME)); topicName = props.getString(Config.KAFKA_TOPIC_NAME); +String kafkaAutoResetOffsetsStr = props.getString(Config.KAFKA_AUTO_RESET_OFFSETS, Config.DEFAULT_KAFKA_AUTO_RESET_OFFSETS.name()); +boolean found = false; +for (KafkaResetOffsetStrategies entry: KafkaResetOffsetStrategies.values()) { + if (entry.name().toLowerCase().equals(kafkaAutoResetOffsetsStr)) { +found = true; +autoResetValue = entry; +break; + } +} +if (!found) { + throw new 
HoodieDeltaStreamerException(Config.KAFKA_AUTO_RESET_OFFSETS + " config set to unknown value " + kafkaAutoResetOffsetsStr); +} } public OffsetRange[] getNextOffsetRanges(Option lastCheckpointStr, long sourceLimit, HoodieDeltaStreamerMetrics metrics) { @@ -186,8 +210,6 @@ public class KafkaOffsetGen { fromOffsets = checkupValidOffsets(consumer, lastCheckpointStr, topicPartitions); metrics.updateDeltaStreamerKafkaDelayCountMetrics(delayOffsetCalculation(lastCheckpointStr, topicPartitions, consumer)); } else { -KafkaResetOffsetStrategies autoResetValue = KafkaResetOffsetStrategies -.valueOf(props.getString("auto.offset.reset",
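The committed `KafkaOffsetGen` change above resolves "hoodie.deltastreamer.source.kafka.auto.reset.offsets" by looping over the enum values, comparing case-insensitively against the configured string, and throwing if nothing matches. The same lookup in isolation (enum and exception types simplified from the actual patch):

```java
/** Simplified version of the reset-strategy lookup in the KafkaOffsetGen patch. */
public class KafkaResetParser {
  public enum Strategy { LATEST, EARLIEST }

  public static Strategy parse(String configured) {
    for (Strategy entry : Strategy.values()) {
      // The patch compares entry.name().toLowerCase() to the raw config value, which
      // only matches lower-case settings; equalsIgnoreCase is the more forgiving variant.
      if (entry.name().equalsIgnoreCase(configured)) {
        return entry;
      }
    }
    throw new IllegalArgumentException(
        "auto.reset.offsets config set to unknown value " + configured);
  }
}
```

Note that `equalsIgnoreCase` also accepts "LATEST", which the patch as written would reject; that is arguably a small follow-up improvement rather than the committed behavior.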
[GitHub] [hudi] nsivabalan merged pull request #2227: [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce
nsivabalan merged pull request #2227: URL: https://github.com/apache/hudi/pull/2227
[GitHub] [hudi] codecov-io edited a comment on pull request #2519: [HUDI-1573] Spark Sql Writer support Multi preCmp Field
codecov-io edited a comment on pull request #2519: URL: https://github.com/apache/hudi/pull/2519#issuecomment-771782258 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2519?src=pr=h1) Report > Merging [#2519](https://codecov.io/gh/apache/hudi/pull/2519?src=pr=desc) (6f0fd84) into [master](https://codecov.io/gh/apache/hudi/commit/06dc7c7fd8a867a1e1da90f7dc19b0cc2da69bba?el=desc) (06dc7c7) will **increase** coverage by `18.14%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2519/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2519?src=pr=tree)

```diff
@@              Coverage Diff              @@
##             master    #2519       +/-   ##
=============================================
+ Coverage     51.22%   69.36%   +18.14%
+ Complexity     3230      356     -2874
=============================================
  Files           438       53      -385
  Lines         20093     1929    -18164
  Branches       2069      230     -1839
=============================================
- Hits          10292     1338     -8954
+ Misses         8954      458     -8496
+ Partials        847      133      -714
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.36% <ø> (-0.16%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2519?src=pr=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `64.53% <0.00%> (-1.17%)` | `32.00% <0.00%> (ø%)` | | | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.00% <0.00%> (-0.36%)` | `50.00% <0.00%> (-1.00%)` | | | [.../apache/hudi/metadata/HoodieTableMetadataUtil.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllVGFibGVNZXRhZGF0YVV0aWwuamF2YQ==) | | | | | [...a/org/apache/hudi/common/util/CompactionUtils.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQ29tcGFjdGlvblV0aWxzLmphdmE=) | | | | | [...e/hudi/common/table/log/HoodieLogFormatReader.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRSZWFkZXIuamF2YQ==) | | | | | [.../apache/hudi/common/table/log/HoodieLogFormat.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXQuamF2YQ==) | | | | | [...main/scala/org/apache/hudi/HoodieWriterUtils.scala](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZVdyaXRlclV0aWxzLnNjYWxh) | | | | | 
[.../org/apache/hudi/common/model/BaseAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0Jhc2VBdnJvUGF5bG9hZC5qYXZh) | | | | | [...di/source/JsonStringToHoodieRecordMapFunction.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvSnNvblN0cmluZ1RvSG9vZGllUmVjb3JkTWFwRnVuY3Rpb24uamF2YQ==) | | | | | [...ava/org/apache/hudi/payload/AWSDmsAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvcGF5bG9hZC9BV1NEbXNBdnJvUGF5bG9hZC5qYXZh) | | | | | ... and [374 more](https://codecov.io/gh/apache/hudi/pull/2519/diff?src=pr=tree-more) | | This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
[GitHub] [hudi] rakeshramakrishnan edited a comment on issue #2439: [SUPPORT] Unable to sync with external hive metastore via metastore uris in the thrift protocol
rakeshramakrishnan edited a comment on issue #2439: URL: https://github.com/apache/hudi/issues/2439#issuecomment-785795722 @nsivabalan : There are no errors, however through hudi, the connection is made to the local hive metastore (from spark). It doesn't connect to the external hive metastore. But, without hudi, the spark catalog fetches tables hive tables from the external metastore ``` spark = SparkSession.builder \ .appName("test-hudi-hive-sync") \ .enableHiveSupport() \ .config("hive.metastore.uris", metastore_uri) \ .getOrCreate() print("Before {}".format(spark.catalog.listTables())) --> returns tables from `metastore_uri` ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Xoln commented on a change in pull request #2520: [HUDI-1446] Support skip bootstrapIndex's init in abstract fs view init
Xoln commented on a change in pull request #2520: URL: https://github.com/apache/hudi/pull/2520#discussion_r582717002

## File path: hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/DefaultBootstrapIndex.java

```diff
@@ -0,0 +1,61 @@
+package org.apache.hudi.common.bootstrap.index;
+
+import org.apache.hudi.common.model.BootstrapFileMapping;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+
+import java.util.*;
+
+public class DefaultBootstrapIndex extends BootstrapIndex {
```

Review comment: Yes, you are right. When AbstractTableFileSystemView is initialized, the class named by `hoodie.bootstrap.index.class` is read from the default config and instantiated. I think the default should be an index that performs no operations and has no dependencies, because some code paths never check whether the index is null.
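The "index with no operations" suggested above is the classic null-object pattern. As a minimal sketch of the idea (not Hudi's actual `BootstrapIndex` API; names here are illustrative), every method returns a cheap constant so callers can use the index unconditionally, without null checks:

```python
# Null-object sketch of a no-op bootstrap index. The class and method names
# are hypothetical, chosen only to illustrate the pattern under discussion.

class NoOpBootstrapIndex:
    """Stand-in index that satisfies the interface without any I/O or state."""

    def use_index(self) -> bool:
        # Signals that bootstrap source-file mappings should be skipped.
        return False

    def get_source_file_mapping(self, partition: str, file_id: str):
        # A no-op index never has a mapping.
        return None

    def indexed_partition_paths(self) -> list:
        return []


def resolve_base_file(index, partition, file_id):
    # Caller code needs no "is the index configured?" check: the no-op index
    # simply yields no mapping, so the plain file path is used.
    mapping = index.get_source_file_mapping(partition, file_id)
    return mapping if mapping is not None else f"{partition}/{file_id}"


index = NoOpBootstrapIndex()
print(resolve_base_file(index, "2021/02/25", "f1"))  # 2021/02/25/f1
```

The design point is exactly the one in the review comment: with a no-op default, code that forgets the null check still behaves correctly.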
[GitHub] [hudi] codecov-io commented on pull request #2604: [hudi-1639][hudi-flink] fix BucketAssigner npe
codecov-io commented on pull request #2604: URL: https://github.com/apache/hudi/pull/2604#issuecomment-785704660

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2604?src=pr=h1) Report

> Merging [#2604](https://codecov.io/gh/apache/hudi/pull/2604?src=pr=desc) (fed6575) into [master](https://codecov.io/gh/apache/hudi/commit/06dc7c7fd8a867a1e1da90f7dc19b0cc2da69bba?el=desc) (06dc7c7) will **decrease** coverage by `41.52%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2604/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2604?src=pr=tree)

```diff
@@             Coverage Diff              @@
##           master    #2604       +/-   ##
============================================
- Coverage   51.22%    9.69%    -41.53%
+ Complexity   3230       48      -3182
============================================
  Files         438       53       -385
  Lines       20093     1929     -18164
  Branches     2069      230      -1839
============================================
- Hits        10292      187     -10105
+ Misses       8954     1729      -7225
+ Partials      847       13       -834
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.69% <ø> (-59.83%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2604?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2604/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>