Re: [I] [SUPPORT] Cloudwatch metrics not published in moving from 0.12.1 to 0.14 [hudi]
ad1happy2go commented on issue #11205: URL: https://github.com/apache/hudi/issues/11205#issuecomment-2109345823 @ajain-cohere Can you post the complete stack trace? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
hudi-bot commented on PR #11210: URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109343812 ## CI report: * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896) * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897) * 6ab5ee102486ffe40778da29cd7eb2733e3f6b0e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23899) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT] Error executing Merge On Read [hudi]
jai20242 commented on issue #11199: URL: https://github.com/apache/hudi/issues/11199#issuecomment-2109341298 And why does it only happen with Merge On Read? Also, I have tested version 1.0.0-beta and it doesn't happen there (it works well, but we can't use a beta version in production).
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
jonvex commented on code in PR #11210: URL: https://github.com/apache/hudi/pull/11210#discussion_r1599415963 ## hudi-common/src/main/java/org/apache/hudi/common/config/HoodieStorageConfig.java: ## @@ -87,6 +87,8 @@ public class HoodieStorageConfig extends HoodieConfig { .withDocumentation("Lower values increase the size in bytes of metadata tracked within HFile, but can offer potentially " + "faster lookup times."); + + Review Comment: remove extra lines
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
hudi-bot commented on PR #11210: URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109334800 ## CI report: * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896) * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897) * 6ab5ee102486ffe40778da29cd7eb2733e3f6b0e UNKNOWN
Re: [PR] [HUDI-7617] Fix issues for bulk insert user defined partitioner in StreamSync [hudi]
hudi-bot commented on PR #11014: URL: https://github.com/apache/hudi/pull/11014#issuecomment-2109334306 ## CI report: * 1ee0f29f2bd4b02aeb3370d864cbdae946be809e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23895) * 33710549e6c4071bd327ef528e17302e42bf829c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23898)
Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]
hudi-bot commented on PR #10922: URL: https://github.com/apache/hudi/pull/10922#issuecomment-2109334107 ## CI report: * 802eeb74510ddbeb5cd952d9192aaf623d9c7ee9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23894)
Re: [I] [SUPPORT] In hudi 0.14.0, the hoodie.properties file is modified with each micro batch. [hudi]
CaesarWangX closed issue #11200: [SUPPORT] In hudi 0.14.0, the hoodie.properties file is modified with each micro batch. URL: https://github.com/apache/hudi/issues/11200
Re: [I] [SUPPORT]hoodie.datasource.read.file.index.listing.mode is always eager [hudi]
CaesarWangX closed issue #11201: [SUPPORT]hoodie.datasource.read.file.index.listing.mode is always eager URL: https://github.com/apache/hudi/issues/11201
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
hudi-bot commented on PR #11210: URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109290453 ## CI report: * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896) * 4442f34765c904d3995fd5047c2e8a6197525c5b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23897)
Re: [PR] [HUDI-7617] Fix issues for bulk insert user defined partitioner in StreamSync [hudi]
hudi-bot commented on PR #11014: URL: https://github.com/apache/hudi/pull/11014#issuecomment-2109289845 ## CI report: * ca6231f4648a3cfe9e1a14aa76987d2f26a69919 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23237) * 1ee0f29f2bd4b02aeb3370d864cbdae946be809e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23895) * 33710549e6c4071bd327ef528e17302e42bf829c UNKNOWN
Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]
hudi-bot commented on PR #10922: URL: https://github.com/apache/hudi/pull/10922#issuecomment-2109289418 ## CI report: * 802eeb74510ddbeb5cd952d9192aaf623d9c7ee9 UNKNOWN
Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]
ziudu commented on issue #11204: URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109284886 I'm a newbie. It took me a while to understand why bucket join does not work.
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
hudi-bot commented on PR #11210: URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109282123 ## CI report: * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896) * 4442f34765c904d3995fd5047c2e8a6197525c5b UNKNOWN
[I] DELETE Statement Deleting Another Record [hudi]
Amar1404 opened a new issue, #11212: URL: https://github.com/apache/hudi/issues/11212 **Describe the problem you faced** I have duplicate keys in a Hudi table due to an insert statement. When I tried deleting one of the rows for that key using an additional filter, both rows were deleted. **To Reproduce** Steps to reproduce the behavior: 1. Create a non-partitioned table and insert two records with the same key. 2. Try to delete only one of the rows by filtering on the key and _hoodie_commit_seqno. 3. Check the table: both records have been deleted. **Expected behavior** The delete command should only delete the one row matched by the filter. **Environment Description** * Hudi version : 0.12.3 * Spark version : 3.3 * Hive version : 3 * Hadoop version : * Storage (HDFS/S3/GCS..) : s3 * Running on Docker? (yes/no) : no
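The behavior described in the reproduction steps is consistent with deletes being applied per record key rather than per row. The following is a conceptual sketch in plain Python, not Hudi code; the field names and the `apply_delete_by_key` helper are purely illustrative of that semantics:

```python
# Conceptual sketch (not Hudi's implementation): illustrates why a delete
# can remove every row sharing a record key. In a key-deduplicated table,
# deletes are applied per record key, so extra filter columns used to pick
# the rows to delete are not carried into the delete itself.

def apply_delete_by_key(rows, delete_filter):
    # Step 1: the filter selects the rows the user *intended* to delete.
    selected_keys = {r["key"] for r in rows if delete_filter(r)}
    # Step 2: the delete is applied by key alone, so every row whose key
    # matches is removed, including rows the filter did not select.
    return [r for r in rows if r["key"] not in selected_keys]

table = [
    {"key": "k1", "_hoodie_commit_seqno": "001", "val": "a"},
    {"key": "k1", "_hoodie_commit_seqno": "002", "val": "b"},  # duplicate key
    {"key": "k2", "_hoodie_commit_seqno": "003", "val": "c"},
]

# The user intends to delete only the k1 row with seqno "001"...
remaining = apply_delete_by_key(
    table, lambda r: r["key"] == "k1" and r["_hoodie_commit_seqno"] == "001")
# ...but both k1 rows are gone, matching the behavior reported above.
```

This is only a model of the observed symptom; whether Hudi should honor the row-level filter for tables with duplicate keys is exactly what the issue asks.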
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
hudi-bot commented on PR #11210: URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109274982 ## CI report: * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]
hudi-bot commented on PR #10922: URL: https://github.com/apache/hudi/pull/10922#issuecomment-2109274435 ## CI report: * 802eeb74510ddbeb5cd952d9192aaf623d9c7ee9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23894)
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
yihua commented on code in PR #11210: URL: https://github.com/apache/hudi/pull/11210#discussion_r1599368691 ## hudi-common/src/test/java/org/apache/hudi/common/testutils/reader/HoodieFileSliceTestUtils.java: ## @@ -207,7 +208,7 @@ private static HoodieDataBlock createDataBlock( false, header, HoodieRecord.RECORD_KEY_METADATA_FIELD, -CompressionCodecName.GZIP, +"gzip", Review Comment: Replaced such occurrences with the default config value.
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
yihua commented on code in PR #11210: URL: https://github.com/apache/hudi/pull/11210#discussion_r1599366211 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java: ## @@ -74,11 +61,10 @@ * base file format. */ public class HoodieHFileDataBlock extends HoodieDataBlock { + public static final String HFILE_COMPRESSION_ALGO_PARAM_KEY = "hfile_compression_algo"; Review Comment: Fixed by using `HFILE_COMPRESSION_ALGORITHM_NAME.key()` directly. Also, I directly pass the String value of the config down so the String value is directly converted to the corresponding `Compression.Algorithm`, like `ParquetUtils`.
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
yihua commented on code in PR #11210: URL: https://github.com/apache/hudi/pull/11210#discussion_r1599351065 ## hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java: ## @@ -99,29 +90,17 @@ public HoodieLogBlockType getBlockType() { @Override protected byte[] serializeRecords(List records, StorageConfiguration storageConf) throws IOException { -if (records.size() == 0) { - return new byte[0]; -} - -Schema writerSchema = new Schema.Parser().parse(super.getLogBlockHeader().get(HeaderMetadataType.SCHEMA)); -ByteArrayOutputStream outputStream = new ByteArrayOutputStream(); -HoodieConfig config = new HoodieConfig(); -config.setValue(PARQUET_COMPRESSION_CODEC_NAME.key(), compressionCodecName.get().name()); -config.setValue(PARQUET_BLOCK_SIZE.key(), String.valueOf(ParquetWriter.DEFAULT_BLOCK_SIZE)); -config.setValue(PARQUET_PAGE_SIZE.key(), String.valueOf(ParquetWriter.DEFAULT_PAGE_SIZE)); -config.setValue(PARQUET_MAX_FILE_SIZE.key(), String.valueOf(1024 * 1024 * 1024)); -config.setValue(PARQUET_COMPRESSION_RATIO_FRACTION.key(), String.valueOf(expectedCompressionRatio.get())); -config.setValue(PARQUET_DICTIONARY_ENABLED, String.valueOf(useDictionaryEncoding.get())); -HoodieRecordType recordType = records.iterator().next().getRecordType(); -try (HoodieFileWriter parquetWriter = HoodieFileWriterFactory.getFileWriter( -HoodieFileFormat.PARQUET, outputStream, storageConf, config, writerSchema, recordType)) { - for (HoodieRecord record : records) { -String recordKey = getRecordKey(record).orElse(null); -parquetWriter.write(recordKey, record, writerSchema); - } - outputStream.flush(); -} -return outputStream.toByteArray(); +Map paramsMap = new HashMap<>(); +paramsMap.put(PARQUET_COMPRESSION_CODEC_NAME.key(), compressionCodecName.get()); +paramsMap.put(PARQUET_COMPRESSION_RATIO_FRACTION.key(), String.valueOf(expectedCompressionRatio.get())); +paramsMap.put(PARQUET_DICTIONARY_ENABLED.key(), String.valueOf(useDictionaryEncoding.get())); + 
+return FileFormatUtils.getInstance(PARQUET).serializeRecordsToLogBlock( +storageConf, records, +new Schema.Parser().parse(super.getLogBlockHeader().get(HoodieLogBlock.HeaderMetadataType.SCHEMA)), Review Comment: Fixed.
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
yihua commented on code in PR #11210: URL: https://github.com/apache/hudi/pull/11210#discussion_r1599348814 ## hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java: ## @@ -366,6 +382,35 @@ public void writeMetaFile(HoodieStorage storage, } } + @Override + public byte[] serializeRecordsToLogBlock(StorageConfiguration storageConf, + List records, + Schema writerSchema, + Schema readerSchema, + String keyFieldName, + Map paramsMap) throws IOException { +if (records.size() == 0) { + return new byte[0]; +} + +ByteArrayOutputStream outputStream = new ByteArrayOutputStream(); +HoodieConfig config = new HoodieConfig(); +paramsMap.entrySet().stream().forEach(entry -> config.setValue(entry.getKey(), entry.getValue())); +config.setValue(PARQUET_BLOCK_SIZE.key(), String.valueOf(ParquetWriter.DEFAULT_BLOCK_SIZE)); +config.setValue(PARQUET_PAGE_SIZE.key(), String.valueOf(ParquetWriter.DEFAULT_PAGE_SIZE)); +config.setValue(PARQUET_MAX_FILE_SIZE.key(), String.valueOf(1024 * 1024 * 1024)); Review Comment: This PR only moves the code. I've created a follow-up to revisit these hardcoded config values, HUDI-7755. My understanding is that for log blocks, current settings are good enough for log scanning.
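The hardcoded values discussed above matter because of the merge order in the quoted diff: the caller-supplied `paramsMap` is copied into the config first, and the serializer then overwrites the block size, page size, and max file size keys, so those three settings always win over anything the caller passes. A small Python sketch of that merge order (key names and numeric values are illustrative only, not Hudi's actual config keys or defaults):

```python
# Illustrative sketch of the config-merge order in the quoted diff:
# caller params are applied first, then the serializer's fixed values,
# so the fixed values always win for those keys (the subject of HUDI-7755).

HARDCODED = {
    "parquet.block.size": str(128 * 1024 * 1024),      # illustrative values,
    "parquet.page.size": str(1024 * 1024),             # not necessarily
    "parquet.max.file.size": str(1024 * 1024 * 1024),  # Hudi's defaults
}

def build_serializer_config(caller_params):
    config = {}
    config.update(caller_params)  # step 1: copy caller-supplied params
    config.update(HARDCODED)      # step 2: fixed values overwrite them
    return config

cfg = build_serializer_config({"parquet.block.size": "42", "compression": "gzip"})
# The caller's block size is silently overridden; unrelated keys pass through.
```

Making those keys configurable would simply mean reversing the two `update` steps, which is the kind of change the follow-up JIRA tracks.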
[jira] [Updated] (HUDI-7755) Revisit the configs in ParquetUtils.serializeRecordsToLogBlock
[ https://issues.apache.org/jira/browse/HUDI-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7755: Description: For serializing log records to Parquet log blocks, there are hardcoded config values for writing the records in parquet format (serializeRecordsToLogBlock) > Key: HUDI-7755 > URL: https://issues.apache.org/jira/browse/HUDI-7755 > Project: Apache Hudi > Issue Type: Improvement > Reporter: Ethan Guo > Priority: Major > Fix For: 1.1.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7755) Revisit the configs in ParquetUtils.serializeRecordsToLogBlock
[ https://issues.apache.org/jira/browse/HUDI-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7755: Description: For serializing log records to Parquet log blocks, there are hardcoded config values for writing the records in parquet format (ParquetUtils.serializeRecordsToLogBlock). We need to revisit this part of logic to see if they should be configurable. (was: For serializing log records to Parquet log blocks, there are hardcoded config values for writing the records in parquet format (serializeRecordsToLogBlock))
[jira] [Updated] (HUDI-7755) Revisit the configs in ParquetUtils.serializeRecordsToLogBlock
[ https://issues.apache.org/jira/browse/HUDI-7755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7755: Fix Version/s: 1.1.0
[jira] [Created] (HUDI-7755) Revisit the configs in ParquetUtils.serializeRecordsToLogBlock
Ethan Guo created HUDI-7755: Summary: Revisit the configs in ParquetUtils.serializeRecordsToLogBlock Key: HUDI-7755 URL: https://issues.apache.org/jira/browse/HUDI-7755 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
hudi-bot commented on PR #11210: URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109237024 ## CI report: * 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23892) * 1e7ab5d044f35d65670bb0fc442721e01a677d8d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23896)
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
yihua commented on code in PR #11210: URL: https://github.com/apache/hudi/pull/11210#discussion_r1599346193 ## hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java: ## @@ -128,4 +165,89 @@ public HoodieFileFormat getFormat() { public void writeMetaFile(HoodieStorage storage, StoragePath filePath, Properties props) throws IOException { throw new UnsupportedOperationException("HFileUtils does not support writeMetaFile"); } + + @Override + public byte[] serializeRecordsToLogBlock(StorageConfiguration storageConf, + List records, + Schema writerSchema, + Schema readerSchema, + String keyFieldName, + Map paramsMap) throws IOException { +Compression.Algorithm compressionAlgorithm = getHFileCompressionAlgorithm(paramsMap); +HFileContext context = new HFileContextBuilder() +.withBlockSize(DEFAULT_BLOCK_SIZE_FOR_LOG_FILE) +.withCompression(compressionAlgorithm) +.withCellComparator(new HoodieHBaseKVComparator()) +.build(); + +Configuration conf = storageConf.unwrapAs(Configuration.class); +CacheConfig cacheConfig = new CacheConfig(conf); +ByteArrayOutputStream baos = new ByteArrayOutputStream(); +FSDataOutputStream ostream = new FSDataOutputStream(baos, null); + +// Use simple incrementing counter as a key +boolean useIntegerKey = !getRecordKey(records.get(0), readerSchema, keyFieldName).isPresent(); +// This is set here to avoid re-computing this in the loop +int keyWidth = useIntegerKey ? (int) Math.ceil(Math.log(records.size())) + 1 : -1; + +// Serialize records into bytes +Map> sortedRecordsMap = new TreeMap<>(); + +Iterator itr = records.iterator(); +int id = 0; +while (itr.hasNext()) { + HoodieRecord record = itr.next(); + String recordKey; + if (useIntegerKey) { +recordKey = String.format("%" + keyWidth + "s", id++); + } else { +recordKey = getRecordKey(record, readerSchema, keyFieldName).get(); + } + + final byte[] recordBytes = serializeRecord(record, writerSchema, keyFieldName); + // If key exists in the map, append to its list. 
If not, create a new list. + // Get the existing list of recordBytes for the recordKey, or an empty list if it doesn't exist + List recordBytesList = sortedRecordsMap.getOrDefault(recordKey, new ArrayList<>()); + recordBytesList.add(recordBytes); + // Put the updated list back into the map + sortedRecordsMap.put(recordKey, recordBytesList); +} + +HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig) +.withOutputStream(ostream).withFileContext(context).create(); + +// Write the records +sortedRecordsMap.forEach((recordKey, recordBytesList) -> { + for (byte[] recordBytes : recordBytesList) { +try { + KeyValue kv = new KeyValue(recordKey.getBytes(), null, null, recordBytes); + writer.append(kv); +} catch (IOException e) { + throw new HoodieIOException("IOException serializing records", e); +} + } +}); + +writer.appendFileInfo( +getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), getUTF8Bytes(readerSchema.toString())); + +writer.close(); +ostream.flush(); +ostream.close(); + +return baos.toByteArray(); + } + + private Option getRecordKey(HoodieRecord record, Schema readerSchema, String keyFieldName) { +return Option.ofNullable(record.getRecordKey(readerSchema, keyFieldName)); + } + + private byte[] serializeRecord(HoodieRecord record, Schema schema, String keyFieldName) throws IOException { Review Comment: Fixed.
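The serializer quoted in this review groups record payloads into a sorted map of key to list before appending them, because an HFile writer requires cells in ascending key order; duplicate keys are kept adjacent and written back-to-back. A standalone Python sketch of just that grouping step (illustrative only; it mimics the TreeMap-of-lists logic in the diff, not Hudi's actual writer):

```python
from collections import defaultdict

def group_records_for_hfile(records):
    """records: list of (key, payload_bytes) pairs, possibly unsorted and
    with duplicate keys. Returns the pairs in the ascending key order an
    HFile writer requires, duplicates kept adjacent in arrival order."""
    # Plays the role of the TreeMap<String, List<byte[]>> in the quoted code.
    grouped = defaultdict(list)
    for key, payload in records:
        grouped[key].append(payload)
    # Emit cells in ascending lexicographic key order, as HFile expects.
    out = []
    for key in sorted(grouped):
        for payload in grouped[key]:
            out.append((key, payload))
    return out

cells = group_records_for_hfile([("k2", b"b"), ("k1", b"a"), ("k2", b"c")])
# -> [("k1", b"a"), ("k2", b"b"), ("k2", b"c")]
```

Buffering everything in a sorted map trades memory for write-order correctness, which is acceptable here because a log block holds a bounded batch of records.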
Re: [PR] [HUDI-7617] Fix issues for bulk insert user defined partitioner in StreamSync [hudi]
hudi-bot commented on PR #11014: URL: https://github.com/apache/hudi/pull/11014#issuecomment-2109236780 ## CI report: * ca6231f4648a3cfe9e1a14aa76987d2f26a69919 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23237) * 1ee0f29f2bd4b02aeb3370d864cbdae946be809e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23895)
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
yihua commented on code in PR #11210: URL: https://github.com/apache/hudi/pull/11210#discussion_r1599346082 ## hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java: ## @@ -128,4 +165,89 @@ public HoodieFileFormat getFormat() { public void writeMetaFile(HoodieStorage storage, StoragePath filePath, Properties props) throws IOException { throw new UnsupportedOperationException("HFileUtils does not support writeMetaFile"); } + + @Override + public byte[] serializeRecordsToLogBlock(StorageConfiguration storageConf, + List records, + Schema writerSchema, + Schema readerSchema, + String keyFieldName, + Map paramsMap) throws IOException { +Compression.Algorithm compressionAlgorithm = getHFileCompressionAlgorithm(paramsMap); +HFileContext context = new HFileContextBuilder() +.withBlockSize(DEFAULT_BLOCK_SIZE_FOR_LOG_FILE) +.withCompression(compressionAlgorithm) +.withCellComparator(new HoodieHBaseKVComparator()) +.build(); + +Configuration conf = storageConf.unwrapAs(Configuration.class); +CacheConfig cacheConfig = new CacheConfig(conf); +ByteArrayOutputStream baos = new ByteArrayOutputStream(); +FSDataOutputStream ostream = new FSDataOutputStream(baos, null); + +// Use simple incrementing counter as a key +boolean useIntegerKey = !getRecordKey(records.get(0), readerSchema, keyFieldName).isPresent(); +// This is set here to avoid re-computing this in the loop +int keyWidth = useIntegerKey ? (int) Math.ceil(Math.log(records.size())) + 1 : -1; + +// Serialize records into bytes +Map> sortedRecordsMap = new TreeMap<>(); + +Iterator itr = records.iterator(); +int id = 0; +while (itr.hasNext()) { + HoodieRecord record = itr.next(); + String recordKey; + if (useIntegerKey) { +recordKey = String.format("%" + keyWidth + "s", id++); + } else { +recordKey = getRecordKey(record, readerSchema, keyFieldName).get(); + } + + final byte[] recordBytes = serializeRecord(record, writerSchema, keyFieldName); + // If key exists in the map, append to its list. 
If not, create a new list. + // Get the existing list of recordBytes for the recordKey, or an empty list if it doesn't exist + List recordBytesList = sortedRecordsMap.getOrDefault(recordKey, new ArrayList<>()); + recordBytesList.add(recordBytes); + // Put the updated list back into the map + sortedRecordsMap.put(recordKey, recordBytesList); +} + +HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig) +.withOutputStream(ostream).withFileContext(context).create(); + +// Write the records +sortedRecordsMap.forEach((recordKey, recordBytesList) -> { + for (byte[] recordBytes : recordBytesList) { +try { + KeyValue kv = new KeyValue(recordKey.getBytes(), null, null, recordBytes); + writer.append(kv); +} catch (IOException e) { + throw new HoodieIOException("IOException serializing records", e); +} + } +}); + +writer.appendFileInfo( +getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), getUTF8Bytes(readerSchema.toString())); + +writer.close(); +ostream.flush(); +ostream.close(); + +return baos.toByteArray(); + } + + private Option getRecordKey(HoodieRecord record, Schema readerSchema, String keyFieldName) { Review Comment: Fixed.
Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]
hudi-bot commented on PR #10922: URL: https://github.com/apache/hudi/pull/10922#issuecomment-210923

## CI report:

* 1c36f92dbff0e9be085a409d28cb9403a0343781 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23866)
* 802eeb74510ddbeb5cd952d9192aaf623d9c7ee9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23894)

Bot commands — `@hudi-bot` supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
yihua commented on code in PR #11210: URL: https://github.com/apache/hudi/pull/11210#discussion_r1599345810

## hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:

```java
    writer.appendFileInfo(
        getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), getUTF8Bytes(readerSchema.toString()));

    writer.close();
    ostream.flush();
```

Review Comment: This flushes the data to the `ByteArrayOutputStream` after the writer is closed, and `writer.close()` flushes the data internally. This PR only moves this part of the code from `HoodieHFileDataBlock` to the `HFileUtils` class.
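The close-then-flush ordering under discussion can be demonstrated with JDK streams alone: closing a buffered writer flushes its internal buffer into the underlying stream, so a later `flush()` on that stream is effectively a no-op. This is a simplified stand-in for the HFile writer over `FSDataOutputStream`, not Hudi code:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class CloseFlushOrder {
  static byte[] writeAndClose() throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    BufferedOutputStream writer = new BufferedOutputStream(baos, 1024);
    writer.write("payload".getBytes(StandardCharsets.UTF_8));
    // Nothing has reached baos yet: the 7 bytes sit in the 1 KiB buffer.
    int before = baos.size();
    writer.close();   // close() flushes the buffer into baos before releasing it
    baos.flush();     // a no-op for ByteArrayOutputStream, kept for symmetry
    int after = baos.size();
    System.out.println(before + " -> " + after); // 0 -> 7
    return baos.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    System.out.println(new String(writeAndClose(), StandardCharsets.UTF_8)); // payload
  }
}
```

This is why flushing before or after `writer.close()` makes no difference to the captured bytes: the close already pushed everything down.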
Re: [I] [SUPPORT]hoodie.datasource.read.file.index.listing.mode is always eager [hudi]
danny0405 commented on issue #11201: URL: https://github.com/apache/hudi/issues/11201#issuecomment-2109232721

> It seems that this issue has been fixed in version 0.14.1

Yeah, you got it.
Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]
hudi-bot commented on PR #10922: URL: https://github.com/apache/hudi/pull/10922#issuecomment-2109230347

## CI report:

* 1c36f92dbff0e9be085a409d28cb9403a0343781 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23866)
* 802eeb74510ddbeb5cd952d9192aaf623d9c7ee9 UNKNOWN

Bot commands — `@hudi-bot` supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
hudi-bot commented on PR #11210: URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109230874

## CI report:

* 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23892)
* 1e7ab5d044f35d65670bb0fc442721e01a677d8d UNKNOWN

Bot commands — `@hudi-bot` supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7617] Fix issues for bulk insert user defined partitioner in StreamSync [hudi]
hudi-bot commented on PR #11014: URL: https://github.com/apache/hudi/pull/11014#issuecomment-2109230509

## CI report:

* ca6231f4648a3cfe9e1a14aa76987d2f26a69919 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23237)
* 1ee0f29f2bd4b02aeb3370d864cbdae946be809e UNKNOWN

Bot commands — `@hudi-bot` supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
yihua commented on code in PR #11210: URL: https://github.com/apache/hudi/pull/11210#discussion_r1599341130

## hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:

```java
@@ -35,21 +39,54 @@
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.compress.Compression;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

+import java.io.ByteArrayOutputStream;
 import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.Properties;
 import java.util.Set;
+import java.util.TreeMap;
+
+import static org.apache.hudi.common.table.log.block.HoodieHFileDataBlock.HFILE_COMPRESSION_ALGO_PARAM_KEY;
+import static org.apache.hudi.common.util.StringUtils.getUTF8Bytes;

 /**
  * Utility functions for HFile files.
  */
-public class HFileUtils extends BaseFileUtils {
-
+public class HFileUtils extends FileFormatUtils {
   private static final Logger LOG = LoggerFactory.getLogger(HFileUtils.class);
+  private static final int DEFAULT_BLOCK_SIZE_FOR_LOG_FILE = 1024 * 1024;
+
+  /**
+   * Gets the {@link Compression.Algorithm} Enum based on the {@link CompressionCodec} name.
+   *
+   * @param paramsMap parameter map containing the compression codec config.
+   * @return the {@link Compression.Algorithm} Enum.
+   */
+  public static Compression.Algorithm getHFileCompressionAlgorithm(Map<String, String> paramsMap) {
+    String algoName = paramsMap.get(HFILE_COMPRESSION_ALGO_PARAM_KEY);
+    if (algoName == null) {
```

Review Comment: Fixed. A new test is added.
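The null-handling fix referenced above (falling back to a default when the compression parameter is absent) amounts to a defaulted map lookup. A minimal stand-in using a plain enum instead of HBase's `Compression.Algorithm` — the key name and default below are illustrative, not Hudi's actual constants:

```java
import java.util.Locale;
import java.util.Map;

public class CompressionLookup {
  enum Algorithm { GZ, NONE, SNAPPY }

  // Hypothetical key name standing in for HFILE_COMPRESSION_ALGO_PARAM_KEY
  static final String COMPRESSION_KEY = "hfile.compression.algorithm";

  // Fall back to GZ when the parameter is missing or empty, so a null value
  // can never reach Algorithm.valueOf and throw an NPE.
  static Algorithm getCompression(Map<String, String> params) {
    String name = params.get(COMPRESSION_KEY);
    if (name == null || name.isEmpty()) {
      return Algorithm.GZ;
    }
    return Algorithm.valueOf(name.toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    System.out.println(getCompression(Map.of()));                          // GZ
    System.out.println(getCompression(Map.of(COMPRESSION_KEY, "snappy"))); // SNAPPY
  }
}
```

The "new test is added" in the comment would cover exactly the missing-key branch shown here.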
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
yihua commented on PR #11210: URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109225533

> also not a fan of the `org.apache.hudi.io.compress.` package name. But probably too late to change now

Since the compression logic is also under the scope of IO, we put it under this package name.
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
hudi-bot commented on PR #11210: URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109224269

## CI report:

* 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23892)

Bot commands — `@hudi-bot` supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT]FileID of partition path xxx=xx does not exist. [hudi]
danny0405 commented on issue #11202: URL: https://github.com/apache/hudi/issues/11202#issuecomment-2109224081

```java
Caused by: java.util.NoSuchElementException: FileID x of partition path dt=2019-02-20 does not exist.
  at org.apache.hudi.io.HoodieMergeHandle.getLatestBaseFile(HoodieMergeHandle.java:159)
  at org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:121)
  at org.apache.hudi.io.FlinkMergeHandle.<init>(FlinkMergeHandle.java:70)
  at org.apache.hudi.io.FlinkConcatHandle.<init>(FlinkConcatHandle.java:53)
  at org.apache.hudi.client.HoodieFlinkWriteClient.getOrCreateWriteHandle(HoodieFlinkWriteClient.java:557)
  at org.apache.hudi.client.HoodieFlinkWriteClient.insert(HoodieFlinkWriteClient.java:175)
  at org.apache.hudi.sink.StreamWriteFunction.lambda$initWriteFunction$0(StreamWriteFunction.java:181)
  at org.apache.hudi.sink.StreamWriteFunction.lambda$flushRemaining$7(StreamWriteFunction.java:461)
```

The error message indicates that you enabled inline clustering for Flink. Can you disable that and try again using async clustering instead?
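For reference, switching the Flink writer from inline to async clustering is a table-option change along these lines. This is a sketch only: the table definition is made up, and the option names follow Hudi's `FlinkOptions` but should be verified against your Hudi version's documentation:

```sql
CREATE TABLE hudi_sink (
  id VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_sink',          -- placeholder path
  'table.type' = 'COPY_ON_WRITE',
  'clustering.schedule.enabled' = 'true',    -- schedule clustering plans
  'clustering.async.enabled' = 'true'        -- execute them asynchronously, not inline
);
```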
Re: [PR] [HUDI-7624] Fixing index tagging duration [hudi]
hudi-bot commented on PR #11035: URL: https://github.com/apache/hudi/pull/11035#issuecomment-2109223951

## CI report:

* 074845c216002fc00c28dcbb7720ffc05bdc7e8f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23891)

Bot commands — `@hudi-bot` supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
jonvex commented on code in PR #11210: URL: https://github.com/apache/hudi/pull/11210#discussion_r1599326562

## hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:

```java
    writer.appendFileInfo(
        getUTF8Bytes(HoodieAvroHFileReaderImplBase.SCHEMA_KEY), getUTF8Bytes(readerSchema.toString()));

    writer.close();
    ostream.flush();
```

Review Comment: Wouldn't we want to flush before closing the writer?

## hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/HFileUtils.java:

```java
/**
 * Utility functions for HFile files.
 */
-public class HFileUtils extends BaseFileUtils {
-
+public class HFileUtils extends FileFormatUtils {
   private static final Logger LOG = LoggerFactory.getLogger(HFileUtils.class);
+  private static final int DEFAULT_BLOCK_SIZE_FOR_LOG_FILE = 1024 * 1024;
```
Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]
danny0405 commented on issue #11204: URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109221161

> So if we have to choose one between spark and hive, I think spark might be of higher priority

I agree. Do you have the bandwidth to complete that suspended PR?
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
yihua merged PR #11208: URL: https://github.com/apache/hudi/pull/11208
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
yihua commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599331946

## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java:

```java
@@ -58,27 +58,10 @@ public class HoodieHadoopStorage extends HoodieStorage {

  private final FileSystem fs;

- public HoodieHadoopStorage(HoodieStorage storage) {
```

Review Comment: Makes sense.
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
yihua commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599330264

## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieIOFactory.java:

```java
@@ -48,4 +48,13 @@ private static HoodieIOFactory getIOFactory(String ioFactoryClass) {

  public abstract HoodieFileWriterFactory getWriterFactory(HoodieRecord.HoodieRecordType recordType);

+ public abstract HoodieStorage getStorage(StoragePath storagePath);
+
+ public abstract HoodieStorage getStorage(StoragePath path,
```

Review Comment: OK. We'll take this on separately.
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
yihua commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599329583

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:

```java
@@ -470,9 +470,9 @@ public void performMergeDataValidationCheck(WriteStatus writeStatus) {
  }

  long oldNumWrites = 0;
- try (HoodieFileReader reader = HoodieIOFactory.getIOFactory(storage.getConf())
+ try (HoodieFileReader reader = HoodieIOFactory.getIOFactory(hoodieTable.getStorageConf())
```

Review Comment: Sounds good.
Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]
yihua commented on code in PR #10922: URL: https://github.com/apache/hudi/pull/10922#discussion_r1599328354

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:

```java
@@ -654,22 +640,6 @@ private HoodieLogBlock.HoodieLogBlockType pickLogDataBlockFormat() {
    }
  }

- private static Map<HeaderMetadataType, String> getUpdatedHeader(Map<HeaderMetadataType, String> header, int blockSequenceNumber, long attemptNumber,
-                                                                 HoodieWriteConfig config, boolean addBlockIdentifier) {
-   Map<HeaderMetadataType, String> updatedHeader = new HashMap<>(header);
-   if (addBlockIdentifier && !HoodieTableMetadata.isMetadataTable(config.getBasePath())) { // add block sequence numbers only for data table.
-     updatedHeader.put(HeaderMetadataType.BLOCK_IDENTIFIER, attemptNumber + "," + blockSequenceNumber);
-   }
-   if (config.shouldWritePartialUpdates()) {
```

Review Comment: I fixed it.
Re: [I] [SUPPORT]hoodie.datasource.read.file.index.listing.mode is always eager [hudi]
CaesarWangX commented on issue #11201: URL: https://github.com/apache/hudi/issues/11201#issuecomment-2109205844 It seems that this issue has been fixed in version 0.14.1
Re: [I] [SUPPORT]FileID of partition path xxx=xx does not exist. [hudi]
CaesarWangX commented on issue #11202: URL: https://github.com/apache/hudi/issues/11202#issuecomment-2109198672 The reason we do not use the metadata table is that in Spark Structured Streaming, enabling the metadata table affects the efficiency of each micro-batch, as it adds extra list operations.
Re: [PR] [HUDI-7624] Fixing index tagging duration [hudi]
danny0405 commented on code in PR #11035: URL: https://github.com/apache/hudi/pull/11035#discussion_r1599312674

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java:

```java
@@ -112,6 +113,10 @@ public BaseCommitActionExecutor(HoodieEngineContext context, HoodieWriteConfig c

  public abstract HoodieWriteMetadata<O> execute(I inputRecords);

+ public HoodieWriteMetadata<O> execute(I inputRecords, Option<HoodieTimer> sourceReadAndIndexTimer) {
+   return this.execute(inputRecords);
```

Review Comment: Not sure why we need a new `#execute` interface. I see that all the impl executors initialize the timer on the fly while invoking this method, so why not just initialize the timer in `#execute` itself?

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/HoodieWriteMetadata.java:

```java
@@ -34,6 +34,7 @@ public class HoodieWriteMetadata<O> {

  private O writeStatuses;
  private Option<Duration> indexLookupDuration = Option.empty();
```

Review Comment: Should we remove this?

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseWriteHelper.java:

```java
@@ -46,22 +47,31 @@ public HoodieWriteMetadata<O> write(String instantTime,
                                     int configuredShuffleParallelism,
                                     BaseCommitActionExecutor executor,
                                     WriteOperationType operationType) {
+   return this.write(instantTime, inputRecords, context, table, shouldCombine, configuredShuffleParallelism, executor, operationType, Option.empty());
+ }
+
+ public HoodieWriteMetadata<O> write(String instantTime,
+                                     I inputRecords,
+                                     HoodieEngineContext context,
+                                     HoodieTable table,
+                                     boolean shouldCombine,
+                                     int configuredShuffleParallelism,
+                                     BaseCommitActionExecutor executor,
+                                     WriteOperationType operationType,
+                                     Option<HoodieTimer> sourceReadAndIndexTimer) {
    try {
      // De-dupe/merge if needed
      I dedupedRecords = combineOnCondition(shouldCombine, inputRecords, configuredShuffleParallelism, table);
-     Instant lookupBegin = Instant.now();
      I taggedRecords = dedupedRecords;
```

Review Comment: Same question: why not just initialize the timer here, so that we can avoid introducing a new method?

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java:

```java
@@ -141,8 +141,8 @@ public JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, String instantTime) {
    preWrite(instantTime, WriteOperationType.UPSERT, table.getMetaClient());
    HoodieWriteMetadata<HoodieData<WriteStatus>> result = table.upsert(context, instantTime, HoodieJavaRDD.of(records));
    HoodieWriteMetadata<JavaRDD<WriteStatus>> resultRDD = result.clone(HoodieJavaRDD.getJavaRDD(result.getWriteStatuses()));
-   if (result.getIndexLookupDuration().isPresent()) {
-     metrics.updateIndexMetrics(LOOKUP_STR, result.getIndexLookupDuration().get().toMillis());
+   if (result.getSourceReadAndIndexDurationMs().isPresent()) {
+     metrics.updateSourceReadAndIndexMetrics(LOOKUP_STR, result.getSourceReadAndIndexDurationMs().get());
```

Review Comment: Should we still use `LOOKUP_STR` here?
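Danny's suggestion — create the timer inside `#execute` instead of threading an `Option<Timer>` through a new overload — has roughly this shape. The classes below are simplified stand-ins, not the actual `HoodieTimer`/executor types:

```java
import java.util.Optional;

public class ExecuteWithTimer {
  // Minimal stand-in for HoodieTimer: starts on creation, reports elapsed millis.
  static final class Timer {
    private final long startNanos = System.nanoTime();
    long endMillis() {
      return (System.nanoTime() - startNanos) / 1_000_000;
    }
  }

  // Stand-in for HoodieWriteMetadata carrying the measured duration.
  static final class WriteMetadata {
    Optional<Long> durationMs = Optional.empty();
  }

  // The timer is created inside execute(), so callers need no extra overload
  // taking Option<Timer>: the measurement stays an implementation detail.
  static WriteMetadata execute(Runnable work) {
    Timer timer = new Timer();
    work.run(); // pretend this is the tagging/indexing step being timed
    WriteMetadata result = new WriteMetadata();
    result.durationMs = Optional.of(timer.endMillis());
    return result;
  }

  public static void main(String[] args) {
    WriteMetadata md = execute(() -> { });
    System.out.println(md.durationMs.isPresent()); // true
  }
}
```

The trade-off raised in the review: passing the timer in lets callers time a wider span (source read plus index), while creating it inside keeps the API surface small.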
Re: [PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
hudi-bot commented on PR #11210: URL: https://github.com/apache/hudi/pull/11210#issuecomment-2109186737

## CI report:

* 4e922d2b74ebcf25fe8795aa15e2a99c1e082fe2 UNKNOWN

Bot commands — `@hudi-bot` supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-7752) Abstract serializeRecords for log writing
[ https://issues.apache.org/jira/browse/HUDI-7752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7752:
- Labels: hoodie-storage pull-request-available (was: hoodie-storage)

> Abstract serializeRecords for log writing
> -----------------------------------------
> Key: HUDI-7752
> URL: https://issues.apache.org/jira/browse/HUDI-7752
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Labels: hoodie-storage, pull-request-available
> Fix For: 0.15.0, 1.0.0

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] [HUDI-7752] Abstract serializeRecords for log writing [hudi]
yihua opened a new pull request, #11210: URL: https://github.com/apache/hudi/pull/11210

### Change Logs

This PR adds a new API, `serializeRecordsToLogBlock`, to the `FileFormatUtils` class (renamed from `BaseFileUtils`), to abstract the `serializeRecords` logic in `HoodieParquetDataBlock` and `HoodieHFileDataBlock`.

### Impact

Moves the Hadoop-dependent logic of serializing Hudi records into log block content to the `hudi-hadoop-common` module.

### Risk level

none

### Documentation Update

none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
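The abstraction this PR describes — a format-agnostic base class whose subclasses override the serialization hook per file format — follows the template shape below. Class and method bodies are simplified stand-ins for `FileFormatUtils` and its Parquet/HFile subclasses, not the actual Hudi implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class FormatUtilsSketch {
  // Format-agnostic contract: each file format turns records into log-block bytes.
  abstract static class FormatUtils {
    abstract byte[] serializeRecordsToLogBlock(List<String> records);
  }

  // One concrete format; real HFileUtils/ParquetUtils would emit HFile/Parquet bytes.
  static final class NewlineFormatUtils extends FormatUtils {
    @Override
    byte[] serializeRecordsToLogBlock(List<String> records) {
      return String.join("\n", records).getBytes(StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) {
    // Callers program against the base type; the format is an implementation detail.
    FormatUtils utils = new NewlineFormatUtils();
    byte[] block = utils.serializeRecordsToLogBlock(List.of("r1", "r2"));
    System.out.println(block.length > 0); // true
  }
}
```

This is what lets the Hadoop-dependent serialization live in `hudi-hadoop-common` while `HoodieParquetDataBlock`/`HoodieHFileDataBlock` call only the abstract API.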
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
hudi-bot commented on PR #11208: URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109174502

## CI report:

* ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
* 153de43462c5b4ac9762cb87e4ded68640995058 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23888)

Bot commands — `@hudi-bot` supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-6563]Supports flink lookup join [hudi]
hudi-bot commented on PR #9228: URL: https://github.com/apache/hudi/pull/9228#issuecomment-2109172992

## CI report:

* 8d29905fdeba6e5b81bdae7b0cdd1166511b1a1a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23889)

Bot commands — `@hudi-bot` supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [I] [SUPPORT]hudi way of doing bucket index cannot be used to improve query engines queries such join and filter [hudi]
ziudu commented on issue #11204: URL: https://github.com/apache/hudi/issues/11204#issuecomment-2109160408

Hi danny0405, I think support for Spark sort-merge-join of two Hudi tables with bucket optimization is an important feature. Currently, if we join two Hudi tables, the bucket index's bucket information is not usable by Spark, so a shuffle is always needed. As explained in [8657](https://github.com/apache/hudi/pull/8657), the following differ:

- hashing
- file naming
- file numbering
- file sorting

Unfortunately, according to https://issues.apache.org/jira/browse/SPARK-19256, Spark buckets are not compatible with Hive buckets yet. So if we have to choose one between Spark and Hive, I think Spark might be of higher priority.
Re: [I] [SUPPORT]FileID of partition path xxx=xx does not exist. [hudi]
CaesarWangX commented on issue #11202: URL: https://github.com/apache/hudi/issues/11202#issuecomment-2109155774 Hi @danny0405 @xushiyan, we are using Spark 3.4.1 and Hudi 0.14.0. I have updated the context; please help look into this. Thank you
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
jonvex commented on PR #11208: URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109151251 CI passing for commit https://github.com/apache/hudi/commit/153de43462c5b4ac9762cb87e4ded68640995058 (screenshot: https://github.com/apache/hudi/assets/26940621/806a9c81-a8c6-42f0-9838-07da27cb21e2)
Re: [PR] [HUDI-7624] Fixing index tagging duration [hudi]
hudi-bot commented on PR #11035: URL: https://github.com/apache/hudi/pull/11035#issuecomment-2109136722

## CI report:

* e0d1d604a6331759903f4e825499f89afaac1d00 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23880)
* 074845c216002fc00c28dcbb7720ffc05bdc7e8f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23891)
Re: [PR] [HUDI-7624] Fixing index tagging duration [hudi]
danny0405 commented on code in PR #11035: URL: https://github.com/apache/hudi/pull/11035#discussion_r1599280721

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java:

@@ -207,6 +210,13 @@ public Timer.Context getIndexCtx() { return indexTimer == null ? null : indexTimer.time(); }
+ public Timer.Context getPreWriteTimerCtx() {
+   if (config.isMetricsOn() && preWriteTimer == null) {
+     preWriteTimer = createTimer(preWriteTimerName);
+   }

Review Comment: +1 for `source_read_and_index`.
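The hunk under review creates the timer lazily and only when metrics are enabled, so a run with metrics off never allocates it. A stand-in sketch of that pattern (hypothetical `SimpleTimer` type and `LazyTimerDemo` class, not Hudi's `HoodieMetrics` or Dropwizard's `Timer`):

```java
// Sketch of lazy, metrics-gated timer creation (stand-in types, not Hudi's).
public class LazyTimerDemo {

    // Minimal stand-in for a metrics timer: time() returns a context whose
    // stop() reports elapsed nanoseconds.
    static final class SimpleTimer {
        final class Context {
            private final long start = System.nanoTime();
            long stop() { return System.nanoTime() - start; }
        }
        Context time() { return new Context(); }
    }

    private final boolean metricsOn;
    private SimpleTimer preWriteTimer;  // created lazily, only if metrics are on

    LazyTimerDemo(boolean metricsOn) { this.metricsOn = metricsOn; }

    // Mirrors the reviewed pattern: allocate on first use, return null when off.
    SimpleTimer.Context getPreWriteTimerCtx() {
        if (metricsOn && preWriteTimer == null) {
            preWriteTimer = new SimpleTimer();
        }
        return preWriteTimer == null ? null : preWriteTimer.time();
    }

    public static void main(String[] args) {
        LazyTimerDemo metrics = new LazyTimerDemo(true);
        SimpleTimer.Context ctx = metrics.getPreWriteTimerCtx();
        // ... the pre-write work being measured would run here ...
        System.out.println("pre_write took " + ctx.stop() + " ns");
    }
}
```

Callers must handle the `null` context when metrics are disabled, which is the same contract `getIndexCtx()` already follows in the quoted code.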
Re: [PR] [HUDI-7624] Fixing index tagging duration [hudi]
hudi-bot commented on PR #11035: URL: https://github.com/apache/hudi/pull/11035#issuecomment-2109130452

## CI report:

* e0d1d604a6331759903f4e825499f89afaac1d00 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23880)
* 074845c216002fc00c28dcbb7720ffc05bdc7e8f UNKNOWN
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
hudi-bot commented on PR #11208: URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109123695

## CI report:

* ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
* 22d30e4d7ff5da1ff2118e2d8bcb7373c4a8da88 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23885)
* 153de43462c5b4ac9762cb87e4ded68640995058 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23888)
Re: [I] [SUPPORT]FileID of partition path xxx=xx does not exist. [hudi]
CaesarWangX commented on issue #11202: URL: https://github.com/apache/hudi/issues/11202#issuecomment-2109118458 Hi @danny0405, we don't need the metadata table, so as I mentioned, we set metadata.enable=false. We are using Hudi on AWS EMR, so we don't have the chance to use Hudi 0.14.1.
Re: [I] [SUPPORT] In hudi 0.14.0, the hoodie.properties file is modified with each micro batch. [hudi]
CaesarWangX commented on issue #11200: URL: https://github.com/apache/hudi/issues/11200#issuecomment-2109114077 @ad1happy2go Thanks
(hudi) branch master updated (6627218f71f -> c15bdb34f89)
This is an automated email from the ASF dual-hosted git repository. jonvex pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

 from 6627218f71f [HUDI-7750] Move HoodieLogFormatWriter class to hoodie-hadoop-common module (#11207)
 add c15bdb34f89 remove a few classes from hudi-common (#11209)

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/avro/HoodieBloomFilterWriteSupport.java | 5 +++--
 .../java/org/apache/hudi/common/util/BaseFileUtils.java | 9 -
 .../org/apache/hudi/avro/HoodieAvroWriteSupport.java | 16 +++1
 .../apache/hudi/common/util/ParquetReaderIterator.java | 0
 .../org/apache/hudi/io/hadoop/HoodieAvroOrcWriter.java | 3 +--
 .../org/apache/hudi/io/storage/HoodieParquetConfig.java | 0
 .../hudi/common/util/TestParquetReaderIterator.java | 0
 .../apache/hudi/io/hadoop/TestHoodieOrcReaderWriter.java | 2 +-
 8 files changed, 16 insertions(+), 19 deletions(-)
 rename {hudi-common => hudi-hadoop-common}/src/main/java/org/apache/hudi/avro/HoodieAvroWriteSupport.java (82%)
 rename {hudi-common => hudi-hadoop-common}/src/main/java/org/apache/hudi/common/util/ParquetReaderIterator.java (100%)
 rename {hudi-common => hudi-hadoop-common}/src/main/java/org/apache/hudi/io/storage/HoodieParquetConfig.java (100%)
 rename {hudi-common => hudi-hadoop-common}/src/test/java/org/apache/hudi/common/util/TestParquetReaderIterator.java (100%)
Re: [PR] [HUDI-7754] Remove AvroWriteSupport and ParquetReaderIterator from hudi-common [hudi]
jonvex merged PR #11209: URL: https://github.com/apache/hudi/pull/11209
Re: [PR] [HUDI-7754] Remove AvroWriteSupport and ParquetReaderIterator from hudi-common [hudi]
jonvex commented on PR #11209: URL: https://github.com/apache/hudi/pull/11209#issuecomment-2109092003 CI passing (screenshot: https://github.com/apache/hudi/assets/26940621/9cc2d116-aae1-4226-b769-e39ec920c1c0)
Re: [PR] [HUDI-6150] Support bucketing for each hive client [hudi]
danny0405 commented on PR #8657: URL: https://github.com/apache/hudi/pull/8657#issuecomment-2109091522 cc @parisni Are you still on this?
[jira] [Updated] (HUDI-7752) Abstract serializeRecords for log writing
[ https://issues.apache.org/jira/browse/HUDI-7752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-7752: Story Points: 2 (was: 1)

> Abstract serializeRecords for log writing
> -
>
> Key: HUDI-7752
> URL: https://issues.apache.org/jira/browse/HUDI-7752
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Labels: hoodie-storage
> Fix For: 0.15.0, 1.0.0
>

-- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
hudi-bot commented on PR #11208: URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109080914

## CI report:

* 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
* ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
* 22d30e4d7ff5da1ff2118e2d8bcb7373c4a8da88 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23885)
* 153de43462c5b4ac9762cb87e4ded68640995058 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23888)
Re: [PR] [HUDI-6563]Supports flink lookup join [hudi]
hudi-bot commented on PR #9228: URL: https://github.com/apache/hudi/pull/9228#issuecomment-2109079462

## CI report:

* 55ceb8d72c2eb0e23b7763102959258101a363d1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23872)
* 8d29905fdeba6e5b81bdae7b0cdd1166511b1a1a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23889)
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
hudi-bot commented on PR #11208: URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109074759

## CI report:

* 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
* ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
* 22d30e4d7ff5da1ff2118e2d8bcb7373c4a8da88 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23885)
* 153de43462c5b4ac9762cb87e4ded68640995058 UNKNOWN
Re: [PR] [HUDI-6563]Supports flink lookup join [hudi]
hudi-bot commented on PR #9228: URL: https://github.com/apache/hudi/pull/9228#issuecomment-2109072779

## CI report:

* 55ceb8d72c2eb0e23b7763102959258101a363d1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23872)
* 8d29905fdeba6e5b81bdae7b0cdd1166511b1a1a UNKNOWN
Re: [I] [SUPPORT]FileID of partition path xxx=xx does not exist. [hudi]
danny0405 commented on issue #11202: URL: https://github.com/apache/hudi/issues/11202#issuecomment-2109071887 Did you use Hudi 0.14.0 release? Did you enable the metadata table?
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
hudi-bot commented on PR #11208: URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109067049

## CI report:

* 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
* ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
* 22d30e4d7ff5da1ff2118e2d8bcb7373c4a8da88 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23885)
(hudi) branch master updated (ea4f14c2851 -> 6627218f71f)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

 from ea4f14c2851 [HUDI-7744] Introduce IOFactory and a config to set the factory (#11192)
 add 6627218f71f [HUDI-7750] Move HoodieLogFormatWriter class to hoodie-hadoop-common module (#11207)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/common/table/log/HoodieLogFormat.java | 9 -
 .../hudi/common/table/log/HoodieLogFormatWriter.java | 15 ---
 2 files changed, 16 insertions(+), 8 deletions(-)
 rename {hudi-common => hudi-hadoop-common}/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormatWriter.java (96%)
Re: [PR] [HUDI-7750] Move HoodieLogFormatWriter class to hoodie-hadoop-common module [hudi]
yihua merged PR #11207: URL: https://github.com/apache/hudi/pull/11207
Re: [PR] [HUDI-7750] Move HoodieLogFormatWriter class to hoodie-hadoop-common module [hudi]
yihua commented on PR #11207: URL: https://github.com/apache/hudi/pull/11207#issuecomment-2109057451 Azure CI is green. (screenshot: https://github.com/apache/hudi/assets/2497195/784294ec-f41c-4078-819e-f183dd1e5559)
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
jonvex commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599229672 ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -58,27 +58,10 @@ public class HoodieHadoopStorage extends HoodieStorage { private final FileSystem fs; - public HoodieHadoopStorage(HoodieStorage storage) { Review Comment: Yeah, I made it the getRawStorage method below
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
jonvex commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599228005 ## hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieTestUtils.java: ## @@ -223,7 +222,8 @@ public static HoodieTableMetaClient createMetaClient(StorageConfiguration sto */ public static HoodieTableMetaClient createMetaClient(Configuration conf, String basePath) { -return createMetaClient(HoodieStorageUtils.getStorageConfWithCopy(conf), basePath); +return createMetaClient((StorageConfiguration) ReflectionUtils.loadClass(HADOOP_STORAGE_CONF, Review Comment: yeah
Re: [PR] [HUDI-7754] Remove AvroWriteSupport and ParquetReaderIterator from hudi-common [hudi]
hudi-bot commented on PR #11209: URL: https://github.com/apache/hudi/pull/11209#issuecomment-2109025201

## CI report:

* b72b023598810b9d81647fe33c1b0e7de7edf75e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23886)
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
hudi-bot commented on PR #11208: URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109025161

## CI report:

* 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
* ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
* 22d30e4d7ff5da1ff2118e2d8bcb7373c4a8da88 UNKNOWN
Re: [PR] [HUDI-7754] Remove AvroWriteSupport and ParquetReaderIterator from hudi-common [hudi]
hudi-bot commented on PR #11209: URL: https://github.com/apache/hudi/pull/11209#issuecomment-2109018806

## CI report:

* b72b023598810b9d81647fe33c1b0e7de7edf75e UNKNOWN
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
hudi-bot commented on PR #11208: URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109018760

## CI report:

* 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
* ccdd43869f325014e3be41bc41f8c510080dc549 UNKNOWN
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
hudi-bot commented on PR #11208: URL: https://github.com/apache/hudi/pull/11208#issuecomment-2109010558

## CI report:

* 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
jonvex commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599213734 ## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieIOFactory.java: ## @@ -48,4 +48,13 @@ private static HoodieIOFactory getIOFactory(String ioFactoryClass) { public abstract HoodieFileWriterFactory getWriterFactory(HoodieRecord.HoodieRecordType recordType); + public abstract HoodieStorage getStorage(StoragePath storagePath); + + public abstract HoodieStorage getStorage(StoragePath path, Review Comment: Maybe we can just pass `FileSystemRetryConfig`? I am not very familiar with what this is
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
yihua commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599209434 ## hudi-hadoop-common/src/main/java/org/apache/hudi/io/storage/HoodieHadoopIOFactory.java: ## @@ -19,28 +19,40 @@ package org.apache.hudi.io.storage; +import org.apache.hudi.common.fs.ConsistencyGuard; import org.apache.hudi.common.model.HoodieRecord; import org.apache.hudi.common.util.ReflectionUtils; import org.apache.hudi.exception.HoodieException; import org.apache.hudi.io.hadoop.HoodieAvroFileReaderFactory; import org.apache.hudi.io.hadoop.HoodieAvroFileWriterFactory; +import org.apache.hudi.storage.HoodieStorage; +import org.apache.hudi.storage.StorageConfiguration; +import org.apache.hudi.storage.StoragePath; +import org.apache.hudi.storage.hadoop.HoodieHadoopStorage; /** * Creates readers and writers for AVRO record payloads. * Currently uses reflection to support SPARK record payloads but * this ability should be removed with [HUDI-7746] */ public class HoodieHadoopIOFactory extends HoodieIOFactory { + protected final StorageConfiguration storageConf; Review Comment: Can the `storageConf` member be put into `HoodieIOFactory`? ## hudi-hadoop-common/src/main/java/org/apache/hudi/storage/hadoop/HoodieHadoopStorage.java: ## @@ -58,27 +58,10 @@ public class HoodieHadoopStorage extends HoodieStorage { private final FileSystem fs; - public HoodieHadoopStorage(HoodieStorage storage) { Review Comment: Is this moved to somewhere else? 
## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieHFileRecordReader.java: ## @@ -59,8 +59,8 @@ public HoodieHFileRecordReader(Configuration conf, InputSplit split, JobConf job StoragePath path = convertToStoragePath(fileSplit.getPath()); StorageConfiguration storageConf = HadoopFSUtils.getStorageConf(conf); HoodieConfig hoodieConfig = getReaderConfigs(storageConf); -reader = HoodieIOFactory.getIOFactory(storageConf).getReaderFactory(HoodieRecord.HoodieRecordType.AVRO) -.getFileReader(hoodieConfig, HadoopFSUtils.getStorageConf(conf), path, HoodieFileFormat.HFILE, Option.empty()); +reader = HoodieIOFactory.getIOFactory(HadoopFSUtils.getStorageConf(conf)).getReaderFactory(HoodieRecord.HoodieRecordType.AVRO) Review Comment: nit: use `storageConf` directly? ## hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java: ## @@ -312,8 +312,8 @@ public static Schema addPartitionFields(Schema schema, List partitioning public static HoodieFileReader getBaseFileReader(Path path, JobConf conf) throws IOException { StorageConfiguration storageConf = HadoopFSUtils.getStorageConf(conf); HoodieConfig hoodieConfig = getReaderConfigs(storageConf); -return HoodieIOFactory.getIOFactory(storageConf).getReaderFactory(HoodieRecord.HoodieRecordType.AVRO) -.getFileReader(hoodieConfig, HadoopFSUtils.getStorageConf(conf), convertToStoragePath(path)); +return HoodieIOFactory.getIOFactory(HadoopFSUtils.getStorageConf(conf)).getReaderFactory(HoodieRecord.HoodieRecordType.AVRO) Review Comment: Same here for using `storageConf`
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
jonvex commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599210706 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java: ## @@ -470,9 +470,9 @@ public void performMergeDataValidationCheck(WriteStatus writeStatus) { } long oldNumWrites = 0; -try (HoodieFileReader reader = HoodieIOFactory.getIOFactory(storage.getConf()) +try (HoodieFileReader reader = HoodieIOFactory.getIOFactory(hoodieTable.getStorageConf()) Review Comment: yes. storage.getConf() uses the fs which might use the cached conf as we discussed before with the issues with iofactory class config going missing.
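The "cached conf" problem mentioned above comes from caches keyed by something other than the configuration itself (Hadoop's `FileSystem` cache is keyed by scheme, authority, and user, not by the `Configuration` contents), so a setting added after the first lookup is invisible to callers that read the conf back off the cached instance. A toy illustration of that pitfall (stand-in `Conf`/`ToyFileSystem` types, not Hadoop or Hudi code; the config key is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of a config-caching pitfall: a FileSystem-style cache
// keyed only by URI scheme returns the instance created first, carrying the
// configuration it was created with, so later config entries "go missing".
// Stand-in types; NOT Hadoop's FileSystem cache or Hudi code.
public class CachedConfDemo {

    static final class Conf extends HashMap<String, String> { }

    static final class ToyFileSystem {
        final Conf conf;
        ToyFileSystem(Conf conf) { this.conf = conf; }
    }

    // Cache keyed by scheme only; the Conf contents are ignored by the key.
    private static final Map<String, ToyFileSystem> CACHE = new HashMap<>();

    static ToyFileSystem get(String scheme, Conf conf) {
        return CACHE.computeIfAbsent(scheme, s -> new ToyFileSystem(conf));
    }

    public static void main(String[] args) {
        Conf first = new Conf();
        ToyFileSystem fs1 = get("toyfs", first);

        Conf second = new Conf();
        second.put("io.factory.class", "SomeIOFactory"); // hypothetical key
        ToyFileSystem fs2 = get("toyfs", second);

        // Same cached instance: the setting added to `second` is not visible.
        System.out.println(fs1 == fs2);                          // true
        System.out.println(fs2.conf.get("io.factory.class"));    // null
    }
}
```

Reading the conf from a handle that does not go through the cache, as the diff above does, sidesteps the stale-configuration issue.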
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
jonvex commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599210077 ## hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroFileReaderFactory.java: ## @@ -36,60 +34,48 @@ import java.io.IOException; public class HoodieAvroFileReaderFactory extends HoodieFileReaderFactory { - public static final String HBASE_AVRO_HFILE_READER = "org.apache.hudi.io.hadoop.HoodieHBaseAvroHFileReader"; + + public HoodieAvroFileReaderFactory(StorageConfiguration storageConf) { +super(storageConf); + } @Override - protected HoodieFileReader newParquetFileReader(StorageConfiguration conf, StoragePath path) { -return new HoodieAvroParquetReader(conf, path); + protected HoodieFileReader newParquetFileReader(StoragePath path) { +return new HoodieAvroParquetReader(storageConf, path); } @Override protected HoodieFileReader newHFileFileReader(HoodieConfig hoodieConfig, -StorageConfiguration conf, StoragePath path, Option schemaOption) throws IOException { if (isUseNativeHFileReaderEnabled(hoodieConfig)) { - return new HoodieNativeAvroHFileReader(conf, path, schemaOption); + return new HoodieNativeAvroHFileReader(storageConf, path, schemaOption); } -try { - if (schemaOption.isPresent()) { -return (HoodieFileReader) ReflectionUtils.loadClass(HBASE_AVRO_HFILE_READER, -new Class[] {StorageConfiguration.class, StoragePath.class, Option.class}, conf, path, schemaOption); - } - return (HoodieFileReader) ReflectionUtils.loadClass(HBASE_AVRO_HFILE_READER, - new Class[] {StorageConfiguration.class, StoragePath.class}, conf, path); -} catch (HoodieException e) { - throw new IOException("Cannot instantiate HoodieHBaseAvroHFileReader", e); +if (schemaOption.isPresent()) { + return new HoodieHBaseAvroHFileReader(storageConf, path, schemaOption); Review Comment: checked, and I don't think there is anything else that we missed
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
yihua commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599206501 ## hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroFileReaderFactory.java: ## @@ -36,60 +34,48 @@ import java.io.IOException; public class HoodieAvroFileReaderFactory extends HoodieFileReaderFactory { - public static final String HBASE_AVRO_HFILE_READER = "org.apache.hudi.io.hadoop.HoodieHBaseAvroHFileReader"; + + public HoodieAvroFileReaderFactory(StorageConfiguration storageConf) { +super(storageConf); + } @Override - protected HoodieFileReader newParquetFileReader(StorageConfiguration conf, StoragePath path) { -return new HoodieAvroParquetReader(conf, path); + protected HoodieFileReader newParquetFileReader(StoragePath path) { +return new HoodieAvroParquetReader(storageConf, path); } @Override protected HoodieFileReader newHFileFileReader(HoodieConfig hoodieConfig, -StorageConfiguration conf, StoragePath path, Option schemaOption) throws IOException { if (isUseNativeHFileReaderEnabled(hoodieConfig)) { - return new HoodieNativeAvroHFileReader(conf, path, schemaOption); + return new HoodieNativeAvroHFileReader(storageConf, path, schemaOption); } -try { - if (schemaOption.isPresent()) { -return (HoodieFileReader) ReflectionUtils.loadClass(HBASE_AVRO_HFILE_READER, -new Class[] {StorageConfiguration.class, StoragePath.class, Option.class}, conf, path, schemaOption); - } - return (HoodieFileReader) ReflectionUtils.loadClass(HBASE_AVRO_HFILE_READER, - new Class[] {StorageConfiguration.class, StoragePath.class}, conf, path); -} catch (HoodieException e) { - throw new IOException("Cannot instantiate HoodieHBaseAvroHFileReader", e); +if (schemaOption.isPresent()) { + return new HoodieHBaseAvroHFileReader(storageConf, path, schemaOption); Review Comment: Good catch! Could you check separately on all the reflection usage on `HoodieStorage`, readers and writers and see if they are still needed? 
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
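The review above removes a `ReflectionUtils.loadClass` indirection in favor of a direct constructor call, now that the HFile reader lives in the same module as the factory. A minimal, self-contained Java sketch of the two styles (the `Reader`/`HFileReader` names are illustrative stand-ins, not Hudi's classes):

```java
import java.lang.reflect.Constructor;

interface Reader {
    String describe();
}

class HFileReader implements Reader {
    private final String path;
    HFileReader(String path) { this.path = path; }
    public String describe() { return "HFileReader(" + path + ")"; }
}

public class FactorySketch {
    // Old style: reflection hides the compile-time dependency, useful only
    // while the target class lives in a module the factory cannot see.
    static Reader viaReflection(String className, String path) throws Exception {
        Constructor<?> ctor = Class.forName(className).getDeclaredConstructor(String.class);
        return (Reader) ctor.newInstance(path);
    }

    // New style: direct instantiation, possible once the class is moved next
    // to the factory, as in the PR's hudi-hadoop-common refactoring.
    static Reader direct(String path) {
        return new HFileReader(path);
    }

    public static void main(String[] args) throws Exception {
        Reader a = viaReflection("HFileReader", "/tmp/a.hfile");
        Reader b = direct("/tmp/a.hfile");
        System.out.println(a.describe().equals(b.describe())); // prints true
    }
}
```

Direct construction also turns the `HoodieException`-wrapping `try/catch` into ordinary compile-time checking, which is why the reviewer asks whether any reflection usage is still needed at all.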
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
yihua commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599192983

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java:

```diff
@@ -105,4 +109,4 @@ private static HoodieRowParquetWriteSupport getHoodieRowParquetWriteSupport(Stor
     StructType structType = HoodieInternalRowUtils.getCachedSchema(schema);
     return HoodieRowParquetWriteSupport.getHoodieRowParquetWriteSupport(conf.unwrapAs(Configuration.class), structType, filter, config);
   }
-}
+}
```

Review Comment: nit: keep the new line

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkIOFactory.java:

```diff
@@ -20,30 +20,34 @@
 package org.apache.hudi.io.storage;

 import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.storage.StorageConfiguration;

 /**
  * Creates readers and writers for SPARK and AVRO record payloads
  */
 public class HoodieSparkIOFactory extends HoodieHadoopIOFactory {

-  private static final HoodieSparkIOFactory HOODIE_SPARK_IO_FACTORY = new HoodieSparkIOFactory();

-  public static HoodieSparkIOFactory getHoodieSparkIOFactory() {
-    return HOODIE_SPARK_IO_FACTORY;
+  public HoodieSparkIOFactory(StorageConfiguration storageConf) {
+    super(storageConf);
+  }
+
+  public static HoodieSparkIOFactory getHoodieSparkIOFactory(StorageConfiguration storageConf) {
+    return new HoodieSparkIOFactory(storageConf);
   }

   @Override
   public HoodieFileReaderFactory getReaderFactory(HoodieRecord.HoodieRecordType recordType) {
     if (recordType == HoodieRecord.HoodieRecordType.SPARK) {
-      return new HoodieSparkFileReaderFactory();
+      return new HoodieSparkFileReaderFactory(storageConf);
     }
     return super.getReaderFactory(recordType);
   }

   @Override
   public HoodieFileWriterFactory getWriterFactory(HoodieRecord.HoodieRecordType recordType) {
     if (recordType == HoodieRecord.HoodieRecordType.SPARK) {
-      return new HoodieSparkFileWriterFactory();
+      return new HoodieSparkFileWriterFactory(storageConf);
     }
     return super.getWriterFactory(recordType);
   }
-}
+}
```

Review Comment: Similar here for all files.

## hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieTestUtils.java:

```diff
@@ -223,7 +222,8 @@ public static HoodieTableMetaClient createMetaClient(StorageConfiguration sto
    */
   public static HoodieTableMetaClient createMetaClient(Configuration conf, String basePath) {
-    return createMetaClient(HoodieStorageUtils.getStorageConfWithCopy(conf), basePath);
+    return createMetaClient((StorageConfiguration) ReflectionUtils.loadClass(HADOOP_STORAGE_CONF,
```

Review Comment: Should this static method be moved to `hudi-hadoop-common` and directly use the constructor of `HadoopStorageConfiguration`?

## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieFileWriterFactory.java:

```diff
@@ -120,4 +125,4 @@ public static boolean enableBloomFilter(boolean populateMetaFields, HoodieConfig
     // so the class HoodieIndexConfig cannot be accessed in hudi-common, otherwise there will be a circular dependency problem
         || (config.contains("hoodie.index.type") && config.getString("hoodie.index.type").contains("BLOOM")));
   }
-}
+}
```

Review Comment: nit: keep new line.

## hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieIOFactory.java:

```diff
@@ -48,4 +48,13 @@ private static HoodieIOFactory getIOFactory(String ioFactoryClass) {

   public abstract HoodieFileWriterFactory getWriterFactory(HoodieRecord.HoodieRecordType recordType);

+  public abstract HoodieStorage getStorage(StoragePath storagePath);
+
+  public abstract HoodieStorage getStorage(StoragePath path,
```

Review Comment: How much effort is required to remove this API?
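The `HoodieSparkIOFactory` diff above replaces a process-wide static singleton with a factory constructed per storage configuration. A small self-contained Java sketch of why that matters (all names here are illustrative stand-ins, not Hudi's API):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a storage configuration carrying per-table settings.
class StorageConf {
    private final Map<String, String> props = new HashMap<>();
    StorageConf set(String k, String v) { props.put(k, v); return this; }
    String get(String k) { return props.get(k); }
}

// Factory parameterized by its configuration: factory methods no longer need
// a conf argument because they read the instance field.
class SketchIOFactory {
    final StorageConf conf;
    SketchIOFactory(StorageConf conf) { this.conf = conf; }
    String newReaderDescription(String path) {
        return "reader(" + path + ", caseSensitive=" + conf.get("caseSensitive") + ")";
    }
}

public class FactoryConfSketch {
    public static void main(String[] args) {
        // Two factories with independent configurations in the same JVM --
        // not possible with the old shared static singleton.
        SketchIOFactory a = new SketchIOFactory(new StorageConf().set("caseSensitive", "true"));
        SketchIOFactory b = new SketchIOFactory(new StorageConf().set("caseSensitive", "false"));
        System.out.println(a.newReaderDescription("/p"));
        System.out.println(b.newReaderDescription("/p"));
    }
}
```

This is presumably the motivation for threading `storageConf` through the constructors in the PR: each table or engine context can then carry its own settings without global state.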
[PR] [HUDI-7754] Remove AvroWriteSupport and ParquetReaderIterator from hudi-common [hudi]
jonvex opened a new pull request, #11209: URL: https://github.com/apache/hudi/pull/11209

### Change Logs

These classes need to be removed from hudi-common because they have hadoop deps.

### Impact

Remove hadoop from hudi-common.

### Risk level (write none, low medium or high below)

low

### Documentation Update

N/A

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[jira] [Updated] (HUDI-7754) Remove AvroWriteSupport and ParquetReaderIterator from hudi-common
[ https://issues.apache.org/jira/browse/HUDI-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7754:
---------------------------------
    Labels: pull-request-available  (was: )

> Remove AvroWriteSupport and ParquetReaderIterator from hudi-common
> ------------------------------------------------------------------
>
>                 Key: HUDI-7754
>                 URL: https://issues.apache.org/jira/browse/HUDI-7754
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Jonathan Vexler
>            Assignee: Jonathan Vexler
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.15.0, 1.0.0
>
> 2 classes with hadoop deps that can be moved to hadoop common and aren't covered by other prs.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7754) Remove AvroWriteSupport and ParquetReaderIterator from hudi-common
Jonathan Vexler created HUDI-7754:
-------------------------------------

             Summary: Remove AvroWriteSupport and ParquetReaderIterator from hudi-common
                 Key: HUDI-7754
                 URL: https://issues.apache.org/jira/browse/HUDI-7754
             Project: Apache Hudi
          Issue Type: Task
            Reporter: Jonathan Vexler
            Assignee: Jonathan Vexler
             Fix For: 0.15.0, 1.0.0

2 classes with hadoop deps that can be moved to hadoop common and aren't covered by other prs.
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
yihua commented on code in PR #11208: URL: https://github.com/apache/hudi/pull/11208#discussion_r1599190553

## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileReaderFactory.java:

```diff
@@ -31,34 +31,37 @@
 public class HoodieSparkFileReaderFactory extends HoodieFileReaderFactory {

+  public HoodieSparkFileReaderFactory(StorageConfiguration storageConf) {
+    super(storageConf);
+  }
+
   @Override
-  public HoodieFileReader newParquetFileReader(StorageConfiguration conf, StoragePath path) {
-    conf.setIfUnset(SQLConf.PARQUET_BINARY_AS_STRING().key(), SQLConf.PARQUET_BINARY_AS_STRING().defaultValueString());
-    conf.setIfUnset(SQLConf.PARQUET_INT96_AS_TIMESTAMP().key(), SQLConf.PARQUET_INT96_AS_TIMESTAMP().defaultValueString());
-    conf.setIfUnset(SQLConf.CASE_SENSITIVE().key(), SQLConf.CASE_SENSITIVE().defaultValueString());
+  public HoodieFileReader newParquetFileReader(StoragePath path) {
+    storageConf.setIfUnset(SQLConf.PARQUET_BINARY_AS_STRING().key(), SQLConf.PARQUET_BINARY_AS_STRING().defaultValueString());
+    storageConf.setIfUnset(SQLConf.PARQUET_INT96_AS_TIMESTAMP().key(), SQLConf.PARQUET_INT96_AS_TIMESTAMP().defaultValueString());
+    storageConf.setIfUnset(SQLConf.CASE_SENSITIVE().key(), SQLConf.CASE_SENSITIVE().defaultValueString());
     // Using string value of this conf to preserve compatibility across spark versions.
-    conf.setIfUnset("spark.sql.legacy.parquet.nanosAsLong", "false");
+    storageConf.setIfUnset("spark.sql.legacy.parquet.nanosAsLong", "false");
     // This is a required config since Spark 3.4.0: SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED
     // Using string value of this conf to preserve compatibility across spark versions.
-    conf.setIfUnset("spark.sql.parquet.inferTimestampNTZ.enabled", "true");
-    return new HoodieSparkParquetReader(conf, path);
+    storageConf.setIfUnset("spark.sql.parquet.inferTimestampNTZ.enabled", "true");
+    return new HoodieSparkParquetReader(storageConf, path);
   }

   @Override
   protected HoodieFileReader newHFileFileReader(HoodieConfig hoodieConfig,
-                                                StorageConfiguration conf,
                                                 StoragePath path,
                                                 Option schemaOption) throws IOException {
     throw new HoodieIOException("Not support read HFile");
   }

   @Override
-  protected HoodieFileReader newOrcFileReader(StorageConfiguration conf, StoragePath path) {
+  protected HoodieFileReader newOrcFileReader(StoragePath path) {
     throw new HoodieIOException("Not support read orc file");
   }

   @Override
   public HoodieFileReader newBootstrapFileReader(HoodieFileReader skeletonFileReader, HoodieFileReader dataFileReader, Option partitionFields, Object[] partitionValues) {
     return new HoodieSparkBootstrapFileReader(skeletonFileReader, dataFileReader, partitionFields, partitionValues);
   }
-}
+}
```

Review Comment: nit: keep the new line.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:

```diff
@@ -470,9 +470,9 @@ public void performMergeDataValidationCheck(WriteStatus writeStatus) {
     }

     long oldNumWrites = 0;
-    try (HoodieFileReader reader = HoodieIOFactory.getIOFactory(storage.getConf())
+    try (HoodieFileReader reader = HoodieIOFactory.getIOFactory(hoodieTable.getStorageConf())
```

Review Comment: Is there a difference between `storage.getConf()` and `hoodieTable.getStorageConf()`? Probably we can remove `HoodieStorage` and `FileSystem` instances in the `HoodieIOHandle` (in a separate PR).
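The reader-factory diff above leans on a `setIfUnset` pattern: a default is applied only when the user has not already configured the key, so explicit overrides survive. A minimal self-contained sketch (the `Conf` class is a stand-in, not Hudi's `StorageConfiguration`):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in configuration with set-if-absent semantics.
class Conf {
    private final Map<String, String> props = new HashMap<>();
    void set(String k, String v) { props.put(k, v); }
    void setIfUnset(String k, String v) { props.putIfAbsent(k, v); }
    String get(String k) { return props.get(k); }
}

public class SetIfUnsetSketch {
    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("spark.sql.caseSensitive", "true");                  // explicit user setting
        conf.setIfUnset("spark.sql.caseSensitive", "false");          // default is ignored
        conf.setIfUnset("spark.sql.parquet.binaryAsString", "false"); // default applies
        System.out.println(conf.get("spark.sql.caseSensitive"));          // prints true
        System.out.println(conf.get("spark.sql.parquet.binaryAsString")); // prints false
    }
}
```

This also explains the string-valued keys in the diff: writing `"spark.sql.legacy.parquet.nanosAsLong"` literally, rather than through `SQLConf`, keeps the call compatible across Spark versions where the constant may not exist.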
Re: [PR] [HUDI-7589] Add API to create HoodieStorage in HoodieIOFactory [hudi]
hudi-bot commented on PR #11208: URL: https://github.com/apache/hudi/pull/11208#issuecomment-2108947219

## CI report:

* 3df79138ad8473c1c5aef458a6a46ecbf1879e3d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23884)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7750] Move HoodieLogFormatWriter class to hoodie-hadoop-common module [hudi]
hudi-bot commented on PR #11207: URL: https://github.com/apache/hudi/pull/11207#issuecomment-2108947189

## CI report:

* 7f64c7a7d1cac235655a24ca1ff5494d5afa7a86 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23883)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]
yihua commented on code in PR #10922: URL: https://github.com/apache/hudi/pull/10922#discussion_r1599188828

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:

```diff
@@ -654,22 +640,6 @@ private HoodieLogBlock.HoodieLogBlockType pickLogDataBlockFormat() {
     }
   }

-  private static Map getUpdatedHeader(Map header, int blockSequenceNumber, long attemptNumber,
-                                      HoodieWriteConfig config, boolean addBlockIdentifier) {
-    Map updatedHeader = new HashMap<>(header);
-    if (addBlockIdentifier && !HoodieTableMetadata.isMetadataTable(config.getBasePath())) { // add block sequence numbers only for data table.
-      updatedHeader.put(HeaderMetadataType.BLOCK_IDENTIFIER, attemptNumber + "," + blockSequenceNumber);
-    }
-    if (config.shouldWritePartialUpdates()) {
```

Review Comment: @jonvex Good catch! Could you fix it?
Re: [PR] [HUDI-7750] Move HoodieLogFormatWriter class to hoodie-hadoop-common module [hudi]
hudi-bot commented on PR #11207: URL: https://github.com/apache/hudi/pull/11207#issuecomment-2108939810

## CI report:

* 7f64c7a7d1cac235655a24ca1ff5494d5afa7a86 UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7549] Reverting spurious log block deduction with LogRecordReader [hudi]
jonvex commented on code in PR #10922: URL: https://github.com/apache/hudi/pull/10922#discussion_r1599181476

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:

```diff
@@ -654,22 +640,6 @@ private HoodieLogBlock.HoodieLogBlockType pickLogDataBlockFormat() {
     }
   }

-  private static Map getUpdatedHeader(Map header, int blockSequenceNumber, long attemptNumber,
-                                      HoodieWriteConfig config, boolean addBlockIdentifier) {
-    Map updatedHeader = new HashMap<>(header);
-    if (addBlockIdentifier && !HoodieTableMetadata.isMetadataTable(config.getBasePath())) { // add block sequence numbers only for data table.
-      updatedHeader.put(HeaderMetadataType.BLOCK_IDENTIFIER, attemptNumber + "," + blockSequenceNumber);
-    }
-    if (config.shouldWritePartialUpdates()) {
```

Review Comment: @nsivabalan we need to keep this part with the partial update flag
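For context on the trade-off in this thread: the reverted helper stamped two kinds of headers on a log block, and the review asks to retain only the partial-update branch. A hedged Java sketch of that shape (stand-in types, not Hudi's actual `HoodieLogBlock` header API):

```java
import java.util.HashMap;
import java.util.Map;

public class HeaderSketch {
    // Stand-in for HoodieLogBlock.HeaderMetadataType keys.
    enum HeaderType { INSTANT_TIME, BLOCK_IDENTIFIER, IS_PARTIAL }

    // Sketch: the block-identifier branch is dropped (the revert), while the
    // partial-update flag is still stamped on the header when enabled.
    static Map<HeaderType, String> updatedHeader(Map<HeaderType, String> header,
                                                 boolean writePartialUpdates) {
        Map<HeaderType, String> updated = new HashMap<>(header);
        if (writePartialUpdates) {
            updated.put(HeaderType.IS_PARTIAL, "true");
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<HeaderType, String> base = new HashMap<>();
        base.put(HeaderType.INSTANT_TIME, "20240513000000");
        System.out.println(updatedHeader(base, true).containsKey(HeaderType.IS_PARTIAL));  // prints true
        System.out.println(updatedHeader(base, false).containsKey(HeaderType.IS_PARTIAL)); // prints false
    }
}
```

The point of the exchange is that a blanket revert of `getUpdatedHeader` would silently drop the partial-updates flag along with the block sequence numbers, which is what jonvex is flagging.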