Re: [I] File not found while using metadata table for insert_overwrite table [hudi]
nsivabalan commented on issue #10628: URL: https://github.com/apache/hudi/issues/10628#issuecomment-2043996684

Hey @ad1happy2go: if this turns out to be an MDT data consistency issue, do keep me posted. Thanks.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
ad1happy2go commented on issue #10628: URL: https://github.com/apache/hudi/issues/10628#issuecomment-1980861181

@shravanak Are you still facing this issue? Let us know in case you need help here.
ad1happy2go commented on issue #10628: URL: https://github.com/apache/hudi/issues/10628#issuecomment-1943189279

@shravanak That may well be the cause. Did you face this issue with other tables as well?
shravanak commented on issue #10628: URL: https://github.com/apache/hudi/issues/10628#issuecomment-1933499929

We are using insert write mode with Hudi 0.14.0. I think the file or partition it is reporting as missing might be from before we upgraded to 0.14.0; we were previously on 0.12.2.
ad1happy2go commented on issue #10628: URL: https://github.com/apache/hudi/issues/10628#issuecomment-1933440282

@Shubham21k Code link here - https://gist.github.com/ad1happy2go/364e66c4fa84229110f28994cc4a277f

Async services are meant to run with streaming workloads like Hudi Streamer, so that table services can run asynchronously and don't block the ingestion of the next micro-batch. Using them with data source (batch) writers doesn't make sense; inline table services will be kicked in instead.

@shravanak Which Hudi version are you using? Are you also using insert_overwrite? Can you elaborate?
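[Editor's note] The distinction drawn in the comment above (async table services apply to streaming writers, while batch datasource writers fall back to inline services) can be sketched as follows. This is an illustration only; the option keys are taken from the configuration posted in this issue, and the claim that async flags are effectively ignored by batch writers comes from the comment itself:

```python
# Sketch: async table-service flags are meaningful for long-running streaming
# writers such as Hudi Streamer. Per the comment above, a batch df.write job
# runs cleaning and archival inline instead, so the async flags can be dropped.
streaming_service_opts = {
    "hoodie.clean.automatic": "true",
    "hoodie.clean.async": "true",      # honored by streaming writers
    "hoodie.archive.automatic": "true",
    "hoodie.archive.async": "true",
}

# Equivalent intent for a batch datasource writer: same automatic services,
# async flags turned off since the services run inline with each commit.
batch_service_opts = {k: ("false" if k.endswith(".async") else v)
                      for k, v in streaming_service_opts.items()}
```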
shravanak commented on issue #10628: URL: https://github.com/apache/hudi/issues/10628#issuecomment-1933190181

Is there a workaround for this issue? We are facing a similar issue as well.
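[Editor's note] The issue description itself reports a read-side workaround: "upon disabling the metadata listing on the read side, there is no error and reads work fine." A minimal sketch of that read path follows; treat it as an illustration, not an official fix. The Spark call is left commented because it needs a live SparkSession, and the table path is the one from the issue:

```python
# Workaround sketch: read the Hudi table with metadata-table listing disabled,
# falling back to direct file listing, as described in the issue body.
read_opts = {
    "hoodie.metadata.enable": "false",  # disable metadata listing on read
}

# Requires a running SparkSession; shown commented for illustration:
# df = (spark.read.format("hudi")
#       .options(**read_opts)
#       .load("s3://tmp-data/investments_ctr_tbl"))
```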
Shubham21k commented on issue #10628: URL: https://github.com/apache/hudi/issues/10628#issuecomment-1931428449

@ad1happy2go
1. All the select queries are failing due to this error. As suggested, we will try it and check whether this gets fixed in 0.14.1.
2. 'async services doesn't work with Datasource writer' - can you elaborate more, as we have observed the cleaner & archival taking place with these configurations?
ad1happy2go commented on issue #10628: URL: https://github.com/apache/hudi/issues/10628#issuecomment-1930407268

@Shubham21k What queries are you trying on this data? Does `select *` work? For point-in-time queries, this error is expected in case the commit has been cleaned but not archived.
ad1happy2go commented on issue #10628: URL: https://github.com/apache/hudi/issues/10628#issuecomment-1930403617

@Shubham21k Can you try with 0.14.1 once? Also, async services don't work with the Datasource writer. I tried to reproduce this but was unable to. Can you check whether you can enhance the same and reproduce it? Code here - https://gist.github.com/ad1happy2go/364e66c4fa84229110f28994cc4a277f/edit
[I] File not found while using metadata table for insert_overwrite table [hudi]
Shubham21k opened a new issue, #10628: URL: https://github.com/apache/hudi/issues/10628

We are incrementally writing to a Hudi table with insert_overwrite operations. Recently, we enabled the Hudi metadata table for these tables. However, after a few days we started encountering a `FileNotFoundException` while reading these tables from Athena (with metadata listing enabled). Upon further investigation, we observed that the metadata contains older files that were cleaned up by the cleaner and are no longer available.

Steps to reproduce the behavior:

1. Create a simple df and write to a Hudi table incrementally with these properties:

```
hoodie.datasource.meta.sync.enable=true
hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
hoodie.write.markers.type=DIRECT
hoodie.metadata.enable=true
hoodie.datasource.write.operation=insert_overwrite
hoodie.datasource.write.partitionpath.field=cs_load_hr
hoodie.datasource.hive_sync.partition_fields=cs_load_hr
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd/HH
hoodie.deltastreamer.source.hoodieincr.partition.extractor.class=org.apache.hudi.hive.SlashEncodedHourPartitionValueExtractor
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedHourPartitionValueExtractor
hoodie.parquet.compression.codec=snappy
hoodie.table.services.enabled=true
hoodie.rollback.using.markers=false
hoodie.commits.archival.batch=30
hoodie.archive.delete.parallelism=500
hoodie.index.type=SIMPLE
hoodie.clean.allow.multiple=false
hoodie.clean.async=true
hoodie.clean.automatic=true
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=3
hoodie.cleaner.parallelism=500
hoodie.cleaner.incremental.mode=true
hoodie.clean.max.commits=8
hoodie.archive.async=true
hoodie.archive.automatic=true
hoodie.archive.merge.enable=true
hoodie.archive.merge.files.batch.size=60
hoodie.keep.max.commits=10
hoodie.keep.min.commits=5
```

```
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiOutputTablePath)
```

2. After a few incremental writes, some of the base files get updated but the metadata does not get updated properly; it continues to keep pointers to the old files as well.
3. If you try reading the table using Spark or Athena, you will get a FileNotFoundException (keep in mind to enable metadata while reading). Upon disabling the metadata listing on the read side, there is no error and reads work fine.
4. Note: we have observed this issue only for **insert_overwrite** operations. An upsert table's metadata gets updated correctly.

**Expected behavior**

It is expected that the hoodie metadata gets updated correctly.

**Environment Description**

* Hudi version : 0.13.1
* Spark version : 3.2.1
* Hive version : NA
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) :

**Additional context**

The timeline also contains replaceCommits for corrupted tables
(which are not present in case of upsert table)

```
$ aws s3 ls s3://tmp-data/investments_ctr_tbl/.hoodie/
                           PRE .aux/
                           PRE archived/
                           PRE metadata/
2023-12-08 13:32:17          0 .aux_$folder$
2023-12-08 13:32:17          0 .schema_$folder$
2023-12-08 13:32:17          0 .temp_$folder$
2023-12-14 22:17:18       4678 20231214221641350.clean
2023-12-14 22:17:11       3227 20231214221641350.clean.inflight
2023-12-14 22:17:10       3227 20231214221641350.clean.requested
2023-12-22 21:50:54       4439 2023114849300.clean
2023-12-22 21:50:45       4337 2023114849300.clean.inflight
2023-12-22 21:50:45       4337 2023114849300.clean.requested
2023-12-30 21:51:16       4439 20231230214431936.clean
2023-12-30 21:51:07       4337 20231230214431936.clean.inflight
2023-12-30 21:51:07       4337 20231230214431936.clean.requested
2024-01-07 21:53:30       4439 20240107215204594.clean
2024-01-07 21:53:23       4337 20240107215204594.clean.inflight
2024-01-07 21:53:22       4337 20240107215204594.clean.requested
2024-01-15 21:55:00       4439 20240115215112126.clean
2024-01-15 21:54:52       4337 20240115215112126.clean.inflight
2024-01-15 21:54:52       4337 20240115215112126.clean.requested
2024-01-23 21:46:53       4439 20240123214442067.clean
2024-01-23 21:46:45       4337 20240123214442067.clean.inflight
2024-01-23 21:46:45       4337 2024012321
```
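[Editor's note] The reproduction steps above can be condensed into a minimal PySpark-style sketch. The option values are copied from the configuration posted in the issue (trimmed to the ones most relevant to the failure: insert_overwrite plus an enabled metadata table and aggressive cleaning); the DataFrame and SparkSession are placeholders, and the write call is left commented because it needs a running cluster:

```python
# Minimal sketch of the failing write pattern from this issue: incremental
# insert_overwrite writes with the metadata table enabled. Keys and values
# are from the issue's config; df and the table path are placeholders.
hudi_options = {
    "hoodie.datasource.write.operation": "insert_overwrite",
    "hoodie.metadata.enable": "true",
    "hoodie.datasource.write.partitionpath.field": "cs_load_hr",
    "hoodie.index.type": "SIMPLE",
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "3",  # cleans base files aggressively
}

# Requires a SparkSession and a real DataFrame; shown commented:
# (df.write.format("org.apache.hudi")
#    .options(**hudi_options)
#    .mode("append")
#    .save("s3://tmp-data/investments_ctr_tbl"))
```

After a few such writes, the issue reports that reads with metadata listing enabled fail with FileNotFoundException, while the same reads succeed once metadata listing is disabled.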