Re: [I] File not found while using metadata table for insert_overwrite table [hudi]

2024-04-08 Thread via GitHub


nsivabalan commented on issue #10628:
URL: https://github.com/apache/hudi/issues/10628#issuecomment-2043996684

   hey @ad1happy2go: if this turns out to be an MDT data consistency issue, do 
keep me posted. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] File not found while using metadata table for insert_overwrite table [hudi]

2024-03-06 Thread via GitHub


ad1happy2go commented on issue #10628:
URL: https://github.com/apache/hudi/issues/10628#issuecomment-1980861181

   @shravanak Are you still facing this issue? Let us know in case you need 
help here.





Re: [I] File not found while using metadata table for insert_overwrite table [hudi]

2024-02-13 Thread via GitHub


ad1happy2go commented on issue #10628:
URL: https://github.com/apache/hudi/issues/10628#issuecomment-1943189279

   @shravanak That may well be the cause. Did you face this issue with other 
tables as well?





Re: [I] File not found while using metadata table for insert_overwrite table [hudi]

2024-02-07 Thread via GitHub


shravanak commented on issue #10628:
URL: https://github.com/apache/hudi/issues/10628#issuecomment-1933499929

   We are using insert write mode with Hudi 0.14.0.
   I think the file or partition it reports as missing may date from before we 
upgraded to 0.14.0 from 0.12.2.





Re: [I] File not found while using metadata table for insert_overwrite table [hudi]

2024-02-07 Thread via GitHub


ad1happy2go commented on issue #10628:
URL: https://github.com/apache/hudi/issues/10628#issuecomment-1933440282

   @Shubham21k Code link here - 
https://gist.github.com/ad1happy2go/364e66c4fa84229110f28994cc4a277f
   
   Async services are meant to run with streaming workloads like Hudi Streamer, 
so that table services can run asynchronously and don't block ingestion of the 
next micro-batch. Running them with data source (batch) writers doesn't make 
sense, and inline table services will kick in instead.
   
   @shravanak Which Hudi version are you using? Are you also using 
insert_overwrite? Can you elaborate?
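
   For batch (data source) writers, the inline setup described above would look 
along these lines. This is a sketch for illustration, not the configuration 
from this issue; all four keys do appear in Hudi's table-service configs:
   
   ```
   hoodie.clean.automatic=true
   hoodie.clean.async=false
   hoodie.archive.automatic=true
   hoodie.archive.async=false
   ```
   
   With `async=false`, cleaning and archival run inline as part of each commit 
rather than as separate asynchronous services.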





Re: [I] File not found while using metadata table for insert_overwrite table [hudi]

2024-02-07 Thread via GitHub


shravanak commented on issue #10628:
URL: https://github.com/apache/hudi/issues/10628#issuecomment-1933190181

   Is there a workaround for this issue? We are facing a similar issue as 
well. 





Re: [I] File not found while using metadata table for insert_overwrite table [hudi]

2024-02-06 Thread via GitHub


Shubham21k commented on issue #10628:
URL: https://github.com/apache/hudi/issues/10628#issuecomment-1931428449

   @ad1happy2go 
   1. All the select queries are failing due to this error. As suggested, we 
will try 0.14.1 and check whether it fixes this.
   2. 'async services doesn't work with Datasource writer' - can you elaborate? 
We have observed the cleaner & archival taking place with these 
configurations.





Re: [I] File not found while using metadata table for insert_overwrite table [hudi]

2024-02-06 Thread via GitHub


ad1happy2go commented on issue #10628:
URL: https://github.com/apache/hudi/issues/10628#issuecomment-1930407268

   @Shubham21k What queries are you trying on this data? Does select * work? 
   For point-in-time queries, this error is expected when the commit has not 
yet been archived but has already been cleaned.
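   
   For illustration, a point-in-time (time travel) read in Spark looks roughly 
like the sketch below; the instant value and `basePath` are placeholders, not 
taken from this issue. If the requested instant has been cleaned but not yet 
archived, the base files it references no longer exist and the read fails:
   
   ```
   // Time travel to a specific commit instant; fails with
   // FileNotFoundException if the cleaner has already removed
   // the base files that instant references.
   val df = spark.read.format("hudi")
     .option("as.of.instant", "20231214221641350")
     .load(basePath)
   ```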





Re: [I] File not found while using metadata table for insert_overwrite table [hudi]

2024-02-06 Thread via GitHub


ad1happy2go commented on issue #10628:
URL: https://github.com/apache/hudi/issues/10628#issuecomment-1930403617

   @Shubham21k Can you try with 0.14.1 once? Also, async services don't work 
with the data source writer.
   
   I tried to reproduce this but was unable to. Can you check whether you can 
enhance the script and reproduce it? 
   
   Code here - 
https://gist.github.com/ad1happy2go/364e66c4fa84229110f28994cc4a277f/edit
   
   





[I] File not found while using metadata table for insert_overwrite table [hudi]

2024-02-05 Thread via GitHub


Shubham21k opened a new issue, #10628:
URL: https://github.com/apache/hudi/issues/10628

   
   We are incrementally writing to a Hudi table with insert_overwrite 
operations. Recently, we enabled the Hudi metadata table for these tables. 
However, after a few days we started to encounter a `FileNotFoundException` 
while reading these tables from Athena (with metadata listing enabled). Upon 
further investigation, we observed that the metadata still contains older files 
that were cleaned up by the cleaner and are no longer available.
   
   
   
   Steps to reproduce the behavior:
   1. Create a simple DataFrame and write to a Hudi table incrementally with 
these properties:
   ```
   hoodie.datasource.meta.sync.enable=true
   hoodie.meta.sync.client.tool.class=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
   hoodie.write.markers.type=DIRECT
   hoodie.metadata.enable=true
   hoodie.datasource.write.operation=insert_overwrite
   hoodie.datasource.write.partitionpath.field=cs_load_hr
   hoodie.datasource.hive_sync.partition_fields=cs_load_hr
   partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
   hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
   hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd/HH
   hoodie.deltastreamer.source.hoodieincr.partition.extractor.class=org.apache.hudi.hive.SlashEncodedHourPartitionValueExtractor
   hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedHourPartitionValueExtractor
   hoodie.parquet.compression.codec=snappy
   hoodie.table.services.enabled=true
   hoodie.rollback.using.markers=false
   hoodie.commits.archival.batch=30
   hoodie.archive.delete.parallelism=500
   hoodie.index.type=SIMPLE
   hoodie.clean.allow.multiple=false
   hoodie.clean.async=true
   hoodie.clean.automatic=true
   hoodie.cleaner.policy=KEEP_LATEST_COMMITS
   hoodie.cleaner.commits.retained=3
   hoodie.cleaner.parallelism=500
   hoodie.cleaner.incremental.mode=true
   hoodie.clean.max.commits=8
   hoodie.archive.async=true
   hoodie.archive.automatic=true
   hoodie.archive.merge.enable=true
   hoodie.archive.merge.files.batch.size=60
   hoodie.keep.max.commits=10
   hoodie.keep.min.commits=5
   ```
   
   ```
   df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiOutputTablePath)
   ```
   
   2. After a few incremental writes, some of the base files get replaced, but 
the metadata table does not get updated properly; it continues to hold 
pointers to the old files as well.
   3. If you try reading the table using Spark or Athena, you will get a 
FileNotFoundException. 
   Keep in mind to enable metadata listing while reading; upon disabling 
metadata listing on the read side, there is no error and reads work fine.
   4. Note: we have observed this issue only for **insert_overwrite** 
operations. The metadata of upsert-operation tables gets updated correctly.
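   
   The read-side workaround from step 3 can be sketched as follows (a Spark 
Scala sketch; `basePath` is a placeholder for the table path, not from this 
issue):
   
   ```
   // Disable metadata-table-based file listing on the read path so the
   // reader falls back to direct file-system listing, avoiding the stale
   // file pointers held by the metadata table.
   val df = spark.read.format("hudi")
     .option("hoodie.metadata.enable", "false")
     .load(basePath)
   ```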
   
   **Expected behavior**
   
   It is expected that the hoodie metadata gets updated correctly.
   
   **Environment Description**
   
   * Hudi version : 0.13.1
   
   * Spark version : 3.2.1
   
   * Hive version : NA
   
   * Hadoop version : 
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   The timeline also contains replaceCommits for the corrupted tables (which 
are not present in the case of the upsert table). 
   
   ```
   $ aws s3 ls s3://tmp-data/investments_ctr_tbl/.hoodie/
  PRE .aux/
  PRE archived/
  PRE metadata/
   2023-12-08 13:32:17  0 .aux_$folder$
   2023-12-08 13:32:17  0 .schema_$folder$
   2023-12-08 13:32:17  0 .temp_$folder$
   2023-12-14 22:17:18   4678 20231214221641350.clean
   2023-12-14 22:17:11   3227 20231214221641350.clean.inflight
   2023-12-14 22:17:10   3227 20231214221641350.clean.requested
   2023-12-22 21:50:54   4439 2023114849300.clean
   2023-12-22 21:50:45   4337 2023114849300.clean.inflight
   2023-12-22 21:50:45   4337 2023114849300.clean.requested
   2023-12-30 21:51:16   4439 20231230214431936.clean
   2023-12-30 21:51:07   4337 20231230214431936.clean.inflight
   2023-12-30 21:51:07   4337 20231230214431936.clean.requested
   2024-01-07 21:53:30   4439 20240107215204594.clean
   2024-01-07 21:53:23   4337 20240107215204594.clean.inflight
   2024-01-07 21:53:22   4337 20240107215204594.clean.requested
   2024-01-15 21:55:00   4439 20240115215112126.clean
   2024-01-15 21:54:52   4337 20240115215112126.clean.inflight
   2024-01-15 21:54:52   4337 20240115215112126.clean.requested
   2024-01-23 21:46:53   4439 20240123214442067.clean
   2024-01-23 21:46:45   4337 20240123214442067.clean.inflight
   2024-01-23 21:46:45   4337 2024012321