sydneyhoran commented on issue #12734: URL: https://github.com/apache/hudi/issues/12734#issuecomment-2657453441
Hi @ad1happy2go - I am @sweir-thescore's teammate. We don't have any live examples, since we had to repair them all in real time, but this is what it looks like when we have an incomplete rollback (only the top two `.rollback.requested` and `.rollback.inflight` files are present; the highlighted file can be ignored). Once these two files are manually deleted, jobs typically succeed, but they may eventually get out of sync again or throw one of the other types of errors.

<img width="1643" alt="Image" src="https://github.com/user-attachments/assets/e5c57f09-ee43-4812-8262-f4c18be9da32" />

Every subsequent job fails with an error like one of the following (although some of these may be separate issues in their own right):

- `Caused by: org.apache.hudi.exception.HoodieRollbackException: Found commits after time :20240913164753886, please rollback greater commits first`
- `org.apache.hudi.timeline.service.RequestHandler: Bad request response due to client view behind server view`
- `HoodieMetadataException: Metadata table's deltacommits exceeded 1000: this is likely caused by a pending instant in the data table`
- `Caused by: org.apache.hudi.exception.HoodieIOException: Failed to read footer for parquet gs://.../inserted_at_date=2025-01-19/..._20250129044440519.parquet`
- `Caused by: java.io.FileNotFoundException: File not found: gs://.../inserted_at_date=2025-01-19/..._20250129044440519.parquet`

Of note, we only see this error when running a GCE cluster with a dedicated driver pool. We switched back to the regular GCE node type and no longer face this issue when a job is cancelled or fails. The Spark drivers on the dedicated driver pool also required about 5x more memory (i.e. 5GB instead of 1GB on the current cluster), and still sometimes hit OOMs (which lead to the errors above).
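For reference, the manual repair we apply is just removing the dangling `.rollback.requested` / `.rollback.inflight` pair for any rollback instant that never produced a completed `.rollback` file. A minimal sketch of that check (the function names, the instant-prefix parsing, and the local `pathlib` calls are all our own illustration; in practice we run the equivalent `gsutil rm` against the bucket's `.hoodie` directory, and this is a workaround rather than an officially documented repair):

```python
from pathlib import Path


def find_dangling_rollbacks(hoodie_dir):
    """Find rollback instants in a .hoodie timeline directory that have
    .rollback.requested / .rollback.inflight files but no completed
    .rollback file, i.e. rollbacks that never finished."""
    timeline = Path(hoodie_dir)
    names = [f.name for f in timeline.iterdir()]
    # Instants whose rollback completed (assumes the instant is the
    # prefix before the first dot in the file name).
    completed = {n.split(".")[0] for n in names if n.endswith(".rollback")}
    dangling = {}
    for n in names:
        if n.endswith((".rollback.requested", ".rollback.inflight")):
            instant = n.split(".")[0]
            if instant not in completed:
                dangling.setdefault(instant, []).append(timeline / n)
    return dangling


def remove_dangling(hoodie_dir, dry_run=True):
    """Delete the dangling pair; defaults to a dry run for safety."""
    for instant, files in sorted(find_dangling_rollbacks(hoodie_dir).items()):
        for f in files:
            print(("DRY RUN: would delete" if dry_run else "deleting"), f)
            if not dry_run:
                f.unlink()
```

This only mirrors the two-file deletion described above; it does not address why the rollbacks are left incomplete in the first place.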
We are also investigating this within GCP/Dataproc and replanning how we want to architect the cluster, but these metadata/timeline issues were the primary reason we could not switch to the new cluster configuration. So we wanted to check whether you have any thoughts here as well. Thanks in advance!
