Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
Limess commented on issue #11554: URL: https://github.com/apache/hudi/issues/11554#issuecomment-2213778295 Thanks - in that case I'll close this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
Limess closed issue #11554: [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue URL: https://github.com/apache/hudi/issues/11554
Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
danny0405 commented on issue #11554: URL: https://github.com/apache/hudi/issues/11554#issuecomment-2209847528 Yeah, this feature was introduced in 0.15.0: https://issues.apache.org/jira/browse/HUDI-7466
Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
Limess commented on issue #11554: URL: https://github.com/apache/hudi/issues/11554#issuecomment-2209271335 As shown in the stacktrace above, this failed in production rather than trying in batches.
Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
ad1happy2go commented on issue #11554: URL: https://github.com/apache/hudi/issues/11554#issuecomment-2209229250 @Limess But it will batch them and fire multiple API calls to delete all partitions. AWS has a limit of 25 partitions per delete API call, which is why we have this constant.
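The batching described in this comment can be sketched as follows. This is a minimal illustration in Python, not the code in `AWSGlueCatalogSyncClient`; the helper name `batch_partitions` and the sample partition names are hypothetical, and only the constant value 25 comes from this thread.

```python
from typing import Iterator, List, Sequence

# Value quoted in this thread; everything else here is illustrative.
MAX_DELETE_PARTITIONS_PER_REQUEST = 25


def batch_partitions(
    partitions: Sequence[str],
    batch_size: int = MAX_DELETE_PARTITIONS_PER_REQUEST,
) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size partitions (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(partitions), batch_size):
        yield list(partitions[start:start + batch_size])


# 60 made-up partition names, split into requests that respect the limit.
partitions = [f"part={i}" for i in range(60)]
batches = list(batch_partitions(partitions))
batch_sizes = [len(batch) for batch in batches]
```

Each yielded batch would correspond to one delete request, so no single request carries more than 25 partitions.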
Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
Limess commented on issue #11554: URL: https://github.com/apache/hudi/issues/11554#issuecomment-2208324162 That seems to be far exceeding `MAX_DELETE_PARTITIONS_PER_REQUEST = 25;` if that's the case.
Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
danny0405 commented on issue #11554: URL: https://github.com/apache/hudi/issues/11554#issuecomment-2207698467 The command writes an empty DataFrame using the Spark writer: https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableDropPartitionCommand.scala, and that triggers a batch sync of partitions through the Glue sync client: https://github.com/apache/hudi/blob/master/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java
Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
Limess commented on issue #11554: URL: https://github.com/apache/hudi/issues/11554#issuecomment-2205598516 I threw together a Spark script to drop all but the `n` latest partitions for our use case longer term. I still think this is a bug: the DROP PARTITIONS command to AWS Glue should be robust enough not to send too many partitions at once.
Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
danny0405 commented on issue #11554: URL: https://github.com/apache/hudi/issues/11554#issuecomment-2204769477 We have drop partition command support. Does that make sense to you?
Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
Limess commented on issue #11554: URL: https://github.com/apache/hudi/issues/11554#issuecomment-2204071448 We should probably add some management ourselves to limit the partitions. Is there any advice or pre-canned example of limiting to, say, the 450 most recent partitions and deleting any older ones using Spark?
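One possible shape for such retention logic is sketched below in Python. It is not the script mentioned elsewhere in this thread. It assumes a single partition column whose values sort chronologically as strings (for example ISO dates); the table name `my_db.my_table`, the column name `dt`, and both helper functions are hypothetical. The generated statements follow the Spark SQL `ALTER TABLE ... DROP PARTITION` form, and whether that form suits a given Hudi version should be checked against the Hudi documentation.

```python
from typing import List, Sequence


def partitions_to_drop(partition_values: Sequence[str], keep: int) -> List[str]:
    """Return every partition value except the `keep` latest (hypothetical helper).

    Assumes values sort chronologically as plain strings, e.g. ISO dates.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(set(partition_values))
    return ordered[:max(len(ordered) - keep, 0)]


def drop_statements(table: str, column: str, values: Sequence[str]) -> List[str]:
    """Build one DROP PARTITION statement per stale partition value."""
    return [f"ALTER TABLE {table} DROP PARTITION ({column}='{v}')" for v in values]


# Made-up partition values; keep the 2 latest and drop the rest.
values = ["2024-01-03", "2024-01-01", "2024-01-04", "2024-01-02"]
stale = partitions_to_drop(values, keep=2)
statements = drop_statements("my_db.my_table", "dt", stale)
# Each statement could then be passed to spark.sql(...) in a PySpark job.
```

With `keep=450` the same function would leave the 450 most recent partitions in place. Issuing one statement per partition keeps each catalog change small, at the cost of more round trips.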
[I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]
Limess opened a new issue, #11554: URL: https://github.com/apache/hudi/issues/11554

**Describe the problem you faced**

Glue sync fails with INSERT_OVERWRITE when previous partitions are not included in new load. In our case, we have a couple of years worth of data, but only want to load the last `n` days, overwriting the old table. Deleting the old, now defunct partitions fails.

**To Reproduce**

Steps to reproduce the behavior:

1. Create a Hudi table in AWS Glue with partition name strings exceeding 2048 combined + additional which *will* be included in step 2
2. INSERT_OVERWRITE into the same table, excluding any previous partitions, exceeding 2048 combined
3. Observe failure

**Expected behavior**

Partitions are removed from AWS Glue catalog correctly.

**Environment Description**

* Hudi version : 0.14.1-amzn-0
* Spark version : 3.5.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

EMR 7.1.0

**Stacktrace**

[stacktrace.txt](https://github.com/user-attachments/files/16072826/stacktrace.txt)