Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-08 Thread via GitHub


Limess commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2213778295

   Thanks - in that case I'll close this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-08 Thread via GitHub


Limess closed issue #11554: [SUPPORT] INSERT_OVERWRITE failed with large number 
of partitions on AWS Glue
URL: https://github.com/apache/hudi/issues/11554





Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-04 Thread via GitHub


danny0405 commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2209847528

   Yeah, this feature was introduced in 0.15.0: 
https://issues.apache.org/jira/browse/HUDI-7466





Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-04 Thread via GitHub


Limess commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2209271335

   As the stacktrace in the issue description shows, this failed in production 
rather than deleting the partitions in batches.





Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-04 Thread via GitHub


ad1happy2go commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2209229250

   @Limess But it will batch the deletes and fire multiple API calls to delete 
all partitions. AWS has a limit of 25 partitions per delete API call; that's 
why we have this limit.
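
   For illustration, the batching described above can be sketched as follows. 
This is a minimal sketch, not Hudi's actual implementation: the helper names 
`chunked` and `delete_partitions_in_batches` are hypothetical, and the client 
is assumed to expose a boto3-style Glue `batch_delete_partition` call that 
accepts at most 25 partitions per request.

```python
from typing import Iterator, List, Sequence

# Mirrors the 25-partition cap discussed in this thread; the constant name is
# reused here for illustration only.
MAX_DELETE_PARTITIONS_PER_REQUEST = 25


def chunked(items: Sequence[List[str]], size: int) -> Iterator[Sequence[List[str]]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_partitions_in_batches(
    glue_client,
    database: str,
    table: str,
    partition_values: Sequence[List[str]],
    batch_size: int = MAX_DELETE_PARTITIONS_PER_REQUEST,
) -> int:
    """Delete partitions in batches and return the number of API calls made.

    `glue_client` is assumed to behave like boto3's Glue client. Each entry of
    `partition_values` is the list of values identifying one partition.
    """
    calls = 0
    for batch in chunked(partition_values, batch_size):
        response = glue_client.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=[{"Values": list(values)} for values in batch],
        )
        calls += 1
        # The response lists partitions that could not be deleted, if any.
        errors = response.get("Errors", [])
        if errors:
            raise RuntimeError(f"Failed to delete {len(errors)} partition(s)")
    return calls
```

   With 60 partitions to remove, this sketch issues three requests carrying 
25, 25 and 10 partitions, so no single request exceeds the cap.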





Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-04 Thread via GitHub


Limess commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2208324162

   That seems to be far exceeding `MAX_DELETE_PARTITIONS_PER_REQUEST = 25;` if 
that's the case.





Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-03 Thread via GitHub


danny0405 commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2207698467

   The command writes an empty DataFrame using the Spark writer: 
https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/AlterHoodieTableDropPartitionCommand.scala
 and that triggers a batch sync of partitions through 
https://github.com/apache/hudi/blob/master/hudi-aws/src/main/java/org/apache/hudi/aws/sync/AWSGlueCatalogSyncClient.java





Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-03 Thread via GitHub


Limess commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2205598516

   I threw together a Spark script to drop all but the `n` latest partitions 
for our use case longer term.
   
   I still think this is a bug and the DROP PARTITIONS command to AWS Glue 
should be robust enough to not send too many partitions at once.
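
   A script along these lines might look like the sketch below. It is not the 
script mentioned above, only an illustration under stated assumptions: the 
table has a single partition column whose values sort chronologically as 
strings (for example ISO dates), and the names `partitions_to_drop`, 
`drop_partition_statements`, `db.events` and `dt` are hypothetical. The 
generated statements use the `ALTER TABLE ... DROP PARTITION` form of Spark 
SQL.

```python
from typing import List, Sequence


def partitions_to_drop(partition_values: Sequence[str], keep: int) -> List[str]:
    """Return every partition value except the `keep` latest, oldest first.

    Assumes the values sort chronologically as plain strings.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ordered = sorted(set(partition_values))
    return ordered[:max(len(ordered) - keep, 0)]


def drop_partition_statements(table: str, column: str, values: Sequence[str]) -> List[str]:
    """Build one DROP PARTITION statement per partition value."""
    return [
        f"ALTER TABLE {table} DROP PARTITION ({column} = '{value}')"
        for value in values
    ]


# Example: keep the two latest partitions of a hypothetical table.
existing = ["2024-07-03", "2024-07-01", "2024-07-02", "2024-07-04"]
statements = drop_partition_statements(
    "db.events", "dt", partitions_to_drop(existing, keep=2)
)
# With an active SparkSession named `spark`, each statement would then be run
# one at a time, e.g.:
#     for statement in statements:
#         spark.sql(statement)
```

   Issuing one statement per partition keeps each drop small, at the cost of 
one commit per partition.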





Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-02 Thread via GitHub


danny0405 commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2204769477

   We have drop partition command support; does that make sense to you?





Re: [I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-02 Thread via GitHub


Limess commented on issue #11554:
URL: https://github.com/apache/hudi/issues/11554#issuecomment-2204071448

   We should probably add some management ourselves to limit the partitions. Is 
there any advice or a pre-canned example of limiting to, say, the 450 most 
recent partitions and deleting any older ones using Spark?





[I] [SUPPORT] INSERT_OVERWRITE failed with large number of partitions on AWS Glue [hudi]

2024-07-02 Thread via GitHub


Limess opened a new issue, #11554:
URL: https://github.com/apache/hudi/issues/11554

   **Describe the problem you faced**
   
   Glue sync fails with INSERT_OVERWRITE when previous partitions are not 
included in the new load.
   
   In our case, we have a couple of years' worth of data, but only want to load 
the last `n` days, overwriting the old table. Deleting the old, now defunct 
partitions fails.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a Hudi table in AWS Glue with partitions whose name strings exceed 
2048 combined, plus additional partitions which *will* be included in step 2
   2. INSERT_OVERWRITE into the same table, excluding the previous partitions 
whose name strings exceed 2048 combined
   3. Observe failure
   
   **Expected behavior**
   
   Partitions are removed from the AWS Glue catalog correctly.
   
   **Environment Description**
   
   * Hudi version : 0.14.1-amzn-0
   
   * Spark version : 3.5.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   EMR 7.1.0
   
   **Stacktrace**
   
   
[stacktrace.txt](https://github.com/user-attachments/files/16072826/stacktrace.txt)
   

