Re: [DISCUSS] DROP PARTITION in Spark

2024-08-29 Thread Gabor Kaszab
Thanks for the answers! Sorry, I didn't drop the subject; I just had other priorities, but I still find this topic interesting to discuss. Understood, DROP PARTITION can't happen. *Thanks Anton* for showing some interest and sharing some alternatives! I checked the canDeleteWhere() and canDeleteU

Re: [DISCUSS] DROP PARTITION in Spark

2024-08-06 Thread Anton Okolnychyi
I still wonder if there is a clean way for users to ensure a DELETE statement is purely a metadata operation. We, of course, should focus on declarative commands but the cost of executing a row-level DELETE can be unacceptable in some cases. I remember multiple teams were asking for that to prevent

Re: [DISCUSS] DROP PARTITION in Spark

2024-08-06 Thread Xianjin YE
> We want users to think _less_ about how their operations are physically carried out. It is the responsibility of Iceberg and Spark to reduce the cost so that the user doesn't need to care. They should tell Spark what should happen, not how to do it.
I completely agree with your points. T

Re: [DISCUSS] DROP PARTITION in Spark

2024-08-02 Thread Ryan Blue
> That’s true, maybe we can start with a session conf or consulting the Spark community to add the ability to enforce deletion via metadata operation only?
I don't think this is the right direction. We want users to think _less_ about how their operations are physically carried out. It is the resp

Re: [DISCUSS] DROP PARTITION in Spark

2024-08-02 Thread Xianjin YE
> we would instead add support for pushing down `CAST` expressions from Spark
Supporting pushing down more expressions is definitely worth exploring. IIUC, we should already be able to do this kind of thing thanks to system function push down. Users can issue a query to deterministically

Re: [DISCUSS] DROP PARTITION in Spark

2024-08-02 Thread Ryan Blue
There's a potential solution that's similar to what Xianjin suggested. Rather than adding a new SQL keyword (which is a lot of work and specific to Iceberg) we would instead add support for pushing down `CAST` expressions from Spark. That way you could use filters like `DELETE FROM table WHERE cast

Re: [DISCUSS] DROP PARTITION in Spark

2024-08-02 Thread Xianjin YE
> b) they have a concern that with getting the WHERE filter of the DELETE not aligned with partition boundaries they might end up having pos-deletes that could have an impact on their read perf
I think this is a legit concern and currently `DELETE FROM` cannot guarantee that. It would be va

Re: [DISCUSS] DROP PARTITION in Spark

2024-08-02 Thread Gabor Kaszab
Hey Everyone, Thanks for the responses and sorry for the long delay in mine. Let me try to answer the questions that came up. Yes, this has been an ask from a specific user who finds the lack of DROP PARTITION as a blocker for migrating to Iceberg from Hive tables. I know, our initial response wa

Re: [DISCUSS] DROP PARTITION in Spark

2024-07-22 Thread Jean-Baptiste Onofré
Hi Walaa, It makes sense, thanks for pointing out the use case. I agree that it's better to consider a use-case-specific implementation. Regards, JB
On Wed, Jul 17, 2024 at 11:36 PM Walaa Eldin Moustafa wrote:
> Hi Jean, One use case is Hive to Iceberg migration, where DROP PARTITION does not need to cha

Re: [DISCUSS] DROP PARTITION in Spark

2024-07-17 Thread Ryan Blue
I agree with Walaa. Iceberg doesn't support partitions as specific structures, which is why it makes no sense to implement ADD PARTITION. While a DROP PARTITION may be convenient, it would actually be misleading. If you changed the partitioning of a table, DROP PARTITION would no longer work and it

Re: [DISCUSS] DROP PARTITION in Spark

2024-07-17 Thread Steve Zhang
I mostly agree with Walaa’s statement above. I think partitions are a first-class citizen in Hive but are modeled differently in Iceberg to support hidden partitioning and partition evolution. To me, partitions in Hive are explicit and static; the partition clause in DROP PARTITION can be error pro

Re: [DISCUSS] DROP PARTITION in Spark

2024-07-17 Thread Walaa Eldin Moustafa
Hi Jean, One use case is Hive to Iceberg migration, where DROP PARTITION does not need to change to DELETE queries prior to the migration. That said, I am not in favor of adding this to Iceberg directly (or Iceberg-Spark) due to the reasons Jean mentioned. It might be possible to do it in a custom

Re: [DISCUSS] DROP PARTITION in Spark

2024-07-17 Thread Jean-Baptiste Onofré
Hi Gabor, Do you have user requests for that? Iceberg produces partitions by taking column values (optionally with a transform function), so hidden partitioning doesn't require user action. I wonder about the use cases for dynamic partitioning (using ADD/DROP). Is it more for partition maintenan

Re: [DISCUSS] DROP PARTITION in Spark

2024-07-17 Thread Yufei Gu
Based on my observations, users don't appear to be missing this feature, but I'm OK with adding it in Spark for compatibility purposes. Yufei
On Wed, Jul 17, 2024 at 11:14 AM Szehon Ho wrote:
> Hi Gabor
>
> I'm neutral for this, but can be convinced. My initial thoughts is that there would be no

Re: [DISCUSS] DROP PARTITION in Spark

2024-07-17 Thread Szehon Ho
Hi Gabor, I'm neutral on this, but can be convinced. My initial thought is that there would be no way to have ADD PARTITION (I assume old Hive workloads would rely on this), and these are not ANSI SQL standard statements as Spark moves in that direction. The second point of guaranteeing a metad

[DISCUSS] DROP PARTITION in Spark

2024-07-17 Thread Gabor Kaszab
Hey Community, I learned recently that Spark doesn't support DROP PARTITION for Iceberg tables. I understand this is because DROP PARTITION is a command used for Hive tables, and Iceberg's model of hidden partitioning makes it unnatural to have commands like this. However, I think that