[GitHub] [spark] rdblue commented on pull request #31475: [SPARK-34360][SQL] Support table truncation by v2 Table Catalogs

2021-02-09 Thread GitBox


rdblue commented on pull request #31475:
URL: https://github.com/apache/spark/pull/31475#issuecomment-776119870


   @MaxGekk, thanks. Then let's work on updating this to fit more cleanly with 
the design of v2 catalogs and tables. This should be a `Table` interface, not 
an extension to `TableCatalog`.
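
   For illustration, a `Table`-level mix-in along those lines might look like the sketch below. The interface name and method signature are hypothetical, not a committed API; the point is that the capability is declared by the `Table` implementation, so it travels with the loaded table rather than with the catalog that loaded it:

```java
import org.apache.spark.sql.connector.catalog.Table;

// Hypothetical sketch only: name and signature are illustrative.
// Sources that can truncate implement this on their Table class.
public interface TruncatableTable extends Table {
  /** Removes all rows from the table; returns true if truncation succeeded. */
  boolean truncateTable();
}
```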






[GitHub] [spark] rdblue commented on pull request #31475: [SPARK-34360][SQL] Support table truncation by v2 Table Catalogs

2021-02-08 Thread GitBox


rdblue commented on pull request #31475:
URL: https://github.com/apache/spark/pull/31475#issuecomment-775490751


   @MaxGekk, can you share the use case that you have for this? You mentioned 
truncation-specific optimizations. I think working with concrete use cases is 
usually a good idea. If these are theoretical only -- like a user that can drop 
all data but not a subset -- then we should put this off. If there's a specific 
case, then let's discuss it.
   
   I agree that there _may_ be good reason to pass along that the engine's 
intent was to truncate. That's why we have `SupportsTruncate` for the write 
builder. And I agree with you that we don't necessarily need to use an atomic 
operation that could truncate and add data at the same time. Your point about 
not having insert permissions is a good one for justifying not using 
`SupportsTruncate`, although the case of a user that can drop all data but not 
a subset doesn't sound realistic. The point about truncation possibly being a 
metadata-only operation is why we added `SupportsDelete` at the table level.
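   
   For reference, the write-builder path already wires these together: 
`SupportsOverwrite` implements `SupportsTruncate` by treating truncation as an 
overwrite with an always-true filter. A paraphrased sketch of that existing 
DSv2 interface:

```java
import org.apache.spark.sql.connector.write.SupportsTruncate;
import org.apache.spark.sql.connector.write.WriteBuilder;
import org.apache.spark.sql.sources.AlwaysTrue$;
import org.apache.spark.sql.sources.Filter;

// Paraphrased sketch: truncation is expressed as an overwrite that
// matches every row, so a source implementing overwrite() supports
// truncate() for free.
public interface SupportsOverwrite extends WriteBuilder, SupportsTruncate {
  // Overwrite rows matching the given filters with the written data.
  WriteBuilder overwrite(Filter[] filters);

  @Override
  default WriteBuilder truncate() {
    return overwrite(new Filter[] { AlwaysTrue$.MODULE$ });
  }
}
```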
   
   Those points may indicate that an interface to truncate a table as a 
stand-alone operation is valid, although I still think that it is a bad idea to 
add more interfaces to v2 without a reasonable expectation that they will 
actually be used.
   
   Another problem here is that this operation is proposed at the catalog 
level, which does not fit with how v2 works. I think the reason for this is 
to emulate what Hive does, but that's not usually a good choice.
   
   In v2, catalogs load tables and tables are modified. That's why 
`SupportsDelete` extends `Table` and not `TableCatalog`. This keeps concerns 
separate, so we have a way to handle tables that don't exist and a separate way 
to handle tables that don't support a certain operation. Mixing those two 
together at the catalog level over-complicates the API, requiring a source to 
throw one exception if the table doesn't exist and another if it doesn't 
support truncation. (We also went through this discussion with the recently 
added interfaces to add/drop partitions.)
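   
   A sketch of that separation, assuming a catalog and identifier already in 
scope (the helper method is illustrative):

```java
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.SupportsDelete;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCatalog;
import org.apache.spark.sql.sources.Filter;

class DeleteExample {
  // Existence and capability are two separate concerns: loadTable throws
  // NoSuchTableException for a missing table, and the loaded Table then
  // advertises what it supports through mix-in interfaces.
  static void deleteAll(TableCatalog catalog, Identifier ident, Filter[] filters)
      throws NoSuchTableException {
    Table table = catalog.loadTable(ident);     // existence handled here
    if (table instanceof SupportsDelete) {      // capability handled here
      ((SupportsDelete) table).deleteWhere(filters);
    } else {
      throw new UnsupportedOperationException(
          "Table does not support deletes: " + ident);
    }
  }
}
```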
   
   Assuming that it is worth adding this interface, I would expect it to be a 
mix-in for `Table`. And like `SupportsOverwrite`, which implements 
`SupportsTruncate`, I think this should also update `SupportsDelete` so that 
tables don't need to implement both interfaces.
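
   Concretely, that could look like the sketch below, reusing the hypothetical 
`TruncatableTable` mix-in sketched earlier (again, illustrative names, not a 
committed API):

```java
import org.apache.spark.sql.sources.AlwaysTrue$;
import org.apache.spark.sql.sources.Filter;

// Hypothetical sketch: if SupportsDelete implemented the table-level
// truncate mix-in with a default method, a source that already supports
// deleteWhere would get truncation for free, mirroring how
// SupportsOverwrite provides a default truncate() for the write builder.
public interface SupportsDelete extends TruncatableTable {
  void deleteWhere(Filter[] filters);

  @Override
  default boolean truncateTable() {
    // An always-true filter matches every row in the table.
    deleteWhere(new Filter[] { AlwaysTrue$.MODULE$ });
    return true;
  }
}
```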






[GitHub] [spark] rdblue commented on pull request #31475: [SPARK-34360][SQL] Support table truncation by v2 Table Catalogs

2021-02-08 Thread GitBox


rdblue commented on pull request #31475:
URL: https://github.com/apache/spark/pull/31475#issuecomment-775318833


   @MaxGekk, why is this necessary instead of deleting from the table or 
overwriting everything with no new records? I don't see a good reason to do 
this, especially at the catalog level instead of the table level. Introducing 
new ways to do something that is already possible over-complicates the API and 
is a step in the wrong direction.
   
   Please consider this a -1 until we come to consensus -- I may support it in 
the end, but I don't want anyone choosing to commit despite disagreement in the 
meantime.
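
   For context, both of those existing paths can already empty a v2 table. A 
sketch, assuming a `SparkSession` named `spark` and an illustrative table name:

```java
import org.apache.spark.sql.SparkSession;

class TruncateAlternatives {
  // Two ways to empty a table with existing APIs (table name illustrative).
  static void emptyTable(SparkSession spark) {
    // DELETE path: planned against SupportsDelete on the loaded Table.
    spark.sql("DELETE FROM testcat.db.t WHERE true");

    // Overwrite path: replace all contents with zero new records, planned
    // against SupportsTruncate / SupportsOverwrite in the write builder.
    spark.sql("INSERT OVERWRITE TABLE testcat.db.t " +
        "SELECT * FROM testcat.db.t WHERE false");
  }
}
```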


