Here are my notes from this week’s sync.

*Attendees*:
Ryan Blue
John Zhuge
Dale Richardson
Gabor Somogyi
Matt Cheah
Yifei Huang
Xin Ren
Jose Torres
Gengliang Wang
Kevin Yu

*Topics*:
- Metadata columns or function push-down for Kafka v2 source
- Open PRs
  - REPLACE TABLE implementation: https://github.com/apache/spark/pull/24798
  - Add v2 AlterTable: https://github.com/apache/spark/pull/24937
  - Add v2 SessionCatalog: https://github.com/apache/spark/pull/24768
  - SupportsNamespaces PR: https://github.com/apache/spark/pull/24560
- Remaining SQL statements to implement
- IF NOT EXISTS for INSERT OVERWRITE
- V2 file source compatibility

*Discussion*:
- Metadata columns or function push-down for Kafka v2 source
  - Ryan: Kafka v1 source has more read columns than write columns. This is to expose metadata like partition, offset, and timestamp. Those are read columns, but not write columns, which is fine in v1. v2 requires a table schema.
  - Ryan: Two main options to fix this in v2: add metadata columns like Presto’s $file, or add function push-down similar to Spark’s input_file_name(). Metadata columns require less work (expose additional columns) but functions are more flexible (can call modified_time(col1))
  - Gabor: That’s a fair summary
  - Jose: the problem with input_file_name() is that it can be called anywhere, but is only valid in the context of a projection. After a group by, it returns empty string, which is odd. (See the sketch after this discussion item.)
  - Ryan: Couldn’t we handle that case using push-down? It is a function defined by a source that can only be run by pushing it down. It doesn’t exist after a group by, so analysis would fail if it were used in that context, just like columns don’t exist in the group by context unless they were in the grouping expression or created by aggregates.
  - Jose: That would work
  - Ryan: The metadata column approach takes less work, so I think we should do that unless someone has the time to drive the function push-down option.
  - Gabor: this is not required to move to v2. Currently working around this by not validating the schema. PR: https://github.com/apache/spark/pull/24738
  - Mostly consensus around using metadata column approach.
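To make the input_file_name() point concrete, here is a minimal sketch of the behavior Jose describes. The data path (/tmp/events), the id column, and the object name are made up for illustration; the point is only that the function returns real paths when evaluated in a projection over a file scan, but silently yields an empty string once the plan has gone through an aggregation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, input_file_name, lit}

object InputFileNameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("input-file-name-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical Parquet data with an `id` column; any file-based source behaves the same way.
    val events = spark.read.parquet("/tmp/events")

    // Evaluated in a projection over the scan, input_file_name() returns the
    // actual source path for each row.
    events.select($"id", input_file_name().as("source_file")).show(false)

    // After a group by there is no file scan in the running stage, so the same
    // call returns an empty string instead of failing analysis.
    events.groupBy($"id")
      .agg(count(lit(1)).as("n"))
      .select($"id", $"n", input_file_name().as("source_file"))
      .show(false)

    spark.stop()
  }
}
```

Under the push-down option Ryan describes, the second query would instead fail analysis, the same way referencing a column that is not in the grouping expressions does.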
- REPLACE TABLE PR:
  - Matt: this is mostly ready, just waiting for final reviews
- AlterTable PR:
  - Gengliang: should this handle complex updates, like replacing a struct with a different struct?
  - Ryan: You’re right, that doesn’t make sense. I’ll update the PR [Note: done]
- V2 session catalog PR:
  - Ryan: We talked about this last time. Any objections?
  - Jose: No, this is blocking us right now
- SupportsNamespaces PR:
  - Ryan: Please look at this, it blocks remaining SQL statements like SHOW NAMESPACES, etc.
- Remaining SQL statements:
  - Ryan: presented a list of remaining SQL statements that need to be implemented
  - Important statements (for Spark 3.0):
    - DESCRIBE [FORMATTED|EXTENDED] [TABLE] ...
    - REFRESH TABLE ...
    - SHOW TABLES [IN catalog] [LIKE ...]
    - USE CATALOG ... to set the default catalog
  - Other missing SQL statements; most depend on SupportsNamespaces PR:
    - DESCRIBE [EXTENDED] (NAMESPACE|DATABASE) ...
    - SHOW (NAMESPACES|DATABASES) [IN catalog] [LIKE ...]
    - CREATE (NAMESPACE|DATABASE) ... [PROPERTIES (...)]
    - DROP (NAMESPACE|DATABASE) ...
    - USE ... [IN catalog] to set current namespace in a catalog
  - Matt offered to implement DESCRIBE TABLE
- IF NOT EXISTS with INSERT INTO
  - John: This validates that overwrite does not overwrite partitions and is append only. Should this be supported?
  - Consensus was “why not?” Will add a mix-in trait in a follow-up for sources that choose to implement it
- File source compatibility
  - Ryan: I tried to use built-in sources like Parquet in SQL tests and hit problems. Not being able to pass a schema or table partitioning means that these tables won’t behave right. What is the plan to get these sources working with SQL?
  - No one has time to work on this
  - Ryan: I’ll write some tests to at least set out the contract so we know when the built-in sources are ready to be used.

--
Ryan Blue
Software Engineer
Netflix