Here are my notes from the latest sync. Feel free to reply with clarifications if I’ve missed anything.
*Attendees*:
Ryan Blue
John Zhuge
Russell Spitzer
Matt Cheah
Gengliang Wang
Priyanka Gomatam
Holden Karau

*Topics*:
- DataFrameWriterV2 insert vs append (recap)
- ANSI and strict modes for inserting casts
- Separating identifier resolution from table lookup
- Open PRs
  - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
  - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
  - TableProvider API update - https://github.com/apache/spark/pull/25651
  - UPDATE - https://github.com/apache/spark/pull/25626

*Discussion*:
- DataFrameWriterV2 insert vs append: recapped the agreement from the last sync
- ANSI and strict modes for inserting casts:
  - Russell: Failure modes are important. ANSI behavior is to fail at runtime, not at analysis time. If a cast is allowed but doesn't throw an exception at runtime, then this can't be considered ANSI behavior.
  - Gengliang: ANSI behavior adds the cast
  - Matt: Sounds like there are two conflicting views of the world. Is the default ANSI behavior to insert a cast that may produce NULL, or to fail at runtime?
  - Ryan: So analysis and runtime behaviors can't be separate?
  - Matt: Analysis behavior is influenced by behavior at runtime. Maybe the vote should cover both?
  - Russell: (linked to the standard) There are 3 steps: if the value is numeric and of the same type, use the data value. If the value can be rounded or truncated, round or truncate it. Otherwise, throw an exception that the value can't be cast. These are runtime requirements. (See the sketch after this topic.)
  - Ryan: Another consideration is that we can make Spark more permissive in future releases, but we can't make it more strict.
  - Matt: v1 silently corrupts data
  - Russell: ANSI is fine, as long as the runtime behavior matches (is ANSI). Don't tell people it's ANSI and then not do ANSI completely.
  - Gengliang: People are concerned about long-running jobs failing at the end
  - Ryan: That's okay because they can change the defaults: use strict analysis-time validation, or allow casts to produce NULL values.
  - Matt: As long as this is well documented, it should be fine
  - Ryan: Can we run tests to find out what exactly the behavior is?
  - Gengliang: suggested sqlfiddle.com
  - Russell ran tests in MySQL and Postgres. Both threw runtime failures.
  - Matt: Let's move on, but add the runtime behavior to the VOTE
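To make the cast-on-insert cases concrete, here is a minimal sketch (added while writing up these notes, not something we reviewed in the sync). The table name, column type, and literal are made up; the comments describe the behaviors discussed above, not what Spark does today:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: illustrates the cast-on-insert cases from the discussion above.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("insert-cast-sketch")
  .getOrCreate()

spark.sql("CREATE TABLE target (id INT) USING parquet")

// 3000000000 does not fit in an INT column, so the implicit cast must choose:
//  - v1 behavior: silently store a corrupted value
//  - ANSI behavior: add the cast at analysis time, but fail at runtime because
//    the value cannot be stored (the standard's third step)
//  - strict analysis-time validation: reject the query before it runs because
//    BIGINT cannot be safely stored in INT
spark.sql("INSERT INTO target SELECT CAST(3000000000 AS BIGINT) AS id")

spark.stop()
```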
- Identifier resolution and table lookup:
  - Ryan: Recent changes merged identifier resolution and table lookup together, because identifiers owned by the session catalog need to be loaded to find out whether to use v1 or v2 plans. I think these should be separated so that identifier resolution happens independently; doing the two tasks at the same time over-complicates the analyzer.
- SHOW NAMESPACES:
  - Ready for final review
- DataFrameWriterV2:
  - Ryan: Tests failed after the PR's checks passed. Anyone know why that would happen?
  - Gengliang: The tests failed in Maven
  - Holden: PR validation runs the SBT tests
- TableProvider API update: skipped because Wenchen didn't make it
- UPDATE support PR:
  - Ryan: There is a PR to add a SQL UPDATE command, but it delegates entirely to the data source, which seems strange.
  - Matt: What is Spark's purpose here? Why would Spark parse a SQL statement only to pass it entirely to another engine?
  - Ryan: It does make sense to do this. If Spark eventually supports MERGE INTO and other row-level operations, then it makes sense to push the operation down to some sources, like JDBC. I just find it backward to add the pushdown API before adding an implementation that handles this inside Spark; pushdown is usually an optimization.
  - Russell: Would this be safe? Spark retries lots of operations.
  - Ryan: I think it would be safe because Spark won't retry top-level operations, and this is a single method call, so nothing would get retried.
  - Ryan: I'll ask what the PR author's use case is. Maybe that would help clarify why this is a good idea.

--
Ryan Blue
Software Engineer
Netflix