Here are my notes from the latest sync. Feel free to reply with
clarifications if I’ve missed anything.
*Attendees*:
Ryan Blue
John Zhuge
Russell Spitzer
Matt Cheah
Gengliang Wang
Priyanka Gomatam
Holden Karau
*Topics*:
- DataFrameWriterV2 insert vs append (recap)
- ANSI and strict modes for inserting casts
- Separating identifier resolution from table lookup
- Open PRs
- SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
- DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
- TableProvider API update - https://github.com/apache/spark/pull/25651
- UPDATE - https://github.com/apache/spark/pull/25626
*Discussion*:
- DataFrameWriterV2 insert vs append discussion recapped the agreement
from last sync
- ANSI and strict modes for inserting casts:
- Russell: Failure modes are important. ANSI behavior is to fail at
runtime, not at analysis time. If a cast is allowed but doesn’t throw an
exception at runtime, then this can’t be considered ANSI behavior.
- Gengliang: ANSI adds the cast
- Matt: Sounds like there are two conflicting views of the world. Is
the default ANSI behavior to insert a cast that may produce NULL, or to
fail at runtime? (See the sketch at the end of this topic.)
- Ryan: So analysis and runtime behaviors can’t be separate?
- Matt: Analysis behavior is influenced by behavior at runtime. Maybe
the vote should cover both?
- Russell: (linked to the standard) There are three steps: if the value
is numeric and of the same type, use the data value; if the value can
be rounded or truncated, round or truncate it; otherwise, throw an
exception that the value can’t be cast. These are runtime requirements.
- Ryan: Another consideration is that we can make Spark more
permissive, but can’t make Spark more strict in future releases.
- Matt: v1 silently corrupts data
- Russell: ANSI is fine, as long as the runtime matches (is ANSI).
Don’t tell people it’s ANSI and not do ANSI completely.
- Gengliang: People are concerned about long-running jobs failing at
the end
- Ryan: That’s okay because they can change the defaults: use strict
analysis-time validation, or allow casts to produce NULL values.
- Matt: As long as this is well documented, it should be fine
- Ryan: Can we run tests to find out what exactly the behavior is?
- Gengliang: sqlfiddle.com
- Russell ran tests in MySQL and Postgres. Both threw runtime
failures.
- Matt: Let’s move on, but add the runtime behavior to the VOTE
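- For reference, here is a minimal sketch of the two behaviors above,
assuming a Spark 3.x build where a spark.sql.ansi.enabled flag exists
(the exact config names and defaults are what the vote needs to settle):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ansi-cast-demo")
      .getOrCreate()

    // Permissive behavior: the invalid cast is allowed and silently
    // produces NULL.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('12abc' AS INT) AS value").show()  // NULL

    // ANSI behavior: the same cast passes analysis but fails at runtime.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    spark.sql("SELECT CAST('12abc' AS INT) AS value").show()  // runtime error

  This matches the runtime failures Russell saw in MySQL and Postgres;
the open question is which behavior Spark picks as the default.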
- Identifier resolution and table lookup
- Ryan: Recent changes merged identifier resolution and table lookup
together because identifiers owned by the session catalog need to be
loaded to find out whether to use v1 or v2 plans. I think these should
be separated so that identifier resolution happens independently,
rather than the two tasks being done at the same time and
over-complicating the analyzer. (A rough sketch of the separation
follows.)
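- A rough sketch of the separation (the names below are hypothetical,
not Spark’s actual analyzer rules): identifier resolution needs only
the set of registered catalogs, while table lookup is the only step
that loads the table and chooses between v1 and v2.

    // Step 1: resolve a multipart name to a catalog plus identifier.
    // No catalog access or table loading is needed here.
    case class ResolvedIdentifier(
        catalog: String,
        namespace: Seq[String],
        name: String)

    def resolveIdentifier(
        parts: Seq[String],
        registeredCatalogs: Set[String],
        currentCatalog: String): ResolvedIdentifier = parts match {
      case head +: rest if rest.nonEmpty && registeredCatalogs.contains(head) =>
        ResolvedIdentifier(head, rest.dropRight(1), rest.last)
      case _ =>
        ResolvedIdentifier(currentCatalog, parts.dropRight(1), parts.last)
    }

    // Step 2: table lookup happens separately and later; it is the only
    // place that decides whether a v1 or v2 plan is used.
    sealed trait ResolvedTable
    case class V1Table(ident: ResolvedIdentifier) extends ResolvedTable
    case class V2Table(ident: ResolvedIdentifier) extends ResolvedTable

    def lookupTable(
        ident: ResolvedIdentifier,
        usesV1Source: ResolvedIdentifier => Boolean): ResolvedTable =
      if (usesV1Source(ident)) V1Table(ident) else V2Table(ident)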
- SHOW NAMESPACES - Ready for final review
- DataFrameWriterV2:
- Ryan: Tests failed after passing on the PR. Anyone know why that
would happen?
- Gengliang: Tests failed in Maven
- Holden: PR validation runs SBT tests
- TableProvider API update: skipped because Wenchen didn’t make it
- UPDATE support PR
- Ryan: There is a PR to add a SQL UPDATE command, but it delegates
entirely to the data source, which seems strange.
- Matt: What is Spark’s purpose here? Why would Spark parse a SQL
statement only to pass it entirely to another engine?
- Ryan: It does make sense to do this. If Spark eventually supports
MERGE INTO and other row-level operations, then it makes sense to push
the operation down to some sources, like JDBC. I just find it backward
to add the pushdown API before adding an implementation that handles
this inside Spark, since pushdown is usually an optimization. (A rough
sketch of the delegation is at the end of this topic.)
- Russell: Would this be safe? Spark retries lots of operations.
- Ryan: I think it would be safe because Spark won’t retry top-level
operations and this is a single method call. Nothing would get retried.
- Ryan: I’ll ask what the PR author’s use case is. Maybe that would
help clarify why this is a good idea.
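- A rough sketch of what “delegates entirely to the data source” means
(the interface below is hypothetical, not the PR’s actual API): Spark
parses and resolves the UPDATE statement, then hands the assignments
and condition to the source in a single call.

    trait SupportsUpdate {
      // The source applies the assignments to every row matching the
      // condition; Spark never reads or rewrites the rows itself.
      def update(assignments: Map[String, String], condition: String): Unit
    }

    // A JDBC-backed table could implement this by issuing the statement
    // directly against the remote database.
    class JdbcTable(name: String, runSql: String => Unit) extends SupportsUpdate {
      override def update(assignments: Map[String, String], condition: String): Unit = {
        val setClause = assignments
          .map { case (col, expr) => s"$col = $expr" }
          .mkString(", ")
        runSql(s"UPDATE $name SET $setClause WHERE $condition")
      }
    }

  Because the whole operation is that single method call, there is
nothing for Spark to retry, which matches the safety point above.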
--
Ryan Blue
Software Engineer
Netflix