Here are my notes from the latest sync. Feel free to reply with
clarifications if I’ve missed anything.
*Attendees*:
Ryan Blue
John Zhuge
Russell Spitzer
Matt Cheah
Gengliang Wang
Priyanka Gomatam
Holden Karau
*Topics*:
- DataFrameWriterV2 insert vs append (recap)
- ANSI and strict modes for inserting casts
- Separating identifier resolution from table lookup
- Open PRs
- SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
- DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
- TableProvider API update - https://github.com/apache/spark/pull/25651
- UPDATE - https://github.com/apache/spark/pull/25626
*Discussion*:
- DataFrameWriterV2 insert vs append discussion recapped the agreement
from last sync
- ANSI and strict modes for inserting casts:
- Russell: Failure modes are important. ANSI behavior is to fail at
runtime, not at analysis time. If a cast is allowed but doesn’t throw an
exception at runtime, then this can’t be considered ANSI behavior.
- Gengliang: ANSI adds the cast
- Matt: Sounds like there are two conflicting views of the world. Is
the default ANSI behavior to insert a cast that may produce NULL, or to
fail at runtime? (See the sketch at the end of this topic.)
- Ryan: So analysis and runtime behaviors can’t be separate?
- Matt: Analysis behavior is influenced by behavior at runtime. Maybe
the vote should cover both?
- Russell: (linked to the standard) There are three steps: if the value
is numeric and of the same type, use the data value; if the value can
be rounded or truncated, round or truncate it; otherwise, throw an
exception that the value can’t be cast. These are runtime requirements.
- Ryan: Another consideration is that we can make Spark more
permissive, but can’t make Spark more strict in future releases.
- Matt: v1 silently corrupts data
- Russell: ANSI is fine, as long as the runtime matches (is ANSI).
Don’t tell people it’s ANSI and not do ANSI completely.
- Gengliang: People are concerned about long-running jobs failing at
the end
- Ryan: That’s okay because they can change the defaults: use strict
analysis-time validation, or allow casts to produce NULL values.
- Matt: As long as this is well documented, it should be fine
- Ryan: Can we run tests to find out what exactly the behavior is?
- Gengliang: sqlfiddle.com
- Russell ran tests in MySQL and Postgres. Both threw runtime
failures.
- Matt: Let’s move on, but add the runtime behavior to the VOTE
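- For reference, here is a minimal sketch of the two behaviors above,
assuming a Spark 3.x build where a spark.sql.ansi.enabled flag exists
(the exact config names and defaults are what the vote needs to settle):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ansi-cast-demo")
      .getOrCreate()

    // Permissive behavior: the invalid cast is allowed and silently
    // produces NULL.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('12abc' AS INT) AS value").show()  // NULL

    // ANSI behavior: the same cast passes analysis but fails at runtime.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    spark.sql("SELECT CAST('12abc' AS INT) AS value").show()  // runtime error

  This matches the runtime failures Russell saw in MySQL and Postgres;
the open question is which behavior Spark picks as the default.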
- Identifier resolution and table lookup
- Ryan: Recent changes merged identifier resolution and table lookup
together because identifiers owned by the session catalog need to be
loaded to find out whether to use v1 or v2 plans. I think these should
be separated so that identifier resolution happens independently,
rather than the two tasks being done at the same time and
over-complicating the analyzer. (A rough sketch of the separation
follows.)
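- A rough sketch of the separation (the names below are hypothetical,
not Spark’s actual analyzer rules): identifier resolution needs only
the set of registered catalogs, while table lookup is the only step
that loads the table and chooses between v1 and v2.

    // Step 1: resolve a multipart name to a catalog plus identifier.
    // No catalog access or table loading is needed here.
    case class ResolvedIdentifier(
        catalog: String,
        namespace: Seq[String],
        name: String)

    def resolveIdentifier(
        parts: Seq[String],
        registeredCatalogs: Set[String],
        currentCatalog: String): ResolvedIdentifier = parts match {
      case head +: rest if rest.nonEmpty && registeredCatalogs.contains(head) =>
        ResolvedIdentifier(head, rest.dropRight(1), rest.last)
      case _ =>
        ResolvedIdentifier(currentCatalog, parts.dropRight(1), parts.last)
    }

    // Step 2: table lookup happens separately and later; it is the only
    // place that decides whether a v1 or v2 plan is used.
    sealed trait ResolvedTable
    case class V1Table(ident: ResolvedIdentifier) extends ResolvedTable
    case class V2Table(ident: ResolvedIdentifier) extends ResolvedTable

    def lookupTable(
        ident: ResolvedIdentifier,
        usesV1Source: ResolvedIdentifier => Boolean): ResolvedTable =
      if (usesV1Source(ident)) V1Table(ident) else V2Table(ident)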
- SHOW NAMESPACES - Ready for final review
- DataFrameWriterV2:
- Ryan: Tests failed after passing on the PR. Anyone know why that
would happen?
- Gengliang: Tests failed in Maven
- Holden: PR validation runs SBT tests
- TableProvider API update: skipped because Wenchen didn’t make it
- UPDATE support PR
- Ryan: There is a PR to add a SQL UPDATE command, but it delegates
entirely to the data source, which seems strange.
- Matt: What is Spark’s purpose here? Why would Spark parse a SQL
statement only to pass it entirely to another engine?
- Ryan: It does make sense to do this. If Spark eventually supports
MERGE INTO and other row-level operations, then it makes sense to push
the operation down to some sources, like JDBC. I just find it backward
to add the pushdown API before adding an implementation that handles
this inside Spark, since pushdown is usually an optimization. (A rough
sketch of the delegation is at the end of this topic.)
- Russell: Would this be safe? Spark retries lots of operations.
- Ryan: I think it would be safe because Spark won’t retry top-level
operations and this is a single method call. Nothing would get retried.
- Ryan: I’ll ask what the PR author’s use case is. Maybe that would
help clarify why this is a good idea.
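- A rough sketch of what “delegates entirely to the data source” means
(the interface below is hypothetical, not the PR’s actual API): Spark
parses and resolves the UPDATE statement, then hands the assignments
and condition to the source in a single call.

    trait SupportsUpdate {
      // The source applies the assignments to every row matching the
      // condition; Spark never reads or rewrites the rows itself.
      def update(assignments: Map[String, String], condition: String): Unit
    }

    // A JDBC-backed table could implement this by issuing the statement
    // directly against the remote database.
    class JdbcTable(name: String, runSql: String => Unit) extends SupportsUpdate {
      override def update(assignments: Map[String, String], condition: String): Unit = {
        val setClause = assignments
          .map { case (col, expr) => s"$col = $expr" }
          .mkString(", ")
        runSql(s"UPDATE $name SET $setClause WHERE $condition")
      }
    }

  Because the whole operation is that single method call, there is
nothing for Spark to retry, which matches the safety point above.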
--
Ryan Blue
Software Engineer
Netflix