Ah yes, on rereading the original email I see that the sync discussion was different. Thanks for the clarification! I’ll file a JIRA about PERMISSIVE.
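To make the PERMISSIVE idea concrete, here is a minimal sketch of today's behavior and the kind of "reason" column being proposed in the quoted thread below; columnNameOfCorruptReason is only an illustrative name for the new option, not something that exists today:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructType}

val spark = SparkSession.builder().appName("permissive-sketch").getOrCreate()

// The corrupt-record column must be declared in the schema (as a string)
// for PERMISSIVE mode to populate it.
val schema = new StructType()
  .add("id", IntegerType)
  .add("amount", DoubleType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("/path/to/input.csv")

// Today a row like "1,not-a-number" parses with amount = null and the raw line
// in _corrupt_record, but nothing records *why* the row was corrupt.
// The enhancement would add something like:
//   .option("columnNameOfCorruptReason", "_corrupt_reason")  // proposed, not an existing option
// so the reason (e.g. "cannot cast 'not-a-number' to DoubleType for column 'amount'")
// lands in its own column.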
On Mon, Sep 9, 2019 at 6:05 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi Nicholas,
>
> You are talking about a different thing. The PERMISSIVE mode is the failure mode for reading text-based data sources (json, csv, etc.). It's not the general failure mode for Spark table insertion.
>
> I agree with you that the PERMISSIVE mode is hard to use. Feel free to open a JIRA ticket if you have some better ideas.
>
> Thanks,
> Wenchen
>
> On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> A quick question about failure modes, as a casual observer of the DSv2 effort:
>>
>> I was considering filing a JIRA ticket about enhancing the DataFrameReader to include the failure *reason* in addition to the corrupt record when the mode is PERMISSIVE. So if you are loading a CSV, for example, and a value cannot be automatically cast to the type you specify in the schema, you'll get the corrupt record in the column configured by columnNameOfCorruptRecord, but you'll also get some detail about what exactly made the record corrupt, perhaps in a new column specified by something like columnNameOfCorruptReason.
>>
>> Is this an enhancement that would be possible in DSv2?
>>
>> On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Here are my notes from the latest sync. Feel free to reply with clarifications if I've missed anything.
>>>
>>> *Attendees*:
>>>
>>> Ryan Blue
>>> John Zhuge
>>> Russell Spitzer
>>> Matt Cheah
>>> Gengliang Wang
>>> Priyanka Gomatam
>>> Holden Karau
>>>
>>> *Topics*:
>>>
>>> - DataFrameWriterV2 insert vs append (recap)
>>> - ANSI and strict modes for inserting casts
>>> - Separating identifier resolution from table lookup
>>> - Open PRs
>>>    - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
>>>    - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
>>>    - TableProvider API update - https://github.com/apache/spark/pull/25651
>>>    - UPDATE - https://github.com/apache/spark/pull/25626
>>>
>>> *Discussion*:
>>>
>>> - DataFrameWriterV2 insert vs append: recapped the agreement from the last sync.
>>> - ANSI and strict modes for inserting casts:
>>>    - Russell: Failure modes are important. ANSI behavior is to fail at runtime, not at analysis time. If a cast is allowed but doesn't throw an exception at runtime, then it can't be considered ANSI behavior.
>>>    - Gengliang: ANSI adds the cast.
>>>    - Matt: Sounds like there are two conflicting views of the world. Is the default ANSI behavior to insert a cast that may produce NULL, or to fail at runtime?
>>>    - Ryan: So analysis and runtime behaviors can't be separate?
>>>    - Matt: Analysis behavior is influenced by behavior at runtime. Maybe the vote should cover both?
>>>    - Russell: (linked to the standard) There are three steps: if the value is numeric and of the same type, use the data value; if the value can be rounded or truncated, round or truncate it; otherwise, throw an exception that the value can't be cast. These are runtime requirements.
>>>    - Ryan: Another consideration is that we can make Spark more permissive in future releases, but we can't make it more strict.
>>>    - Matt: v1 silently corrupts data.
>>>    - Russell: ANSI is fine, as long as the runtime matches (is ANSI). Don't tell people it's ANSI and then not do ANSI completely.
>>>    - Gengliang: People are concerned about long-running jobs failing at the end.
>>>    - Ryan: That's okay because they can change the defaults: use strict analysis-time validation, or allow casts to produce NULL values.
>>>    - Matt: As long as this is well documented, it should be fine.
>>>    - Ryan: Can we run tests to find out what exactly the behavior is?
>>>    - Gengliang: sqlfiddle.com
>>>    - Russell ran tests in MySQL and Postgres. Both threw runtime failures.
>>>    - Matt: Let's move on, but add the runtime behavior to the VOTE.
>>> - Identifier resolution and table lookup:
>>>    - Ryan: Recent changes merged identifier resolution and table lookup together, because identifiers owned by the session catalog need to be loaded to find out whether to use v1 or v2 plans. I think this should be separated so that identifier resolution happens independently, to ensure that the two separate tasks don't end up getting done at the same time and over-complicating the analyzer.
>>> - SHOW NAMESPACES: ready for final review.
>>> - DataFrameWriterV2:
>>>    - Ryan: Tests failed after passing on the PR. Anyone know why that would happen?
>>>    - Gengliang: Tests failed in Maven.
>>>    - Holden: PR validation runs the SBT tests.
>>> - TableProvider API update: skipped because Wenchen didn't make it.
>>> - UPDATE support PR:
>>>    - Ryan: There is a PR to add a SQL UPDATE command, but it delegates entirely to the data source, which seems strange.
>>>    - Matt: What is Spark's purpose here? Why would Spark parse a SQL statement only to pass it entirely to another engine?
>>>    - Ryan: It does make sense to do this. If Spark eventually supports MERGE INTO and other row-level operations, then it makes sense to push down the operation to some sources, like JDBC. I just find it backward to add the pushdown API before adding an implementation that handles this inside Spark; pushdown is usually an optimization.
>>>    - Russell: Would this be safe? Spark retries lots of operations.
>>>    - Ryan: I think it would be safe because Spark won't retry top-level operations and this is a single method call. Nothing would get retried.
>>>    - Ryan: I'll ask what the PR author's use case is. Maybe that would help clarify why this is a good idea.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
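For anyone following along, here is a rough sketch of the two cast behaviors debated in the notes above, assuming a Spark 3.x-style spark.sql.ansi.enabled flag (the exact config names were still being decided at the time of this thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cast-behavior-sketch").getOrCreate()

// Default (non-ANSI) behavior: an impossible cast silently becomes NULL,
// which is the "silent corruption" concern raised about v1 inserts.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('not-a-number' AS INT) AS v").show()
// +----+
// |   v|
// +----+
// |null|
// +----+

// ANSI behavior: the same cast fails at runtime, matching what MySQL and
// Postgres did in the tests mentioned above.
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT CAST('not-a-number' AS INT) AS v").show()
// => throws a runtime exception instead of producing NULL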