Hi Nicholas,

You are talking about a different thing. The PERMISSIVE mode is the failure mode for reading text-based data sources (json, csv, etc.). It's not the general failure mode for Spark table insertion.
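For context, a minimal sketch of what PERMISSIVE already does today when reading CSV (the file path and column names below are made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("permissive-demo").getOrCreate()

    // For CSV, the corrupt-record column must be declared in the schema so that
    // Spark has somewhere to put the raw text of rows it cannot parse.
    val schema = new StructType()
      .add("id", IntegerType)
      .add("amount", DoubleType)
      .add("_corrupt_record", StringType)

    val df = spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")                            // the default mode
      .option("columnNameOfCorruptRecord", "_corrupt_record")  // where the bad row goes
      .csv("/tmp/example.csv")

    // Unparseable rows come back with nulls in the data columns and the original
    // line in _corrupt_record, but nothing explains *why* the row was rejected.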
I agree with you that the PERMISSIVE mode is hard to use. Feel free to open a JIRA ticket if you have some better ideas.

Thanks,
Wenchen

On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> A quick question about failure modes, as a casual observer of the DSv2 effort:
>
> I was considering filing a JIRA ticket about enhancing the DataFrameReader to include the failure *reason* in addition to the corrupt record when the mode is PERMISSIVE. So if you are loading a CSV, for example, and a value cannot be automatically cast to the type you specify in the schema, you'll get the corrupt record in the column configured by columnNameOfCorruptRecord, but you'll also get some detail about what exactly made the record corrupt, perhaps in a new column specified by something like columnNameOfCorruptReason.
>
> Is this an enhancement that would be possible in DSv2?
>
> On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> Here are my notes from the latest sync. Feel free to reply with clarifications if I've missed anything.
>>
>> *Attendees*:
>>
>> Ryan Blue
>> John Zhuge
>> Russell Spitzer
>> Matt Cheah
>> Gengliang Wang
>> Priyanka Gomatam
>> Holden Karau
>>
>> *Topics*:
>>
>> - DataFrameWriterV2 insert vs append (recap)
>> - ANSI and strict modes for inserting casts
>> - Separating identifier resolution from table lookup
>> - Open PRs
>>   - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
>>   - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
>>   - TableProvider API update - https://github.com/apache/spark/pull/25651
>>   - UPDATE - https://github.com/apache/spark/pull/25626
>>
>> *Discussion*:
>>
>> - DataFrameWriterV2 insert vs append: the discussion recapped the agreement from the last sync
>> - ANSI and strict modes for inserting casts:
>>   - Russell: Failure modes are important. ANSI behavior is to fail at runtime, not analysis time. If a cast is allowed but doesn't throw an exception at runtime, then this can't be considered ANSI behavior.
>>   - Gengliang: ANSI adds the cast
>>   - Matt: Sounds like there are two conflicting views of the world. Is the default ANSI behavior to insert a cast that may produce NULL, or to fail at runtime?
>>   - Ryan: So analysis and runtime behaviors can't be separate?
>>   - Matt: Analysis behavior is influenced by behavior at runtime. Maybe the vote should cover both?
>>   - Russell: (linked to the standard) There are 3 steps: if the value is numeric and of the same type, use the data value. If the value can be rounded or truncated, round or truncate. Otherwise, throw an exception that the value can't be cast. These are runtime requirements.
>>   - Ryan: Another consideration is that we can make Spark more permissive, but can't make Spark more strict in future releases.
>>   - Matt: v1 silently corrupts data
>>   - Russell: ANSI is fine, as long as the runtime matches (is ANSI). Don't tell people it's ANSI and then not do ANSI completely.
>>   - Gengliang: people are concerned about long-running jobs failing at the end
>>   - Ryan: That's okay because they can change the defaults: use strict analysis-time validation, or allow casts to produce NULL values.
>>   - Matt: As long as this is well documented, it should be fine
>>   - Ryan: Can we run tests to find out what exactly the behavior is?
>>   - Gengliang: sqlfiddle.com
>>   - Russell ran tests in MySQL and Postgres. Both threw runtime failures.
>>   - Matt: Let's move on, but add the runtime behavior to the VOTE
>> - Identifier resolution and table lookup
>>   - Ryan: Recent changes merged identifier resolution and table lookup together because identifiers owned by the session catalog need to be loaded to find out whether to use v1 or v2 plans. I think this should be separated so that identifier resolution happens independently, to ensure that the two separate tasks don't end up getting done at the same time and over-complicating the analyzer.
>> - SHOW NAMESPACES - Ready for final review
>> - DataFrameWriterV2:
>>   - Ryan: Tests failed after passing on the PR. Anyone know why that would happen?
>>   - Gengliang: tests failed in maven
>>   - Holden: PR validation runs SBT tests
>> - TableProvider API update: skipped because Wenchen didn't make it
>> - UPDATE support PR
>>   - Ryan: There is a PR to add a SQL UPDATE command, but it delegates entirely to the data source, which seems strange.
>>   - Matt: What is Spark's purpose here? Why would Spark parse a SQL statement only to pass it entirely to another engine?
>>   - Ryan: It does make sense to do this. If Spark eventually supports MERGE INTO and other row-level operations, then it makes sense to push down the operation to some sources, like JDBC. I just find it backward to add the pushdown API before adding an implementation that handles this inside Spark; pushdown is usually an optimization.
>>   - Russell: Would this be safe? Spark retries lots of operations.
>>   - Ryan: I think it would be safe because Spark won't retry top-level operations and this is a single method call. Nothing would get retried.
>>   - Ryan: I'll ask what the PR author's use case is. Maybe that would help clarify why this is a good idea.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
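To make the cast-on-insert discussion in the notes above concrete, here is a rough sketch of the three behaviors being compared. The config name (spark.sql.storeAssignmentPolicy) and its values are assumptions used only for illustration; the actual knob, its values, and the default are exactly what the vote would need to settle.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cast-on-insert-demo").getOrCreate()

    // Target table has an INT column; the inserted value is a BIGINT that does not
    // fit, which is the case the three policies treat differently.
    spark.sql("CREATE TABLE t (i INT) USING parquet")

    // Legacy v1 behavior: a plain cast is inserted, so the overflowing value is
    // silently wrapped -- the "silently corrupts data" case from the notes.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
    spark.sql("INSERT INTO t SELECT CAST(3000000000 AS BIGINT)")

    // ANSI behavior: the cast is still added at analysis time, but a value that
    // cannot be represented in the target type fails at runtime.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
    spark.sql("INSERT INTO t SELECT CAST(3000000000 AS BIGINT)")  // expected: runtime error

    // Strict behavior: the analyzer rejects the plan up front because BIGINT -> INT
    // may lose information, so nothing runs at all.
    spark.conf.set("spark.sql.storeAssignmentPolicy", "STRICT")
    spark.sql("INSERT INTO t SELECT CAST(3000000000 AS BIGINT)")  // expected: analysis error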