Hi Nicholas,

You are talking about a different thing. The PERMISSIVE mode is the failure
mode for reading text-based data sources (JSON, CSV, etc.). It's not the
general failure mode for Spark table insertion.
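
For reference, this is roughly how PERMISSIVE applies to a text-based read
today (the path and schema below are just illustrative):

  // Rows that fail to parse or cast are kept; the raw line is captured in
  // the column named by columnNameOfCorruptRecord, which must be declared
  // in the schema.
  val df = spark.read
    .schema("id INT, amount DOUBLE, _corrupt_record STRING")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .csv("/path/to/data.csv")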

I agree with you that the PERMISSIVE mode is hard to use. Feel free to open
a JIRA ticket if you have some better ideas.

Thanks,
Wenchen

On Mon, Sep 9, 2019 at 12:46 AM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> A quick question about failure modes, as a casual observer of the DSv2
> effort:
>
> I was considering filing a JIRA ticket about enhancing the DataFrameReader
> to include the failure *reason* in addition to the corrupt record when
> the mode is PERMISSIVE. So if you are loading a CSV, for example, and a
> value cannot be automatically cast to the type you specify in the schema,
> you'll get the corrupt record in the column configured by
> columnNameOfCorruptRecord, but you'd also get some detail about what
> exactly made the record corrupt, perhaps in a new column specified by
> something like columnNameOfCorruptReason.
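>
> For concreteness, something like this (the reason column and the new option
> below are just a straw-man sketch, not an existing API):
>
>   // Today the raw record lands in the corrupt-record column, but there is
>   // no detail about which field or cast actually failed.
>   val df = spark.read
>     .schema("id INT, amount DOUBLE, _corrupt_record STRING, _corrupt_reason STRING")
>     .option("mode", "PERMISSIVE")
>     .option("columnNameOfCorruptRecord", "_corrupt_record")
>     // Proposed option; this does not exist in Spark today.
>     .option("columnNameOfCorruptReason", "_corrupt_reason")
>     .csv("/path/to/data.csv")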
>
> Is this an enhancement that would be possible in DSv2?
>
> On Fri, Sep 6, 2019 at 6:28 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Here are my notes from the latest sync. Feel free to reply with
>> clarifications if I’ve missed anything.
>>
>> *Attendees*:
>>
>> Ryan Blue
>> John Zhuge
>> Russell Spitzer
>> Matt Cheah
>> Gengliang Wang
>> Priyanka Gomatam
>> Holden Karau
>>
>> *Topics*:
>>
>>    - DataFrameWriterV2 insert vs append (recap)
>>    - ANSI and strict modes for inserting casts
>>    - Separating identifier resolution from table lookup
>>    - Open PRs
>>       - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
>>       - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
>>       - TableProvider API update -
>>       https://github.com/apache/spark/pull/25651
>>       - UPDATE - https://github.com/apache/spark/pull/25626
>>
>> *Discussion*:
>>
>>    - DataFrameWriterV2 insert vs append discussion recapped the
>>    agreement from last sync
>>    - ANSI and strict modes for inserting casts:
>>       - Russell: Failure modes are important. ANSI behavior is to fail
>>       at runtime, not analysis time. If a cast is allowed but doesn't throw
>>       an exception at runtime, then this can't be considered ANSI behavior.
>>       - Gengliang: ANSI adds the cast
>>       - Matt: Sounds like there are two conflicting views of the world.
>>       Is the default ANSI behavior to insert a cast that may produce NULL
>>       or to fail at runtime?
>>       - Ryan: So analysis and runtime behaviors can’t be separate?
>>       - Matt: Analysis behavior is influenced by behavior at runtime.
>>       Maybe the vote should cover both?
>>       - Russell: (linked to the standard) There are 3 steps: if the value
>>       is numeric and of the same type, use the data value. If the value
>>       can be rounded or truncated, round or truncate. Otherwise, throw an
>>       exception that the value can't be cast. These are runtime
>>       requirements. (A rough sketch of these cases follows at the end of
>>       these notes.)
>>       - Ryan: Another consideration is that we can make Spark more
>>       permissive, but can’t make Spark more strict in future releases.
>>       - Matt: v1 silently corrupts data
>>       - Russell: ANSI is fine, as long as the runtime matches (is ANSI).
>>       Don’t tell people it’s ANSI and not do ANSI completely.
>>       - Gengliang: people are concerned about long-running jobs failing
>>       at the end
>>       - Ryan: That’s okay because they can change the defaults: use
>>       strict analysis-time validation, or allow casts to produce NULL values.
>>       - Matt: As long as this is well documented, it should be fine
>>       - Ryan: Can we run tests to find out what exactly the behavior is?
>>       - Gengliang: sqlfiddle.com
>>       - Russell ran tests in MySQL and Postgres. Both threw runtime
>>       failures.
>>       - Matt: Let’s move on, but add the runtime behavior to the VOTE
>>    - Identifier resolution and table lookup
>>       - Ryan: recent changes merged identifier resolution and table
>>       lookup together because identifiers owned by the session catalog
>>       need to be loaded to find out whether to use v1 or v2 plans. I think
>>       this should be separated so that identifier resolution happens
>>       independently to ensure that the two separate tasks don't end up
>>       getting done at the same time and over-complicating the analyzer.
>>    - SHOW NAMESPACES - Ready for final review
>>    - DataFrameWriterV2:
>>       - Ryan: Tests failed after passing on the PR. Anyone know why that
>>       would happen?
>>       - Gengliang: tests failed in maven
>>       - Holden: PR validation runs SBT tests
>>    - TableProvider API update: skipped because Wenchen didn’t make it
>>    - UPDATE support PR
>>       - Ryan: There is a PR to add a SQL UPDATE command, but it
>>       delegates entirely to the data source, which seems strange.
>>       - Matt: What is Spark’s purpose here? Why would Spark parse a SQL
>>       statement only to pass it entirely to another engine?
>>       - Ryan: It does make sense to do this. If Spark eventually
>>       supports MERGE INTO and other row-level operations, then it makes
>>       sense to push down the operation to some sources, like JDBC. I just
>>       find it backward to add the pushdown API before adding an
>>       implementation that handles this inside Spark; pushdown is usually
>>       an optimization.
>>       - Russell: Would this be safe? Spark retries lots of operations.
>>       - Ryan: I think it would be safe because Spark won't retry
>>       top-level operations and this is a single method call. Nothing
>>       would get retried.
>>       - Ryan: I’ll ask what the PR author’s use case is. Maybe that
>>       would help clarify why this is a good idea.
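>>
>> A rough sketch of the ANSI runtime cases discussed above (the table, the
>> values, and the exact error behavior are illustrative; the final semantics
>> are what the VOTE will settle):
>>
>>   spark.sql("CREATE TABLE target (i INT) USING parquet")
>>
>>   // A numeric value of a different type: per the standard's steps it can
>>   // be rounded or truncated, so the insert succeeds.
>>   spark.sql("INSERT INTO target VALUES (1.9)")
>>
>>   // A value that can't be cast to INT: the standard requires an error at
>>   // runtime, while the v1 behavior silently writes NULL.
>>   spark.sql("INSERT INTO target SELECT 'not a number'")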
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
