Re: DSv2 sync - 4 September 2019

2019-09-09 Thread Nicholas Chammas
Ah yes, on rereading the original email I see that the sync discussion was
different. Thanks for the clarification! I’ll file a JIRA about PERMISSIVE.


Re: DSv2 sync - 4 September 2019

2019-09-09 Thread Wenchen Fan
Hi Nicholas,

You are talking about a different thing. The PERMISSIVE mode is the failure
mode for reading text-based data sources (json, csv, etc.). It's not the
general failure mode for Spark table insertion.

I agree with you that the PERMISSIVE mode is hard to use. Feel free to open
a JIRA ticket if you have some better ideas.

Thanks,
Wenchen


Re: DSv2 sync - 4 September 2019

2019-09-08 Thread Nicholas Chammas
A quick question about failure modes, as a casual observer of the DSv2
effort:

I was considering filing a JIRA ticket about enhancing the DataFrameReader
to include the failure *reason* in addition to the corrupt record when the
mode is PERMISSIVE. So if you are loading a CSV, for example, and a value
cannot be automatically cast to the type you specify in the schema, you'll
get the corrupt record in the column configured by
columnNameOfCorruptRecord, but you'll also get some detail about what
exactly made the record corrupt, perhaps in a new column specified by
something like columnNameOfCorruptReason.

Is this an enhancement that would be possible in DSv2?
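
For context, here is a minimal Spark (Scala) sketch of the current behavior
being described, assuming a placeholder CSV path and schema; only
columnNameOfCorruptRecord exists today, and the columnNameOfCorruptReason
column is the hypothetical addition proposed above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder()
      .appName("permissive-demo") // placeholder app name
      .master("local[*]")
      .getOrCreate()

    // PERMISSIVE mode keeps rows that fail to parse or cast; the raw line is
    // written to the column named by columnNameOfCorruptRecord, which must be
    // declared in the schema as a nullable string.
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("amount", DoubleType),
      StructField("_corrupt_record", StringType)
    ))

    val df = spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .csv("/tmp/example.csv") // placeholder path

    // Today a bad row keeps its raw text in _corrupt_record, but nothing says
    // *why* it was rejected; the proposal above would surface that reason in
    // a second column (e.g. a hypothetical columnNameOfCorruptReason option).
    df.show(truncate = false)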


DSv2 sync - 4 September 2019

2019-09-06 Thread Ryan Blue
Here are my notes from the latest sync. Feel free to reply with
clarifications if I’ve missed anything.

*Attendees*:

Ryan Blue
John Zhuge
Russell Spitzer
Matt Cheah
Gengliang Wang
Priyanka Gomatam
Holden Karau

*Topics*:

   - DataFrameWriterV2 insert vs append (recap)
   - ANSI and strict modes for inserting casts
   - Separating identifier resolution from table lookup
   - Open PRs
  - SHOW NAMESPACES - https://github.com/apache/spark/pull/25601
  - DataFrameWriterV2 - https://github.com/apache/spark/pull/25681
  - TableProvider API update -
  https://github.com/apache/spark/pull/25651
  - UPDATE - https://github.com/apache/spark/pull/25626

*Discussion*:

   - DataFrameWriterV2 insert vs append discussion recapped the agreement
   from last sync
   - ANSI and strict modes for inserting casts:
   - Russell: Failure modes are important. ANSI behavior is to fail at
   runtime, not analysis time. If a cast is allowed but doesn’t throw an
   exception at runtime, then this can’t be considered ANSI behavior.
  - Gengliang: ANSI adds the cast
   - Matt: Sounds like there are two conflicting views of the world. Is
   the default ANSI behavior to insert a cast that may produce NULL or to
   fail at runtime?
  - Ryan: So analysis and runtime behaviors can’t be separate?
  - Matt: Analysis behavior is influenced by behavior at runtime. Maybe
  the vote should cover both?
  - Russell: (linked to the standard) There are 3 steps: if numeric and
  same type, use the data value. If the value can be rounded or truncated,
  round or truncate. Otherwise, throw an exception that a value can’t be
  cast. These are runtime requirements.
  - Ryan: Another consideration is that we can make Spark more
  permissive, but can’t make Spark more strict in future releases.
  - Matt: v1 silently corrupts data
  - Russell: ANSI is fine, as long as the runtime matches (is ANSI).
  Don’t tell people it’s ANSI and not do ANSI completely.
  - Gengliang: people are concerned about long-running jobs failing at
  the end
   - Ryan: That’s okay because they can change the defaults: use strict
   analysis-time validation, or allow casts to produce NULL values (a
   sketch of these behaviors follows these notes).
  - Matt: As long as this is well documented, it should be fine
  - Ryan: Can we run tests to find out what exactly the behavior is?
   - Gengliang: suggested trying sqlfiddle.com
  - Russell ran tests in MySQL and Postgres. Both threw runtime
  failures.
  - Matt: Let’s move on, but add the runtime behavior to the VOTE
   - Identifier resolution and table lookup
   - Ryan: recent changes merged identifier resolution and table lookup
   together because identifiers owned by the session catalog need to be
   loaded to find out whether to use v1 or v2 plans. I think this should
   be separated so that identifier resolution happens independently, to
   ensure that the two separate tasks don’t end up getting done at the
   same time and over-complicating the analyzer.
   - SHOW NAMESPACES - Ready for final review
   - DataFrameWriterV2:
  - Ryan: Tests failed after passing on the PR. Anyone know why that
  would happen?
   - Gengliang: tests failed in Maven
  - Holden: PR validation runs SBT tests
   - TableProvider API update: skipped because Wenchen didn’t make it
   - UPDATE support PR
   - Ryan: There is a PR to add a SQL UPDATE command, but it delegates
   entirely to the data source, which seems strange (a hypothetical sketch
   of such delegation also follows these notes).
  - Matt: What is Spark’s purpose here? Why would Spark parse a SQL
  statement only to pass it entirely to another engine?
   - Ryan: It does make sense to do this. If Spark eventually supports
   MERGE INTO and other row-level operations, then it makes sense to push
   down the operation to some sources, like JDBC. I just find it backward
   to add the pushdown API before adding an implementation that handles
   this inside Spark; pushdown is usually an optimization.
  - Russell: Would this be safe? Spark retries lots of operations.
  - Ryan: I think it would be safe because Spark won’t retry top-level
  operations and this is a single method call. Nothing would get retried.
  - Ryan: I’ll ask what the PR author’s use case is. Maybe that would
  help clarify why this is a good idea.
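
Following up on the ANSI discussion above, here is a small Spark (Scala)
sketch of the store-assignment question. It only illustrates the three
candidate behaviors discussed in the sync; the table name and inserted value
are placeholders, and the comments describe options under discussion rather
than a shipped default:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("insert-cast-demo") // placeholder app name
      .master("local[*]")
      .getOrCreate()

    // A narrow target column and a wider source value.
    spark.sql("CREATE TABLE target (x INT) USING parquet")

    // 3000000000 does not fit in an INT. The behaviors discussed:
    //  - v1/legacy: the inserted cast silently overflows (or yields NULL for
    //    un-parseable strings), i.e. data corruption with no error.
    //  - ANSI: the cast is added at analysis time, but the write fails at
    //    *runtime* because the value cannot be represented in the target type.
    //  - Strict: the statement is rejected at *analysis* time because BIGINT
    //    cannot be safely stored into INT.
    spark.sql("INSERT INTO target SELECT 3000000000L")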
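
On the UPDATE point, a purely hypothetical sketch of what "delegating the
UPDATE entirely to the source" could mean. This is not the interface from the
PR above; the trait name and signature are invented here only to show why the
operation would be a single driver-side call that Spark would not retry:

    import org.apache.spark.sql.sources.Filter

    // Hypothetical only: a source that can run an UPDATE itself (e.g. a JDBC
    // database) exposes one entry point that Spark calls after parsing the
    // statement, so nothing is broken into retryable tasks.
    trait SupportsDelegatedUpdate {
      // assignments: column name -> SQL expression text to assign
      // condition:   the parsed WHERE clause, pushed down as v1-style filters
      def update(assignments: Map[String, String], condition: Array[Filter]): Unit
    }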

-- 
Ryan Blue
Software Engineer
Netflix