Re: DataSourceV2 sync, 17 April 2019

2019-04-29 Thread Ryan Blue
That is mostly correct. V2 standardizes the behavior of logical operations
like CTAS across data sources, so it isn't compatible with v1 behavior.
Consequently, we can't simply move to v2; we have to maintain both in
parallel and eventually deprecate v1.
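
To make the CTAS point concrete, here is a rough Scala sketch of the kind of
standardization v2 adds. The catalog trait below is a simplified stand-in for
the proposed TableCatalog, not the actual API:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Simplified stand-in for the proposed TableCatalog; real signatures differ.
trait SimpleTableCatalog {
  def tableExists(ident: String): Boolean
  def createTable(ident: String, schema: StructType): Unit
  def dropTable(ident: String): Boolean
}

// In v1, CTAS semantics (what happens if the table exists, what happens on
// failure) are left to each source's SaveMode handling. In v2, Spark itself
// drives the logical operation against the catalog, so every source behaves
// the same way:
def createTableAsSelect(
    catalog: SimpleTableCatalog,
    ident: String,
    query: DataFrame,
    ignoreIfExists: Boolean): Unit = {
  if (catalog.tableExists(ident)) {
    if (ignoreIfExists) {
      return
    }
    throw new IllegalStateException(s"Table already exists: $ident")
  }
  catalog.createTable(ident, query.schema)
  try {
    // Append the query result into the table that was just created.
    query.write.insertInto(ident)
  } catch {
    case e: Throwable =>
      // Spark, not the source, defines the failure behavior: drop the table.
      catalog.dropTable(ident)
      throw e
  }
}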

We are aiming to have a working v2 in Spark 3.0, but the community has not
committed to this goal. Support may be incomplete.

rb

On Sat, Apr 27, 2019 at 7:13 AM Jean Georges Perrin  wrote:

> This may be completely inappropriate, and I apologize if it is;
> nevertheless, I am trying to get some clarification about the current
> status of DS.
>
> Please tell me where I am wrong:
>
> Currently, the stable API is v1.
> There is a v2 DS API, but it is not widely used.
> The group is working on a “new” v2 API that will be available after the
> release of Spark v3.
>
> jg
>
> --
> Jean Georges Perrin
> j...@jgp.net
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: DataSourceV2 sync, 17 April 2019

2019-04-27 Thread Jean Georges Perrin
This may be completely inappropriate, and I apologize if it is; nevertheless, I
am trying to get some clarification about the current status of DS.

Please tell me where I am wrong:

Currently, the stable API is v1.
There is a v2 DS API, but it is not widely used.
The group is working on a “new” v2 API that will be available after the release 
of Spark v3.

jg

--
Jean Georges Perrin
j...@jgp.net






DataSourceV2 sync, 17 April 2019

2019-04-19 Thread Ryan Blue
Here are my notes from the last DSv2 sync. As always:

   - If you’d like to attend the sync, send me an email and I’ll add you to
   the invite. Everyone is welcome.
   - These notes are what I wrote down and remember. If you have
   corrections or comments, please reply.

*Topics*:

   - TableCatalog PR #24246: https://github.com/apache/spark/pull/24246
   - Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233
   - Streaming capabilities PR #24129:
   https://github.com/apache/spark/pull/24129

*Attendees*:

Ryan Blue
John Zhuge
Matt Cheah
Yifei Huang
Bruce Robbins
Jamison Bennett
Russell Spitzer
Wenchen Fan
Yuanjian Li

(and others who arrived after the start)

*Discussion*:

   - TableCatalog PR: https://github.com/apache/spark/pull/24246
   - Wenchen and Matt had just reviewed the PR. It is mostly what was in
   the SPIP, so there was not much discussion of the content.
  - Wenchen: Easier to review if the changes to move Table and
  TableCapability were in a separate PR (mostly import changes)
  - Ryan will open a separate PR for the move [Ed: #24410]
   - Russell: How should caching work? He has hit lots of problems with
   Spark caching table data that then gets out of date.
   - Ryan: Spark should always call into the catalog rather than cache, to
   avoid those problems. However, Spark should ensure that it uses the same
   instance of a Table for all scans in the same query, for consistent
   self-joins. (See the first sketch below.)
   - Some discussion of self-joins followed. The conclusion was that we
   don’t need to worry about this yet because the case is unlikely.
  - Wenchen: should this include the namespace methods?
  - Ryan: No, those are a separate concern and can be added in a
  parallel PR.
   - Remove SaveMode PR: https://github.com/apache/spark/pull/24233
  - Wenchen: PR is on hold waiting for streaming capabilities, #24129,
  because the Noop sink doesn’t validate schema
   - Wenchen will open a PR to add a capability to opt out of schema
   validation, then come back to this PR. (See the capability sketch below.)
   - Streaming capabilities PR: https://github.com/apache/spark/pull/24129
  - Ryan: This PR needs validation in the analyzer. The analyzer is
  where validations should exist, or else validations must be copied into
  every code path that produces a streaming plan.
  - Wenchen: the write check can’t be written because the write node is
  never passed to the analyzer. Fixing that is a larger problem.
  - Ryan: Agree that refactoring to pass the write node to the analyzer
  should be separate.
   - Wenchen: a check to ensure that either microbatch or continuous can
   be used is hard because some sources may fall back to v1
  - Ryan: By the time this check runs, fallback has happened. Do v1
  sources support continuous mode?
  - Wenchen: No, v1 doesn’t support continuous
  - Ryan: Then this can be written to assume that v1 sources only
  support microbatch mode.
  - Wenchen will add this check
   - Wenchen: the check that tables in a v2 streaming relation support
   either microbatch or continuous won’t catch anything and is unnecessary
   - Ryan: These checks still need to be in the analyzer so future uses
   do not break. We had the same problem moving to v2: because schema checks
   were specific to DataSource code paths, they were overlooked when adding
   v2. Running validations in the analyzer avoids problems like this. (See
   the analyzer-check sketch below.)
  - Wenchen will add the validation.
   - Matt: Will v2 be ready in time for the 3.0 release?
  - Ryan: Once #24246 is in, we can work on PRs in parallel, but it is
  not looking good.
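
To make the caching discussion concrete, here is a rough Scala sketch of the
lookup behavior described above: the catalog is consulted for every new query
(no cross-query caching), but one resolved Table instance is reused within a
query so self-joins read a consistent version. The Table and catalog traits
are simplified stand-ins, not the proposed API:

import scala.collection.mutable

// Simplified stand-ins; not the proposed API.
trait Table
trait SimpleCatalog {
  def loadTable(ident: String): Table
}

// One instance per query: because the resolver is discarded after analysis,
// the next query goes back to the catalog and sees fresh metadata, while all
// scans within one query share the same Table instance.
class QueryScopedResolver(catalog: SimpleCatalog) {
  private val resolved = mutable.Map.empty[String, Table]

  def resolve(ident: String): Table =
    resolved.getOrElseUpdate(ident, catalog.loadTable(ident))
}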
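
The schema-validation opt-out for the SaveMode work could look roughly like
the sketch below. The capability name ACCEPT_ANY_SCHEMA is assumed here for
illustration; whatever name the follow-up PR uses, the idea is that a sink
such as the noop sink declares the capability and the analyzer skips schema
checks for writes to it:

import org.apache.spark.sql.types.StructType

// Assumed capability names, for illustration only.
sealed trait TableCapability
case object BATCH_WRITE extends TableCapability
case object ACCEPT_ANY_SCHEMA extends TableCapability

trait WritableTable {
  def schema: StructType
  def capabilities: Set[TableCapability]
}

// Analyzer-side check: skip schema validation only when the table opts out.
def validateWriteSchema(table: WritableTable, querySchema: StructType): Unit = {
  if (!table.capabilities.contains(ACCEPT_ANY_SCHEMA) && querySchema != table.schema) {
    throw new IllegalArgumentException(
      s"Cannot write schema ${querySchema.simpleString} to a table with " +
      s"schema ${table.schema.simpleString}")
  }
}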
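
The streaming analyzer check discussed above, sketched roughly: each source
must support microbatch or continuous, v1 sources are treated as
microbatch-only, and analysis fails if the sources share no common mode.
Capability names follow the capabilities PR; everything else is a simplified
model, not the actual analyzer code:

// Simplified model of streaming sources and their capabilities.
sealed trait StreamSource
case class V1Source(name: String) extends StreamSource
case class V2Source(name: String, capabilities: Set[String]) extends StreamSource

val StreamingModes = Set("MICRO_BATCH_READ", "CONTINUOUS_READ")

def supportedModes(source: StreamSource): Set[String] = source match {
  // Agreed in the sync: v1 sources never support continuous mode.
  case V1Source(_)       => Set("MICRO_BATCH_READ")
  case V2Source(_, caps) => caps intersect StreamingModes
}

// Fails analysis when the sources of a streaming query share no execution mode.
def checkStreamingSources(sources: Seq[StreamSource]): Unit = {
  val common = sources
    .map(supportedModes)
    .reduceOption(_ intersect _)
    .getOrElse(StreamingModes)
  if (common.isEmpty) {
    throw new IllegalArgumentException(
      "Streaming sources do not support a common execution mode: " +
      sources.mkString(", "))
  }
}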

-- 
Ryan Blue
Software Engineer
Netflix