Thanks Ryan! On Tue, Mar 5, 2019 at 7:19 PM Ryan Blue <rb...@netflix.com> wrote:
> Everyone is welcome to join this discussion. Just send me an e-mail to get
> added to the invite.
>
> Stavros, I'll add you.
>
> rb
>
> On Tue, Mar 5, 2019 at 5:43 AM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Thanks for the update. Is this meeting open for other people to join?
>>
>> Stavros
>>
>> On Thu, Feb 21, 2019 at 10:56 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Here are my notes from the DSv2 sync last night. As always, if you have
>>> corrections, please reply with them. And if you'd like to be included on
>>> the invite to participate in the next sync (6 March), send me an email.
>>>
>>> Here's a quick summary of the topics where we had consensus last night:
>>>
>>> - The behavior of v1 sources needs to be documented to come up with a
>>>   migration plan
>>> - Spark 3.0 should include DSv2, even if it would delay the release
>>>   (pending community discussion and vote)
>>> - Design for the v2 catalog plugin system
>>> - The v2 catalog approach of separate TableCatalog, FunctionCatalog, and
>>>   ViewCatalog interfaces
>>> - Common v2 table metadata should be schema, partitioning, and a
>>>   string map of properties, leaving out sorting for now (ready to vote
>>>   on the metadata SPIP)
>>>
>>> *Topics*:
>>>
>>> - Issues raised by the ORC v2 commit
>>> - Migration to v2 sources
>>> - Roadmap and current blockers
>>> - Catalog plugin system
>>> - Catalog API separate-interfaces approach
>>> - Catalog API metadata (schema, partitioning, and properties)
>>> - Public catalog API proposal
>>>
>>> *Notes*:
>>>
>>> - Issues raised by the ORC v2 commit
>>>   - Ryan: Disabled the change to use v2 by default in the PR for
>>>     overwrite plans: tests rely on CTAS, which is not implemented in v2.
>>>   - Wenchen: Suggested using a StagedTable to work around not having
>>>     CTAS finished. TableProvider could create a staged table.
>>>   - Ryan: Using StagedTable doesn't make sense to me. It was intended
>>>     to solve a different problem (atomicity).
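[Ed: To illustrate why StagedTable is tied to atomicity rather than to CTAS, here is a minimal sketch. The interface and catalog below are hypothetical simplifications, not Spark's actual DSv2 API: the point is only that a staged table becomes visible in one atomic step on commit, or leaves no trace on abort.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified sketch of the staged-table idea: a StagedTable
// exists so that a create-and-write can be made atomic -- the table becomes
// visible only when commit is called, and abort discards it. Names and
// signatures are illustrative only.
public class StagedTableSketch {
    interface StagedTable {
        void commitStagedChanges();  // publish the table atomically
        void abortStagedChanges();   // discard everything that was staged
    }

    /** A toy catalog that stages a table and publishes it only on commit. */
    static class InMemoryCatalog {
        final Map<String, String> tables = new HashMap<>();

        StagedTable stageCreate(String name, String schema) {
            return new StagedTable() {
                public void commitStagedChanges() { tables.put(name, schema); }
                public void abortStagedChanges() { /* nothing was published */ }
            };
        }
    }

    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();

        // An aborted staged create leaves no trace: the atomicity guarantee.
        StagedTable aborted = catalog.stageCreate("t1", "id INT");
        aborted.abortStagedChanges();

        // A committed staged create publishes the table in a single step.
        StagedTable committed = catalog.stageCreate("t2", "id INT");
        committed.commitStagedChanges();

        System.out.println(catalog.tables.containsKey("t1")); // false
        System.out.println(catalog.tables.containsKey("t2")); // true
    }
}
```

[Ed: Reusing this mechanism as a stand-in for CTAS would need either the same metadata CTAS needs, or a blank staged table, which is the mismatch described above.]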
>>>     Adding an interface to create a staged table either requires the
>>>     same metadata as CTAS or requires a blank staged table, which isn't
>>>     the same concept: these staged tables would behave entirely
>>>     differently than the ones for atomic operations. Better to spend
>>>     time getting CTAS done and work through the long-term plan than to
>>>     hack around it.
>>> - Second issue raised by the ORC work: how to support tables that use
>>>   different validations.
>>>   - Ryan: What Gengliang's PRs are missing is a clear definition of
>>>     which tables require different validation and what that validation
>>>     should be. In some cases, CTAS is validated against existing data
>>>     [Ed: this is PreprocessTableCreation], and in some cases, Append has
>>>     no validation because the table doesn't exist. What isn't clear is
>>>     when these validations are applied.
>>>   - Ryan: Without knowing exactly how v1 works, we can't mirror that
>>>     behavior in v2. Building a way to turn off validation is going to
>>>     be needed, but is insufficient without knowing when to apply it.
>>>   - Ryan: We also don't know whether it will make sense to maintain all
>>>     of these rules to mimic v1 behavior. In v1, CTAS and Append can
>>>     both write to existing tables, but use different rules to validate.
>>>     What are the differences between them? It is unlikely that Spark
>>>     will support both as options, if that is even possible. [Ed: see
>>>     the later discussion on migration that continues this.]
>>>   - Gengliang: Using SaveMode is an option.
>>>   - Ryan: Using SaveMode only appears to fix this, but doesn't actually
>>>     test v2. Using SaveMode appears to work because it disables all
>>>     validation and uses code from v1 that will "create" tables by
>>>     writing. But this isn't helpful for the v2 goal of having defined
>>>     and reliable behavior.
>>>   - Gengliang: SaveMode is not correctly translated. Append could mean
>>>     AppendData or CTAS.
>>>   - Ryan: This is why we need to focus on finishing the v2 plans: so
>>>     that we can correctly translate the SaveMode into the right plan.
>>>     That depends on having a catalog for CTAS and to check for the
>>>     existence of a table.
>>>   - Wenchen: The catalog doesn't support path tables, so how does this
>>>     help?
>>>   - Ryan: The multi-catalog identifiers proposal includes a way to pass
>>>     paths as CatalogIdentifiers [Ed: see PathIdentifier]. This allows a
>>>     catalog implementation to handle path-based tables. The identifier
>>>     will also have a method to test whether the identifier is a path
>>>     identifier, and catalogs are not required to support path
>>>     identifiers.
>>> - Migration to v2 sources
>>>   - Hyukjin: Once the ORC upgrade is done, how will we move from v1 to
>>>     v2?
>>>   - Ryan: We will need to develop v1 and v2 in parallel. There are many
>>>     code paths in v1 and we don't know exactly what they do. We first
>>>     need to know what they do and make a migration plan after that.
>>>   - Hyukjin: What if there are many behavior differences? Will this
>>>     require an API to opt in to each one?
>>>   - Ryan: Without knowing how v1 behaves, we can only speculate. But I
>>>     don't think that we will want to support many of these special
>>>     cases. That is a lot of work and maintenance.
>>>   - Gengliang: When can we change the default to v2? Until we change
>>>     the default, v2 is not tested. The v2 work is blocked by this.
>>>   - Ryan: The v2 work should not be blocked by finishing CTAS and the
>>>     other plans. This can proceed in parallel.
>>>   - Matt: We don't need to use the existing tests; we can add tests for
>>>     v2 below the DataFrame writer level.
>>>   - Gengliang: But those tests would not be end-to-end.
>>>   - Ryan: For end-to-end tests, we should add a new DataFrame write
>>>     API. That is going to be needed to move entirely to v2 and drop the
>>>     v1 behavior hacks anyway. Adding it now fixes both problems.
>>>   - Matt: Supports the idea of adding the DataFrame v2 write API now.
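[Ed: The SaveMode ambiguity and its resolution can be sketched roughly as follows. This is not Spark's real translation code; the names (`translateAppend`, the plan strings) are hypothetical. The point is that resolving a v1 `SaveMode.Append` into either a v2 append or a CTAS plan is only possible once a catalog can answer whether the target table exists.]

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the translation discussed above: a v1
// SaveMode.Append is ambiguous, and picking the right v2 plan requires a
// catalog that supports existence checks. All names are hypothetical.
public class SaveModeTranslation {
    enum SaveMode { Append, Overwrite, ErrorIfExists, Ignore }

    /** The only catalog capability this sketch needs: existence checks. */
    interface TableCatalog {
        boolean tableExists(String identifier);
    }

    /** Choose a v2 logical plan name for a v1 Append, using the catalog. */
    static String translateAppend(TableCatalog catalog, String table) {
        // Appending to an existing table is a plain append plan; appending
        // to a missing table was v1's implicit "create by writing", which
        // in v2 becomes an explicit create-table-as-select (CTAS).
        return catalog.tableExists(table) ? "AppendData" : "CreateTableAsSelect";
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>();
        existing.add("db.events");
        TableCatalog catalog = existing::contains;

        System.out.println(translateAppend(catalog, "db.events"));
        System.out.println(translateAppend(catalog, "db.missing"));
    }
}
```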
>>>   - *Consensus for documenting the behavior of v1* (Gengliang will work
>>>     on this because it affects his work.)
>>> - Roadmap:
>>>   - Matt (I think): The community should commit to finishing the
>>>     planned DSv2 work for Spark 3.0.
>>>   - Ryan: Agreed. We can't wait forever, and much of this work has been
>>>     pending for a year now. If this doesn't make it into 3.0, we will
>>>     need to consider other options.
>>>   - Felix: The goal should be 3.0, even if it requires delaying the
>>>     release.
>>>   - *Consensus: Spark 3.0 should include DSv2, even if it requires
>>>     delaying the release.* Ryan will start a discussion thread about
>>>     committing to DSv2 in Spark 3.0.
>>>   - Matt: What DSv2 work is outstanding?
>>>   - Ryan: Addition of the TableCatalog API, the catalog plugin system,
>>>     and the CTAS implementation.
>>>   - Matt: What blocks those things?
>>>   - Ryan: The next blockers are agreement on the catalog plugin system,
>>>     the catalog API approach (separate TableCatalog, FunctionCatalog,
>>>     etc.), and the TableCatalog metadata.
>>>   - *Consensus formed for the catalog plugin system* (as previously
>>>     discussed)
>>>   - *Consensus formed for the catalog API approach*
>>>   - *Consensus formed for the TableCatalog metadata in the SPIP* (Ryan
>>>     will start a vote thread for this SPIP)
>>>   - Ryan: The metadata SPIP also includes a public API that isn't
>>>     required. Will move it to an implementation sketch so it is
>>>     informational.
>>>   - Wenchen: InternalRow is written in Scala and needs a stable API.
>>>   - Ryan: Can we do the InternalRow fix later?
>>>   - Wenchen: Yes, not a blocker.
>>>   - Ryan: The table metadata also contains sort information.
>>>   - Wenchen: Bucketing contains sort information, but it isn't used
>>>     because it applies only to single files.
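[Ed: For readers following along, the agreed metadata shape can be sketched like this. The interface and method names below are illustrative only; the SPIP defines the actual API. It captures the three pieces of common v2 table metadata the group settled on: schema, partitioning, and a string map of properties, with no sort order.]

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the agreed common v2 table metadata: schema,
// partitioning, and a string map of properties. Deliberately no sort-order
// method. Names are illustrative, not the SPIP's real API.
public class TableMetadataSketch {
    /** The three agreed-upon pieces of common v2 table metadata. */
    interface Table {
        List<String> schema();             // column definitions
        List<String> partitioning();       // partition transforms
        Map<String, String> properties();  // free-form string properties
        // Note: intentionally no sortOrder() -- left out for now.
    }

    public static void main(String[] args) {
        Table t = new Table() {
            public List<String> schema() {
                return Arrays.asList("id INT", "ts TIMESTAMP", "data STRING");
            }
            public List<String> partitioning() {
                return Arrays.asList("days(ts)");
            }
            public Map<String, String> properties() {
                return Map.of("provider", "orc");
            }
        };

        System.out.println(t.schema().size());
        System.out.println(t.partitioning().get(0));
        System.out.println(t.properties().get("provider"));
    }
}
```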
>>>   - *Consensus formed on not including sorts in v2 table metadata.*
>>>
>>> *Attendees*:
>>> Ryan Blue
>>> John Zhuge
>>> Dongjoon Hyun
>>> Felix Cheung
>>> Gengliang Wang
>>> Hyukjin Kwon
>>> Jacky Lee
>>> Jamison Bennett
>>> Matt Cheah
>>> Yifei Huang
>>> Russell Spitzer
>>> Wenchen Fan
>>> Yuanjian Li
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix