Thanks Ryan! On Tue, Mar 5, 2019 at 7:19 PM Ryan Blue <rb...@netflix.com> wrote:
> Everyone is welcome to join this discussion. Just send me an e-mail to get
> added to the invite.
>
> Stavros, I'll add you.
>
> rb
>
> On Tue, Mar 5, 2019 at 5:43 AM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Thanks for the update. Is this meeting open for other people to join?
>>
>> Stavros
>>
>> On Thu, Feb 21, 2019 at 10:56 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Here are my notes from the DSv2 sync last night. As always, if you have
>>> corrections, please reply with them. And if you'd like to be included on
>>> the invite to participate in the next sync (6 March), send me an email.
>>>
>>> Here's a quick summary of the topics where we had consensus last night:
>>>
>>> - The behavior of v1 sources needs to be documented to come up with a
>>>   migration plan
>>> - Spark 3.0 should include DSv2, even if it would delay the release
>>>   (pending community discussion and vote)
>>> - Design for the v2 catalog plugin system
>>> - The v2 catalog approach of separate TableCatalog, FunctionCatalog, and
>>>   ViewCatalog interfaces
>>> - Common v2 table metadata should be schema, partitioning, and a
>>>   string map of properties, leaving out sorting for now (ready to vote
>>>   on the metadata SPIP)
>>>
>>> *Topics*:
>>>
>>> - Issues raised by the ORC v2 commit
>>> - Migration to v2 sources
>>> - Roadmap and current blockers
>>> - Catalog plugin system
>>> - Catalog API separate-interfaces approach
>>> - Catalog API metadata (schema, partitioning, and properties)
>>> - Public catalog API proposal
>>>
>>> *Notes*:
>>>
>>> - Issues raised by the ORC v2 commit
>>>   - Ryan: Disabled the change to use v2 by default in the PR for
>>>     overwrite plans: tests rely on CTAS, which is not implemented in v2.
>>>   - Wenchen: Suggested using a StagedTable to work around not having
>>>     CTAS finished. TableProvider could create a staged table.
>>>   - Ryan: Using StagedTable doesn't make sense to me. It was intended
>>>     to solve a different problem (atomicity).
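[Ed: To illustrate why StagedTable is tied to atomicity rather than to CTAS, here is a minimal sketch. The interface and catalog below are hypothetical simplifications, not Spark's actual DSv2 API: the point is only that a staged table becomes visible in one atomic step on commit, or leaves no trace on abort.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified sketch of the staged-table idea: a StagedTable
// exists so that a create-and-write can be made atomic -- the table becomes
// visible only when commit is called, and abort discards it. Names and
// signatures are illustrative only.
public class StagedTableSketch {
    interface StagedTable {
        void commitStagedChanges();  // publish the table atomically
        void abortStagedChanges();   // discard everything that was staged
    }

    /** A toy catalog that stages a table and publishes it only on commit. */
    static class InMemoryCatalog {
        final Map<String, String> tables = new HashMap<>();

        StagedTable stageCreate(String name, String schema) {
            return new StagedTable() {
                public void commitStagedChanges() { tables.put(name, schema); }
                public void abortStagedChanges() { /* nothing was published */ }
            };
        }
    }

    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();

        // An aborted staged create leaves no trace: the atomicity guarantee.
        StagedTable aborted = catalog.stageCreate("t1", "id INT");
        aborted.abortStagedChanges();

        // A committed staged create publishes the table in a single step.
        StagedTable committed = catalog.stageCreate("t2", "id INT");
        committed.commitStagedChanges();

        System.out.println(catalog.tables.containsKey("t1")); // false
        System.out.println(catalog.tables.containsKey("t2")); // true
    }
}
```

[Ed: Reusing this mechanism as a stand-in for CTAS would need either the same metadata CTAS needs, or a blank staged table, which is the mismatch described above.]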
>>>     Adding an interface to create a staged table either requires the
>>>     same metadata as CTAS or requires a blank staged table, which isn't
>>>     the same concept: these staged tables would behave entirely
>>>     differently than the ones for atomic operations. Better to spend
>>>     time getting CTAS done and work through the long-term plan than to
>>>     hack around it.
>>> - Second issue raised by the ORC work: how to support tables that use
>>>   different validations.
>>>   - Ryan: What Gengliang's PRs are missing is a clear definition of
>>>     which tables require different validation and what that validation
>>>     should be. In some cases, CTAS is validated against existing data
>>>     [Ed: this is PreprocessTableCreation], and in some cases, Append has
>>>     no validation because the table doesn't exist. What isn't clear is
>>>     when these validations are applied.
>>>   - Ryan: Without knowing exactly how v1 works, we can't mirror that
>>>     behavior in v2. Building a way to turn off validation is going to
>>>     be needed, but is insufficient without knowing when to apply it.
>>>   - Ryan: We also don't know whether it will make sense to maintain all
>>>     of these rules to mimic v1 behavior. In v1, CTAS and Append can
>>>     both write to existing tables, but use different rules to validate.
>>>     What are the differences between them? It is unlikely that Spark
>>>     will support both as options, if that is even possible. [Ed: see
>>>     the later discussion on migration that continues this.]
>>>   - Gengliang: Using SaveMode is an option.
>>>   - Ryan: Using SaveMode only appears to fix this, but doesn't actually
>>>     test v2. Using SaveMode appears to work because it disables all
>>>     validation and uses code from v1 that will "create" tables by
>>>     writing. But this isn't helpful for the v2 goal of having defined
>>>     and reliable behavior.
>>>   - Gengliang: SaveMode is not correctly translated. Append could mean
>>>     AppendData or CTAS.
>>>   - Ryan: This is why we need to focus on finishing the v2 plans: so
>>>     that we can correctly translate the SaveMode into the right plan.
>>>     That depends on having a catalog for CTAS and to check for the
>>>     existence of a table.
>>>   - Wenchen: The catalog doesn't support path tables, so how does this
>>>     help?
>>>   - Ryan: The multi-catalog identifiers proposal includes a way to pass
>>>     paths as CatalogIdentifiers [Ed: see PathIdentifier]. This allows a
>>>     catalog implementation to handle path-based tables. The identifier
>>>     will also have a method to test whether the identifier is a path
>>>     identifier, and catalogs are not required to support path
>>>     identifiers.
>>> - Migration to v2 sources
>>>   - Hyukjin: Once the ORC upgrade is done, how will we move from v1 to
>>>     v2?
>>>   - Ryan: We will need to develop v1 and v2 in parallel. There are many
>>>     code paths in v1 and we don't know exactly what they do. We first
>>>     need to know what they do and make a migration plan after that.
>>>   - Hyukjin: What if there are many behavior differences? Will this
>>>     require an API to opt in to each one?
>>>   - Ryan: Without knowing how v1 behaves, we can only speculate. But I
>>>     don't think that we will want to support many of these special
>>>     cases. That is a lot of work and maintenance.
>>>   - Gengliang: When can we change the default to v2? Until we change
>>>     the default, v2 is not tested. The v2 work is blocked by this.
>>>   - Ryan: The v2 work should not be blocked by finishing CTAS and the
>>>     other plans. This can proceed in parallel.
>>>   - Matt: We don't need to use the existing tests; we can add tests for
>>>     v2 below the DataFrame writer level.
>>>   - Gengliang: But those tests would not be end-to-end.
>>>   - Ryan: For end-to-end tests, we should add a new DataFrame write
>>>     API. That is going to be needed to move entirely to v2 and drop the
>>>     v1 behavior hacks anyway. Adding it now fixes both problems.
>>>   - Matt: Supports the idea of adding the DataFrame v2 write API now.
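[Ed: The SaveMode ambiguity and its resolution can be sketched roughly as follows. This is not Spark's real translation code; the names (`translateAppend`, the plan strings) are hypothetical. The point is that resolving a v1 `SaveMode.Append` into either a v2 append or a CTAS plan is only possible once a catalog can answer whether the target table exists.]

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the translation discussed above: a v1
// SaveMode.Append is ambiguous, and picking the right v2 plan requires a
// catalog that supports existence checks. All names are hypothetical.
public class SaveModeTranslation {
    enum SaveMode { Append, Overwrite, ErrorIfExists, Ignore }

    /** The only catalog capability this sketch needs: existence checks. */
    interface TableCatalog {
        boolean tableExists(String identifier);
    }

    /** Choose a v2 logical plan name for a v1 Append, using the catalog. */
    static String translateAppend(TableCatalog catalog, String table) {
        // Appending to an existing table is a plain append plan; appending
        // to a missing table was v1's implicit "create by writing", which
        // in v2 becomes an explicit create-table-as-select (CTAS).
        return catalog.tableExists(table) ? "AppendData" : "CreateTableAsSelect";
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>();
        existing.add("db.events");
        TableCatalog catalog = existing::contains;

        System.out.println(translateAppend(catalog, "db.events"));
        System.out.println(translateAppend(catalog, "db.missing"));
    }
}
```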
>>>   - *Consensus for documenting the behavior of v1* (Gengliang will work
>>>     on this because it affects his work.)
>>> - Roadmap:
>>>   - Matt (I think): The community should commit to finishing the
>>>     planned DSv2 work for Spark 3.0.
>>>   - Ryan: Agreed. We can't wait forever, and much of this work has been
>>>     pending for a year now. If this doesn't make it into 3.0, we will
>>>     need to consider other options.
>>>   - Felix: The goal should be 3.0, even if it requires delaying the
>>>     release.
>>>   - *Consensus: Spark 3.0 should include DSv2, even if it requires
>>>     delaying the release.* Ryan will start a discussion thread about
>>>     committing to DSv2 in Spark 3.0.
>>>   - Matt: What DSv2 work is outstanding?
>>>   - Ryan: Addition of the TableCatalog API, the catalog plugin system,
>>>     and the CTAS implementation.
>>>   - Matt: What blocks those things?
>>>   - Ryan: The next blockers are agreement on the catalog plugin system,
>>>     the catalog API approach (separate TableCatalog, FunctionCatalog,
>>>     etc.), and the TableCatalog metadata.
>>>   - *Consensus formed for the catalog plugin system* (as previously
>>>     discussed)
>>>   - *Consensus formed for the catalog API approach*
>>>   - *Consensus formed for the TableCatalog metadata in the SPIP* (Ryan
>>>     will start a vote thread for this SPIP)
>>>   - Ryan: The metadata SPIP also includes a public API that isn't
>>>     required. Will move it to an implementation sketch so it is
>>>     informational.
>>>   - Wenchen: InternalRow is written in Scala and needs a stable API.
>>>   - Ryan: Can we do the InternalRow fix later?
>>>   - Wenchen: Yes, not a blocker.
>>>   - Ryan: The table metadata also contains sort information.
>>>   - Wenchen: Bucketing contains sort information, but it isn't used
>>>     because it applies only to single files.
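[Ed: For readers following along, the agreed metadata shape can be sketched like this. The interface and method names below are illustrative only; the SPIP defines the actual API. It captures the three pieces of common v2 table metadata the group settled on: schema, partitioning, and a string map of properties, with no sort order.]

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the agreed common v2 table metadata: schema,
// partitioning, and a string map of properties. Deliberately no sort-order
// method. Names are illustrative, not the SPIP's real API.
public class TableMetadataSketch {
    /** The three agreed-upon pieces of common v2 table metadata. */
    interface Table {
        List<String> schema();             // column definitions
        List<String> partitioning();       // partition transforms
        Map<String, String> properties();  // free-form string properties
        // Note: intentionally no sortOrder() -- left out for now.
    }

    public static void main(String[] args) {
        Table t = new Table() {
            public List<String> schema() {
                return Arrays.asList("id INT", "ts TIMESTAMP", "data STRING");
            }
            public List<String> partitioning() {
                return Arrays.asList("days(ts)");
            }
            public Map<String, String> properties() {
                return Map.of("provider", "orc");
            }
        };

        System.out.println(t.schema().size());
        System.out.println(t.partitioning().get(0));
        System.out.println(t.properties().get("provider"));
    }
}
```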
>>>   - *Consensus formed on not including sorts in v2 table metadata.*
>>>
>>> *Attendees*:
>>> Ryan Blue
>>> John Zhuge
>>> Dongjoon Hyun
>>> Felix Cheung
>>> Gengliang Wang
>>> Hyukjin Kwon
>>> Jacky Lee
>>> Jamison Bennett
>>> Matt Cheah
>>> Yifei Huang
>>> Russell Spitzer
>>> Wenchen Fan
>>> Yuanjian Li
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix