Everyone is welcome to join this discussion. Just send me an e-mail to get added to the invite.
Stavros, I'll add you.

rb

On Tue, Mar 5, 2019 at 5:43 AM Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:

> Thanks for the update, is this meeting open for other people to join?
>
> Stavros

On Thu, Feb 21, 2019 at 10:56 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

Here are my notes from the DSv2 sync last night. As always, if you have corrections, please reply with them. And if you'd like to be included on the invite to participate in the next sync (6 March), send me an email.

Here's a quick summary of the topics where we had consensus last night:

- The behavior of v1 sources needs to be documented to come up with a migration plan
- Spark 3.0 should include DSv2, even if it would delay the release (pending community discussion and vote)
- Design for the v2 catalog plugin system
- V2 catalog approach of separate TableCatalog, FunctionCatalog, and ViewCatalog interfaces
- Common v2 table metadata should be schema, partitioning, and a string-map of properties, leaving out sorting for now. (Ready to vote on the metadata SPIP.)

*Topics*:

- Issues raised by the ORC v2 commit
- Migration to v2 sources
- Roadmap and current blockers
- Catalog plugin system
- Catalog API separate-interfaces approach
- Catalog API metadata (schema, partitioning, and properties)
- Public catalog API proposal

*Notes*:

- Issues raised by the ORC v2 commit
  - Ryan: Disabled the change to use v2 by default in the PR for overwrite plans: tests rely on CTAS, which is not implemented in v2.
  - Wenchen: Suggested using a StagedTable to work around not having CTAS finished. TableProvider could create a staged table.
  - Ryan: Using StagedTable doesn't make sense to me. It was intended to solve a different problem (atomicity).
    Adding an interface to create a staged table either requires the same metadata as CTAS or requires a blank staged table, which isn't the same concept: these staged tables would behave entirely differently than the ones for atomic operations. Better to spend time getting CTAS done and working through the long-term plan than to hack around it.
  - Second issue raised by the ORC work: how to support tables that use different validations.
  - Ryan: What Gengliang's PRs are missing is a clear definition of which tables require different validation and what that validation should be. In some cases, CTAS is validated against existing data [Ed: this is PreprocessTableCreation] and in some cases, Append has no validation because the table doesn't exist. What isn't clear is when these validations are applied.
  - Ryan: Without knowing exactly how v1 works, we can't mirror that behavior in v2. Building a way to turn off validation is going to be needed, but is insufficient without knowing when to apply it.
  - Ryan: We also don't know whether it will make sense to maintain all of these rules to mimic v1 behavior. In v1, CTAS and Append can both write to existing tables, but they use different rules to validate. What are the differences between them? It is unlikely that Spark will support both as options, if that is even possible. [Ed: see the later discussion on migration that continues this.]
  - Gengliang: Using SaveMode is an option.
  - Ryan: Using SaveMode only appears to fix this, but doesn't actually test v2. Using SaveMode appears to work because it disables all validation and uses code from v1 that will "create" tables by writing. But this isn't helpful for the v2 goal of having defined and reliable behavior.
  - Gengliang: SaveMode is not correctly translated. Append could mean AppendData or CTAS.
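[Ed: The ambiguity Gengliang describes can be made concrete with a small sketch. The names below (translateAppend, "AppendData", "CreateTableAsSelect") are illustrative stand-ins for the concepts in the notes, not Spark's actual classes; the point is that a v1 SaveMode.Append can only be resolved into the right v2 plan once a catalog can answer whether the table exists.]

```java
import java.util.Set;

// Hypothetical sketch of the SaveMode translation discussed above.
// A v1 SaveMode.Append is ambiguous: it means AppendData when the table
// already exists and CTAS when it does not. Resolving it requires a
// catalog that can check table existence.
public class SaveModeTranslation {
    static String translateAppend(Set<String> catalogTables, String table) {
        return catalogTables.contains(table)
                ? "AppendData"            // append to the existing table
                : "CreateTableAsSelect";  // create the table, then write
    }

    public static void main(String[] args) {
        Set<String> catalog = Set.of("db.existing");
        System.out.println(translateAppend(catalog, "db.existing")); // AppendData
        System.out.println(translateAppend(catalog, "db.new"));      // CreateTableAsSelect
    }
}
```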
  - Ryan: This is why we need to focus on finishing the v2 plans: so we can correctly translate the SaveMode into the right plan. That depends on having a catalog for CTAS and to check the existence of a table.
  - Wenchen: The catalog doesn't support path tables, so how does this help?
  - Ryan: The multi-catalog identifiers proposal includes a way to pass paths as CatalogIdentifiers. [Ed: see PathIdentifier] This allows a catalog implementation to handle path-based tables. The identifier will also have a method to test whether the identifier is a path identifier, and catalogs are not required to support path identifiers.
- Migration to v2 sources
  - Hyukjin: Once the ORC upgrade is done, how will we move from v1 to v2?
  - Ryan: We will need to develop v1 and v2 in parallel. There are many code paths in v1 and we don't know exactly what they do. We first need to know what they do, and make a migration plan after that.
  - Hyukjin: What if there are many behavior differences? Will this require an API to opt in to each one?
  - Ryan: Without knowing how v1 behaves, we can only speculate. But I don't think that we will want to support many of these special cases. That is a lot of work and maintenance.
  - Gengliang: When can we change the default to v2? Until we change the default, v2 is not tested. The v2 work is blocked by this.
  - Ryan: v2 work should not be blocked by finishing CTAS and the other plans. This can proceed in parallel.
  - Matt: We don't need to use the existing tests; we can add tests for v2 below the DataFrame writer level.
  - Gengliang: But those tests would not be end-to-end.
  - Ryan: For end-to-end tests, we should add a new DataFrame write API. That is going to be needed to move entirely to v2 and drop the v1 behavior hacks anyway. Adding it now fixes both problems.
  - Matt: Supports the idea of adding the DF v2 write API now.
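[Ed: One way to read the identifier design mentioned above is an identifier type that carries either a namespace/name pair or a path, plus a method to test which it is, so that catalogs can opt out of path support. The names below are hypothetical, not the proposal's actual API.]

```java
// Illustrative sketch of the "pass paths as CatalogIdentifiers" idea.
// Class and method names are hypothetical, not Spark's actual API.
public class Identifiers {
    interface CatalogIdentifier {
        // True when this identifier names a storage location rather than
        // a catalog table; catalogs may reject path identifiers.
        boolean isPathIdentifier();
    }

    record TableIdentifier(String namespace, String name) implements CatalogIdentifier {
        public boolean isPathIdentifier() { return false; }
    }

    record PathIdentifier(String path) implements CatalogIdentifier {
        public boolean isPathIdentifier() { return true; }
    }

    public static void main(String[] args) {
        CatalogIdentifier byName = new TableIdentifier("db", "events");
        CatalogIdentifier byPath = new PathIdentifier("/data/events");
        System.out.println(byName.isPathIdentifier()); // false
        System.out.println(byPath.isPathIdentifier()); // true
    }
}
```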
  - *Consensus for documenting the behavior of v1* (Gengliang will work on this because it affects his work.)
- Roadmap:
  - Matt (I think): The community should commit to finishing the planned DSv2 work for Spark 3.0.
  - Ryan: Agreed; we can't wait forever, and a lot of this work has been pending for a year now. If this doesn't make it into 3.0, we will need to consider other options.
  - Felix: The goal should be 3.0, even if it requires delaying the release.
  - *Consensus: Spark 3.0 should include DSv2, even if it requires delaying the release.* Ryan will start a discussion thread about committing to DSv2 in Spark 3.0.
  - Matt: What DSv2 work is outstanding?
  - Ryan: The addition of the TableCatalog API, the catalog plugin system, and the CTAS implementation.
  - Matt: What blocks those things?
  - Ryan: The next blockers are agreement on the catalog plugin system, the catalog API approach (separate TableCatalog, FunctionCatalog, etc.), and the TableCatalog metadata.
  - *Consensus formed for the catalog plugin system* (as previously discussed)
  - *Consensus formed for the catalog API approach*
  - *Consensus formed for the TableCatalog metadata in the SPIP* (Ryan will start a vote thread for this SPIP)
  - Ryan: The metadata SPIP also includes a public API that isn't required. I will move it to an implementation sketch so it is informational.
  - Wenchen: InternalRow is written in Scala and needs a stable API.
  - Ryan: Can we do the InternalRow fix later?
  - Wenchen: Yes, not a blocker.
  - Ryan: Table metadata also contains sort information.
  - Wenchen: Bucketing contains sort information, but it isn't used because it applies only to single files.
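[Ed: The table metadata shape under discussion (schema, partitioning, and a string-map of properties, with sort information left out) could be sketched roughly as below. The names are illustrative only, not the SPIP's exact API; a real schema type stands in as a list of column strings.]

```java
import java.util.List;
import java.util.Map;

// Rough sketch of the v2 table metadata discussed in the sync: schema,
// partitioning, and string-map properties, with sorts deliberately
// omitted. Names and types are illustrative, not the SPIP's API.
public class TableMetadataSketch {
    record Table(List<String> schema,        // stand-in for a real schema type
                 List<String> partitioning,  // partition expressions
                 Map<String, String> properties) {}

    public static void main(String[] args) {
        Table t = new Table(
                List.of("id bigint", "ts timestamp", "data string"),
                List.of("days(ts)"),
                Map.of("provider", "orc"));
        System.out.println(t.properties().get("provider")); // orc
    }
}
```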
  - *Consensus formed on not including sorts in v2 table metadata.*

*Attendees*:
Ryan Blue
John Zhuge
Dongjoon Hyun
Felix Cheung
Gengliang Wang
Hyukjin Kwon
Jacky Lee
Jamison Bennett
Matt Cheah
Yifei Huang
Russell Spitzer
Wenchen Fan
Yuanjian Li

--
Ryan Blue
Software Engineer
Netflix