Re: DataSourceV2 sync notes - 20 Feb 2019
Thanks Ryan!

On Tue, Mar 5, 2019 at 7:19 PM Ryan Blue wrote:

> Everyone is welcome to join this discussion. Just send me an e-mail to get
> added to the invite.
>
> Stavros, I'll add you.
>
> rb
Re: DataSourceV2 sync notes - 20 Feb 2019
Everyone is welcome to join this discussion. Just send me an e-mail to get added to the invite.

Stavros, I'll add you.

rb

On Tue, Mar 5, 2019 at 5:43 AM Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:

> Thanks for the update, is this meeting open for other people to join?
>
> Stavros
Re: DataSourceV2 sync notes - 20 Feb 2019
Thanks for the update, is this meeting open for other people to join?

Stavros

On Thu, Feb 21, 2019 at 10:56 PM Ryan Blue wrote:
DataSourceV2 sync notes - 20 Feb 2019
Here are my notes from the DSv2 sync last night. As always, if you have corrections, please reply with them. And if you’d like to be included on the invite to participate in the next sync (6 March), send me an email.

Here’s a quick summary of the topics where we had consensus last night:

- The behavior of v1 sources needs to be documented in order to come up with a migration plan
- Spark 3.0 should include DSv2, even if it would delay the release (pending community discussion and vote)
- Design for the v2 catalog plugin system
- V2 catalog approach of separate TableCatalog, FunctionCatalog, and ViewCatalog interfaces
- Common v2 Table metadata should be schema, partitioning, and a string map of properties, leaving out sorting for now. (Ready to vote on the metadata SPIP.)

*Topics*:

- Issues raised by the ORC v2 commit
- Migration to v2 sources
- Roadmap and current blockers
- Catalog plugin system
- Catalog API separate-interfaces approach
- Catalog API metadata (schema, partitioning, and properties)
- Public catalog API proposal

*Notes*:

- Issues raised by the ORC v2 commit
   - Ryan: Disabled the change to use v2 by default in the PR for overwrite plans: the tests rely on CTAS, which is not implemented in v2.
   - Wenchen: Suggested using a StagedTable to work around CTAS not being finished. TableProvider could create a staged table.
   - Ryan: Using StagedTable doesn’t make sense to me. It was intended to solve a different problem (atomicity). Adding an interface to create a staged table either requires the same metadata as CTAS or requires a blank staged table, which isn’t the same concept: these staged tables would behave entirely differently from the ones for atomic operations. Better to spend the time getting CTAS done and working through the long-term plan than to hack around it.
   - Second issue raised by the ORC work: how to support tables that use different validations.
   - Ryan: What Gengliang’s PRs are missing is a clear definition of which tables require different validation and what that validation should be. In some cases, CTAS is validated against existing data [Ed: this is PreprocessTableCreation] and in some cases, Append has no validation because the table doesn’t exist. What isn’t clear is when these validations are applied.
   - Ryan: Without knowing exactly how v1 works, we can’t mirror that behavior in v2. Building a way to turn off validation is going to be needed, but it is insufficient without knowing when to apply it.
   - Ryan: We also don’t know whether it will make sense to maintain all of these rules to mimic v1 behavior. In v1, CTAS and Append can both write to existing tables, but they use different rules to validate. What are the differences between them? It is unlikely that Spark will support both as options, if that is even possible. [Ed: see the later discussion on migration that continues this.]
   - Gengliang: Using SaveMode is an option.
   - Ryan: Using SaveMode only appears to fix this, but doesn’t actually test v2. It appears to work because it disables all validation and uses code from v1 that will “create” tables by writing. But this isn’t helpful for the v2 goal of having defined and reliable behavior.
   - Gengliang: SaveMode is not correctly translated. Append could mean AppendData or CTAS.
   - Ryan: This is why we need to focus on finishing the v2 plans: so we can correctly translate each SaveMode into the right plan. That depends on having a catalog, both to implement CTAS and to check whether a table exists.
   - Wenchen: The catalog doesn’t support path tables, so how does this help?
   - Ryan: The multi-catalog identifiers proposal includes a way to pass paths as CatalogIdentifiers. [Ed: see PathIdentifier.] This allows a catalog implementation to handle path-based tables. The identifier will also have a method to test whether it is a path identifier, and catalogs are not required to support path identifiers.
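[Ed: an illustrative sketch of the SaveMode-translation problem discussed above. Whether an Append becomes AppendData or CTAS depends on a catalog existence check, which is why the v2 plans depend on a catalog. The enum values, plan names, and logic below are simplified assumptions for illustration, not Spark’s actual classes.]

```java
// [Ed: illustrative only; names are assumptions, not Spark's actual API.]
enum SaveMode { APPEND, OVERWRITE, ERROR_IF_EXISTS, IGNORE }

enum V2Plan { APPEND_DATA, CTAS, OVERWRITE, NO_OP }

class SaveModeTranslator {
    // The same v1 mode maps to different v2 plans depending on whether the
    // target table already exists, which only a catalog can answer.
    static V2Plan translate(SaveMode mode, boolean tableExists) {
        switch (mode) {
            case APPEND:
                // The ambiguity from the notes: v1 "Append" appends to an
                // existing table, but also "creates" a missing one by writing.
                return tableExists ? V2Plan.APPEND_DATA : V2Plan.CTAS;
            case OVERWRITE:
                return tableExists ? V2Plan.OVERWRITE : V2Plan.CTAS;
            case ERROR_IF_EXISTS:
                if (tableExists) {
                    throw new IllegalStateException("table already exists");
                }
                return V2Plan.CTAS;
            case IGNORE:
                return tableExists ? V2Plan.NO_OP : V2Plan.CTAS;
            default:
                throw new IllegalArgumentException(String.valueOf(mode));
        }
    }
}
```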
- Migration to v2 sources
   - Hyukjin: Once the ORC upgrade is done, how will we move from v1 to v2?
   - Ryan: We will need to develop v1 and v2 in parallel. There are many code paths in v1 and we don’t know exactly what they do. We first need to know what they do, and make a migration plan after that.
   - Hyukjin: What if there are many behavior differences? Will this require an API to opt in to each one?
   - Ryan: Without knowing how v1 behaves, we can only speculate. But I don’t think we will want to support many of these special cases. That is a lot of work and maintenance.
   - Gengliang: When can we change the default to v2? Until we change the default, v2 is not tested. The v2 work is blocked by this.
   - Ryan: v2 work should not be [Ed: the archived message is truncated here.]
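[Ed: to make the consensus items at the top of the notes concrete, here is a rough sketch of the separate-interfaces catalog design and the agreed common table metadata (schema, partitioning, and a string map of properties). All names and signatures below are illustrative assumptions, not the final DSv2 API; in particular, Spark’s real API uses StructType and partition transforms where this sketch uses plain strings.]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Base plugin: every catalog has a name and is initialized from options.
interface CatalogPlugin {
    String name();
    void initialize(Map<String, String> options);
}

// Common v2 table metadata per the consensus: schema, partitioning, and
// a string map of properties (sorting left out for now).
interface Table {
    String schema();              // simplified; Spark uses StructType
    List<String> partitioning();  // simplified; Spark uses transforms
    Map<String, String> properties();
}

// Table operations live in their own interface, so an implementation can
// support tables without also supporting functions or views.
interface TableCatalog extends CatalogPlugin {
    Table createTable(String ident, String schema, List<String> partitioning,
                      Map<String, String> properties);
    Table loadTable(String ident);
    boolean tableExists(String ident);
}

// Minimal in-memory implementation, only to exercise the sketch.
class InMemoryTableCatalog implements TableCatalog {
    private String name = "memory";
    private final Map<String, Table> tables = new HashMap<>();

    public String name() { return name; }

    public void initialize(Map<String, String> options) {
        this.name = options.getOrDefault("name", name);
    }

    public Table createTable(String ident, String schema,
                             List<String> partitioning,
                             Map<String, String> properties) {
        Table table = new Table() {
            public String schema() { return schema; }
            public List<String> partitioning() { return partitioning; }
            public Map<String, String> properties() { return properties; }
        };
        tables.put(ident, table);
        return table;
    }

    public Table loadTable(String ident) { return tables.get(ident); }

    public boolean tableExists(String ident) { return tables.containsKey(ident); }
}
```

[Ed: a FunctionCatalog or ViewCatalog would extend CatalogPlugin the same way, so an implementation opts in to exactly the capabilities it supports.]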