On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I don't think Hive compatibility itself is a "use case".

Ok, let's add on top of this: I have some Hive queries that I want to run
on Spark. I believe that makes it a use case.

> The Nessie <https://projectnessie.org/tools/hive/> example you mentioned
> is a reasonable use case to me: some frameworks/applications want to
> create external tables without a user-specified location, so that they
> can manage the table directory themselves and implement fancy features.
>
> That said, now I agree it's better to decouple EXTERNAL and LOCATION. We
> should clearly document that EXTERNAL and LOCATION are only applicable
> to file-based data sources, and that a catalog implementation should
> fail if a table has the EXTERNAL or LOCATION property but the table
> provider is not file-based.
>
> BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as
> an external table. Hive gives a warning when you create a managed table
> with a custom location, which means this behavior is not recommended.
> Shall we "infer" EXTERNAL from LOCATION although it's not Hive
> compatible?
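>
> For concreteness, here is a sketch of the combinations under discussion,
> runnable in a Hive-enabled spark-shell (the table names and paths are
> made up for illustration):
>
>   // EXTERNAL + LOCATION: allowed today; an external table in both
>   // Spark and Hive.
>   spark.sql("CREATE EXTERNAL TABLE t1 (id INT) STORED AS PARQUET LOCATION '/tmp/t1'")
>
>   // LOCATION without EXTERNAL: Spark treats t2 as external; Hive would
>   // create a managed table at the custom location, with a warning.
>   spark.sql("CREATE TABLE t2 (id INT) STORED AS PARQUET LOCATION '/tmp/t2'")
>
>   // EXTERNAL without LOCATION: valid in Hive (the table gets a default
>   // directory), but rejected by Spark today.
>   spark.sql("CREATE EXTERNAL TABLE t3 (id INT) STORED AS PARQUET")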
>
> On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Wenchen, why are you ignoring Hive as a “reasonable use case”?
>>
>> The keyword came from Hive, and we all agree that a Hive catalog with
>> Hive behavior can’t be implemented if Spark chooses to couple this with
>> LOCATION. Why is this use case not a justification?
>>
>> Also, the option to keep behavior the same as before is not mutually
>> exclusive with passing EXTERNAL to catalogs. Spark can continue to have
>> the same behavior in its own catalog. But Spark cannot just choose to
>> break compatibility with external systems by deciding to fail certain
>> combinations of DDL options. Choosing not to allow EXTERNAL without
>> LOCATION, when that combination is valid in Hive, prevents building a
>> compatible catalog.
>>
>> There are many reasons to build a Hive-compatible catalog. A great
>> recent example is Nessie <https://projectnessie.org/tools/hive/>, which
>> enables branching and tagging table states across several table formats
>> and aims to be compatible with Hive.
>>
>> On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> > As someone who's had the job of porting different SQL dialects to
>>> Spark, I'm also very much in favor of keeping EXTERNAL
>>>
>>> Just to be clear: no one is proposing to remove EXTERNAL. The two
>>> options we are discussing are:
>>> 1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist
>>> with LOCATION (or the path option).
>>> 2. Always allow EXTERNAL, and decouple it from LOCATION.
>>>
>>> I'm fine with option 2 if there are reasonable use cases, but I think
>>> it's always safer to keep the behavior the same as before. If we want
>>> to change the behavior and follow option 2, we need use cases to
>>> justify it.
>>>
>>> For now, the only use case I see is Hive compatibility: allowing
>>> EXTERNAL TABLE without a user-specified LOCATION. Are there any more
>>> use cases we are targeting?
>>>
>>> On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> As someone who's had the job of porting different SQL dialects to
>>>> Spark, I'm also very much in favor of keeping EXTERNAL, and I think
>>>> Ryan's suggestion of leaving it up to the catalogs on how to handle
>>>> this makes sense.
>>>>
>>>> On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> I would summarize both the problem and the current state
>>>>> differently.
>>>>>
>>>>> Currently, Spark parses the EXTERNAL keyword for compatibility with
>>>>> Hive SQL, but Spark’s built-in catalog doesn’t allow creating a
>>>>> table with EXTERNAL unless LOCATION is also present. *This “hidden
>>>>> feature” breaks compatibility with Hive SQL* because all
>>>>> combinations of EXTERNAL and LOCATION are valid in Hive, but
>>>>> creating an external table with a default location is not allowed
>>>>> by Spark. Note that Spark must still handle such tables, because it
>>>>> shares a metastore with Hive, which can still create them.
>>>>>
>>>>> Now that catalogs can be plugged in, the question is whether to pass
>>>>> the fact that EXTERNAL was in the CREATE TABLE statement to the v2
>>>>> catalog handling the create command, or to suppress it and apply
>>>>> Spark’s rule that LOCATION must be present.
>>>>>
>>>>> If it is not passed to the catalog, then a Hive catalog cannot
>>>>> implement the behavior of Hive SQL, even though Spark added the
>>>>> keyword for Hive compatibility. The Spark catalog can interpret
>>>>> EXTERNAL however Spark chooses, but I think it is a poor choice to
>>>>> force different behavior on other catalogs.
>>>>>
>>>>> Wenchen has also argued that the purpose of this is to standardize
>>>>> behavior across catalogs. But hiding EXTERNAL would not accomplish
>>>>> that goal. Whether to physically delete data is a choice that is up
>>>>> to the catalog. Some catalogs have no “external” concept and will
>>>>> always drop data when a table is dropped. The ability to keep the
>>>>> underlying data files is specific to a few catalogs, and whether
>>>>> that is controlled by EXTERNAL, the LOCATION clause, or something
>>>>> else is still up to the catalog implementation.
>>>>>
>>>>> I don’t think there is a good reason to force catalogs to break
>>>>> compatibility with Hive SQL while making it appear as though the DDL
>>>>> is compatible. Because removing EXTERNAL would be a breaking change
>>>>> to the SQL parser, I think the best option is to pass it to v2
>>>>> catalogs so that each catalog can decide how to handle it.
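>>>>>
>>>>> To make that concrete, here is a minimal sketch of the catalog-side
>>>>> logic this would enable, assuming Spark forwards EXTERNAL in the v2
>>>>> createTable properties map. The "external" key and the helper below
>>>>> are hypothetical illustrations, not a settled design; PROP_LOCATION
>>>>> is the existing constant in TableCatalog:
>>>>>
>>>>>   import java.util
>>>>>   import org.apache.spark.sql.connector.catalog.TableCatalog
>>>>>
>>>>>   // How a Hive-compatible v2 catalog could interpret the properties
>>>>>   // it receives: every combination of EXTERNAL and LOCATION is
>>>>>   // meaningful, matching Hive SQL.
>>>>>   object HiveSemantics {
>>>>>     def resolve(
>>>>>         properties: util.Map[String, String],
>>>>>         defaultLocation: String): (Boolean, String) = {
>>>>>       // Set only if the user wrote EXTERNAL (hypothetical key).
>>>>>       val external = "true".equalsIgnoreCase(properties.get("external"))
>>>>>       // Fall back to a catalog-chosen default directory when no
>>>>>       // LOCATION was given (the case Spark's built-in catalog
>>>>>       // rejects today).
>>>>>       val location =
>>>>>         Option(properties.get(TableCatalog.PROP_LOCATION))
>>>>>           .getOrElse(defaultLocation)
>>>>>       (external, location)
>>>>>     }
>>>>>   }
>>>>>
>>>>> Whether the keyword maps to a property, a flag, or something else
>>>>> entirely would then be each catalog's decision.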
>>>>>
>>>>> rb
>>>>>
>>>>> On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'd like to start a discussion thread about this topic, as it
>>>>>> blocks an important feature that we target for Spark 3.1: unifying
>>>>>> the CREATE TABLE SQL syntax.
>>>>>>
>>>>>> A bit more background on CREATE EXTERNAL TABLE: it's kind of a
>>>>>> hidden feature in Spark for Hive compatibility.
>>>>>>
>>>>>> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL
>>>>>> TABLE ... USING parquet`, the parser fails and tells you that
>>>>>> EXTERNAL can't be specified.
>>>>>>
>>>>>> When you write Hive CREATE TABLE syntax, EXTERNAL can be specified
>>>>>> only if a LOCATION clause or path option is present. For example,
>>>>>> `CREATE EXTERNAL TABLE ... STORED AS parquet` is not allowed, as
>>>>>> there is no LOCATION clause or path option. This is not 100% Hive
>>>>>> compatible.
>>>>>>
>>>>>> As we are unifying the CREATE TABLE SQL syntax, one problem is how
>>>>>> to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden
>>>>>> feature as it was, or we can officially support it.
>>>>>>
>>>>>> Please let us know your thoughts:
>>>>>> 1. As an end user, what do you expect CREATE EXTERNAL TABLE to do?
>>>>>> Have you used it in production before? For what use cases?
>>>>>> 2. As a catalog developer, how are you going to implement EXTERNAL
>>>>>> TABLE? It seems to me that it only makes sense for file sources, as
>>>>>> the table directory can be managed. I'm not sure how to interpret
>>>>>> EXTERNAL in catalogs like JDBC, Cassandra, etc.
>>>>>>
>>>>>> For more details, please refer to the long discussion in
>>>>>> https://github.com/apache/spark/pull/28026
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau