I don’t think Spark ever claims to be 100% Hive compatible. I just found some relevant documentation <https://docs.databricks.com/spark/latest/spark-sql/compatibility/hive.html#apache-hive-compatibility> on this, where Databricks claims that “Apache Spark SQL in Databricks is designed to be compatible with the Apache Hive, including metastore connectivity, SerDes, and UDFs.”

There is a strong expectation that Spark is compatible with Hive, and the Spark community has made that claim. That isn’t a claim of 100% compatibility, but no one suggested that there was 100% compatibility.

On Wed, Oct 7, 2020 at 11:54 AM Ryan Blue <rb...@netflix.com> wrote:

> I don’t think Spark ever claims to be 100% Hive compatible.

By accepting the EXTERNAL keyword in some circumstances, Spark is providing compatibility with Hive DDL. Yes, there are places where it breaks. The question is whether we should deliberately break what a Hive catalog could implement, when we know what Hive’s behavior is.

> CREATE EXTERNAL TABLE is not a Hive-specific feature

Great. So there are other catalogs that could use it. Why should Spark choose to limit Hive’s interpretation of this keyword?

While it is great that we seem to agree that Spark shouldn’t do this — now that Nessie was pointed out — I’m concerned that you still seem to think this is a choice that Spark could reasonably make. *Spark cannot arbitrarily choose how to interpret DDL for an external catalog*.

You may not consider this arbitrary because there are other examples where a location is required. But the Hive community made the choice that these clauses are orthogonal, so it is clearly a choice of the external system, and it is not Spark’s role to dictate how an external database should behave.

> I think the Nessie use case is good enough to justify the decoupling of EXTERNAL and LOCATION.

It appears that we have consensus. This will be passed to catalogs, which can implement the behavior that they choose.

On Wed, Oct 7, 2020 at 11:08 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> I have some hive queries that I want to run on Spark.

Spark is not compatible with Hive in many places. Decoupling EXTERNAL and LOCATION won't help you much here. If you do have this use case, we need a much wider discussion about how to achieve it.

For this particular topic, we need concrete use cases like Nessie <https://projectnessie.org/tools/hive/>. It would be great to see more concrete use cases, but I think the Nessie use case is good enough to justify the decoupling of EXTERNAL and LOCATION.

BTW, CREATE EXTERNAL TABLE is not a Hive-specific feature. Many databases have it, which is why I think Hive compatibility alone is not a reasonable use case. For your reference:
1. Snowflake supports CREATE EXTERNAL TABLE and requires the LOCATION clause, as Spark does: doc <https://docs.snowflake.com/en/sql-reference/sql/create-external-table.html>
2. Redshift supports CREATE EXTERNAL TABLE and requires the LOCATION clause, as Spark does: doc <https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html>
3. Db2 supports CREATE EXTERNAL TABLE and requires the DATAOBJECT or FILE_NAME option: doc <https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r_create_ext_table.html>
4. SQL Server also supports CREATE EXTERNAL TABLE: doc <https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15>

> with which Spark claims to be compatible

I don't think Spark ever claims to be 100% Hive compatible. In fact, we diverged from Hive intentionally in several places where we think the Hive behavior was not reasonable and we shouldn't follow it.
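
To make the behavior difference concrete, here is a rough sketch of the DDL in question; the table name, column, and path are made up for illustration, and the Spark behavior described is the current built-in catalog behavior discussed in this thread:

    -- EXTERNAL without LOCATION: valid in Hive (the table is created under
    -- the database's default directory, and dropping it does not delete the
    -- underlying files), but rejected by Spark's built-in catalog today,
    -- which requires EXTERNAL to be accompanied by LOCATION or a path option.
    CREATE EXTERNAL TABLE events (id INT) STORED AS parquet;

    -- EXTERNAL with an explicit LOCATION: accepted by both Hive and Spark.
    CREATE EXTERNAL TABLE events (id INT) STORED AS parquet
    LOCATION '/warehouse/events';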

On Thu, Oct 8, 2020 at 1:58 AM Ryan Blue <rb...@netflix.com> wrote:

> how about LOCATION without EXTERNAL? Currently Spark treats it as an external table.

I think there is some confusion about what Spark has to handle. Regardless of what Spark allows as DDL, these tables can exist in a Hive MetaStore that Spark connects to, and the general expectation is that Spark doesn’t change the meaning of table configuration. There are notable bugs where Spark has different behavior, but that is the expectation.

In this particular case, we’re talking about what can be expressed in DDL that is sent to an external catalog. Spark could (unwisely) choose to disallow some DDL combinations, but the table is implemented through a plugin, so the interpretation is up to the plugin. Spark has no role in choosing how to treat this table, unless it is loaded through Spark’s built-in catalog; in that case, see above.

> I don’t think Hive compatibility itself is a “use case”.

Why?

Hive is an external database that defines its own behavior and with which Spark claims to be compatible. If Hive isn’t a valid use case, then why is EXTERNAL supported at all?

On Wed, Oct 7, 2020 at 10:17 AM Holden Karau <hol...@pigscanfly.ca> wrote:

> I don't think Hive compatibility itself is a "use case".

Ok, let's add on top of this: I have some Hive queries that I want to run on Spark. I believe that makes it a use case.

On Wed, Oct 7, 2020 at 9:57 AM Wenchen Fan <cloud0...@gmail.com> wrote:

I don't think Hive compatibility itself is a "use case".

The Nessie <https://projectnessie.org/tools/hive/> example you mentioned is a reasonable use case to me: some frameworks/applications want to create external tables without a user-specified location, so that they can manage the table directory themselves and implement fancy features.

That said, I now agree it's better to decouple EXTERNAL and LOCATION. We should clearly document that EXTERNAL and LOCATION are only applicable to file-based data sources, and a catalog implementation should fail if a table has the EXTERNAL or LOCATION property but the table provider is not file-based.

BTW, how about LOCATION without EXTERNAL? Currently Spark treats it as an external table. Hive gives a warning when you create managed tables with a custom location, which means this behavior is not recommended. Shall we "infer" EXTERNAL from LOCATION even though it's not Hive compatible?

On Thu, Oct 8, 2020 at 12:24 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

Wenchen, why are you ignoring Hive as a “reasonable use case”?

The keyword came from Hive, and we all agree that a Hive catalog with Hive behavior can’t be implemented if Spark chooses to couple this with LOCATION. Why is this use case not a justification?

Also, the option to keep behavior the same as before is not mutually exclusive with passing EXTERNAL to catalogs. Spark can continue to have the same behavior in its catalog. But Spark cannot just choose to break compatibility with external systems by deciding when to fail certain combinations of DDL options. Choosing not to allow EXTERNAL without LOCATION when it is valid for Hive prevents building a compatible catalog.

There are many reasons to build a Hive-compatible catalog.
A great recent example is Nessie <https://projectnessie.org/tools/hive/>, which enables branching and tagging table states across several table formats and aims to be compatible with Hive.

On Wed, Oct 7, 2020 at 5:51 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> As someone who's had the job of porting different SQL dialects to Spark, I'm also very much in favor of keeping EXTERNAL

Just to be clear: no one is proposing to remove EXTERNAL. The two options we are discussing are:
1. Keep the behavior the same as before, i.e. EXTERNAL must co-exist with LOCATION (or a path option).
2. Always allow EXTERNAL, and decouple it from LOCATION.

I'm fine with option 2 if there are reasonable use cases. I think it's always safer to keep the behavior the same as before. If we want to change the behavior and follow option 2, we need use cases to justify it.

For now, the only use case I see is Hive compatibility: allowing EXTERNAL TABLE without a user-specified LOCATION. Are there any more use cases we are targeting?

On Wed, Oct 7, 2020 at 5:06 AM Holden Karau <hol...@pigscanfly.ca> wrote:

As someone who's had the job of porting different SQL dialects to Spark, I'm also very much in favor of keeping EXTERNAL, and I think Ryan's suggestion of leaving it up to the catalogs on how to handle this makes sense.

On Tue, Oct 6, 2020 at 1:54 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

I would summarize both the problem and the current state differently.

Currently, Spark parses the EXTERNAL keyword for compatibility with Hive SQL, but Spark’s built-in catalog doesn’t allow creating a table with EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks compatibility with Hive SQL* because all combinations of EXTERNAL and LOCATION are valid in Hive, but creating an external table with a default location is not allowed by Spark. Note that Spark must still handle these tables because it shares a metastore with Hive, which can still create them.

Now that catalogs can be plugged in, the question is whether to pass the fact that EXTERNAL was in the CREATE TABLE statement to the v2 catalog handling a create command, or to suppress it and apply Spark’s rule that LOCATION must be present.

If it is not passed to the catalog, then a Hive catalog cannot implement the behavior of Hive SQL, even though Spark added the keyword for Hive compatibility. The Spark catalog can interpret EXTERNAL however Spark chooses to, but I think it is a poor choice to force different behavior on other catalogs.

Wenchen has also argued that the purpose of this is to standardize behavior across catalogs. But hiding EXTERNAL would not accomplish that goal. Whether to physically delete data is a choice that is up to the catalog. Some catalogs have no “external” concept and will always drop data when a table is dropped.
The ability to keep underlying data files is specific to a few catalogs, and whether that is controlled by EXTERNAL, the LOCATION clause, or something else is still up to the catalog implementation.

I don’t think that there is a good reason to force catalogs to break compatibility with Hive SQL while making it appear as though DDL is compatible. Because removing EXTERNAL would be a breaking change to the SQL parser, I think the best option is to pass it to v2 catalogs so the catalog can decide how to handle it.

rb

On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0...@gmail.com> wrote:

Hi all,

I'd like to start a discussion thread about this topic, as it blocks an important feature that we target for Spark 3.1: unifying the CREATE TABLE SQL syntax.

A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden feature in Spark for Hive compatibility.

When you write the native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE ... USING parquet`, the parser fails and tells you that EXTERNAL can't be specified.

When you write the Hive CREATE TABLE syntax, EXTERNAL can be specified only if a LOCATION clause or path option is present. For example, `CREATE EXTERNAL TABLE ... STORED AS parquet` is not allowed, as there is no LOCATION clause or path option. This is not 100% Hive compatible.

As we are unifying the CREATE TABLE SQL syntax, one problem is how to deal with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it was, or we can officially support it.

Please let us know your thoughts:
1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have you used it in production before? For what use cases?
2. As a catalog developer, how are you going to implement EXTERNAL TABLE? It seems to me that it only makes sense for file sources, as the table directory can be managed. I'm not sure how to interpret EXTERNAL in catalogs like JDBC, Cassandra, etc.

For more details, please refer to the long discussion in https://github.com/apache/spark/pull/28026

Thanks,
Wenchen

--
Ryan Blue
Software Engineer
Netflix
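
For reference, a sketch of the concrete statements debated in the thread, with placeholder table names, column, and paths; the behavior noted in the comments is the current behavior as described in the messages above:

    -- Native Spark syntax: the parser rejects EXTERNAL outright.
    -- CREATE EXTERNAL TABLE events (id INT) USING parquet;

    -- Hive syntax without LOCATION: rejected by Spark's built-in catalog,
    -- although valid in Hive.
    -- CREATE EXTERNAL TABLE events (id INT) STORED AS parquet;

    -- Hive syntax with LOCATION (or a path option): accepted by Spark.
    CREATE EXTERNAL TABLE events (id INT) STORED AS parquet
    LOCATION '/warehouse/events';

    -- LOCATION without EXTERNAL: Spark currently treats the table as
    -- external (the data is kept on DROP TABLE), while Hive creates a
    -- managed table at the custom location and gives a warning.
    CREATE TABLE events2 (id INT) USING parquet
    LOCATION '/warehouse/events2';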