I would summarize both the problem and the current state differently.

Currently, Spark parses the EXTERNAL keyword for compatibility with Hive
SQL, but Spark’s built-in catalog doesn’t allow creating a table with
EXTERNAL unless LOCATION is also present. *This “hidden feature” breaks
compatibility with Hive SQL* because all combinations of EXTERNAL and
LOCATION are valid in Hive, but creating an external table with a default
location is not allowed by Spark. Note that Spark must still handle these
tables because it shares a metastore with Hive, which can still create them.
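
As a minimal sketch (assuming a spark-shell session with Hive support
enabled; the table names and path below are placeholders), the difference
looks like this:

    // Accepted by Spark: EXTERNAL together with an explicit LOCATION.
    spark.sql("""
      CREATE EXTERNAL TABLE ext_with_loc (id INT)
      STORED AS parquet
      LOCATION '/tmp/ext_with_loc'
    """)

    // Rejected by Spark's built-in catalog: EXTERNAL without LOCATION.
    // Hive accepts this and places the table under its default warehouse
    // directory, so the same DDL behaves differently in the two systems.
    spark.sql("""
      CREATE EXTERNAL TABLE ext_default_loc (id INT)
      STORED AS parquet
    """)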

Now that catalogs can be plugged in, the question is whether to pass the fact
that EXTERNAL was present in the CREATE TABLE statement to the v2 catalog
handling the create command, or to suppress it and apply Spark’s rule that
LOCATION must be present.

If it is not passed to the catalog, then a Hive catalog cannot implement
the behavior of Hive SQL, even though Spark added the keyword for Hive
compatibility. The Spark catalog can interpret EXTERNAL however Spark
chooses to, but I think it is a poor choice to force different behavior on
other catalogs.
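
As an illustration of what passing it through could look like: the v2
TableCatalog.createTable call already receives a string-to-string properties
map, so the flag could ride along there. The sketch below assumes exactly
that; the "external" key is made up for illustration, not a key Spark has
committed to.

    import java.util.{HashMap => JHashMap, Map => JMap}

    // Sketch of the decision a Hive-backed v2 catalog could make if the flag
    // arrived in the createTable properties map. The "external" key is an
    // assumed, illustrative name.
    object HiveStyleCreateSketch {
      def keepDataOnDrop(properties: JMap[String, String]): Boolean =
        "true".equalsIgnoreCase(properties.get("external"))

      def main(args: Array[String]): Unit = {
        val props = new JHashMap[String, String]()
        props.put("external", "true")   // note: no "location" entry at all
        // Hive semantics: this is still a valid table; it gets the default
        // warehouse location, and DROP TABLE leaves the data files in place.
        println(keepDataOnDrop(props))  // true
      }
    }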

Wenchen has also argued that the purpose of suppressing EXTERNAL is to
standardize behavior across catalogs. But hiding it would not accomplish that
goal.
Whether to physically delete data is a choice that is up to the catalog.
Some catalogs have no “external” concept and will always drop data when a
table is dropped. The ability to keep underlying data files is specific to
a few catalogs, and whether that is controlled by EXTERNAL, the LOCATION
clause, or something else is still up to the catalog implementation.
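
To make that concrete, here is a rough sketch (the names are hypothetical,
not a real Spark or connector API) of how two kinds of catalogs would answer
the same question:

    trait DropBehavior {
      // Whether DROP TABLE should also delete the underlying data.
      def dropDeletesData(tableProps: Map[String, String]): Boolean
    }

    // A file-backed, Hive-style catalog can keep the files for external
    // tables, however it chooses to track that flag.
    object HiveLikeCatalog extends DropBehavior {
      def dropDeletesData(tableProps: Map[String, String]): Boolean =
        !tableProps.get("external").exists(_.equalsIgnoreCase("true"))
    }

    // A catalog with no "external" concept (e.g. JDBC, Cassandra) drops the
    // data whenever the table is dropped, whatever the DDL said.
    object JdbcLikeCatalog extends DropBehavior {
      def dropDeletesData(tableProps: Map[String, String]): Boolean = true
    }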

I don’t think that there is a good reason to force catalogs to break
compatibility with Hive SQL while making it appear as though the DDL is
compatible. Because removing EXTERNAL would be a breaking change to the SQL
parser, I think the best option is to pass it to v2 catalogs so the catalog
can decide how to handle it.

rb

On Tue, Oct 6, 2020 at 7:06 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi all,
>
> I'd like to start a discussion thread about this topic, as it blocks an
> important feature that we target for Spark 3.1: unify the CREATE TABLE SQL
> syntax.
>
> A bit more background for CREATE EXTERNAL TABLE: it's kind of a hidden
> feature in Spark for Hive compatibility.
>
> When you write native CREATE TABLE syntax such as `CREATE EXTERNAL TABLE
> ... USING parquet`, the parser fails and tells you that EXTERNAL can't be
> specified.
>
> When you write Hive CREATE TABLE syntax, EXTERNAL can only be specified if a
> LOCATION clause or path option is present. For example, `CREATE EXTERNAL
> TABLE ... STORED AS parquet` is not allowed because there is no LOCATION
> clause or path option. This is not 100% Hive compatible.
>
> As we are unifying the CREATE TABLE SQL syntax, one problem is how to deal
> with CREATE EXTERNAL TABLE. We can keep it as a hidden feature as it was,
> or we can officially support it.
>
> Please let us know your thoughts:
> 1. As an end-user, what do you expect CREATE EXTERNAL TABLE to do? Have
> you used it in production before? For what use cases?
> 2. As a catalog developer, how are you going to implement EXTERNAL TABLE?
> It seems to me that it only makes sense for file sources, as the table
> directory can be managed. I'm not sure how to interpret EXTERNAL in
> catalogs like jdbc, cassandra, etc.
>
> For more details, please refer to the long discussion in
> https://github.com/apache/spark/pull/28026
>
> Thanks,
> Wenchen
>


-- 
Ryan Blue
Software Engineer
Netflix
