There were some extra discussions that happened at
https://github.com/apache/spark/pull/37105.

As of now, we have agreed on a "soft deprecation":
1. document the limitations of the four APIs and suggest alternatives in
the API docs;
2. do not use the @deprecated annotation.

Please let us know if you don't agree.


-Rui

On Fri, Jul 8, 2022 at 11:18 AM Rui Wang <amaliu...@apache.org> wrote:

> Yes. The current goal is a pure educational deprecation.
>
> So given the proposal:
> 1. existing users, or users who do not care about catalog names in table
> identifiers, can still use all the APIs, which keep their past behavior;
> 2. new users who intend to use table identifiers with catalog names get
> warned by the annotation (and perhaps some extra comments on the API
> surface) that these 4 APIs will not serve their use case.
>
> I believe this proposal is conservative: it does not intend to cause
> trouble for existing users, force user migration, delete APIs, or hurt
> supportability. If there is anything I can do to make this goal clearer,
> I will.
>
> Ultimately, the 4 APIs in this thread have the problem that they are not
> compatible with the 3-layer namespace, unlike the other APIs that
> support it. For people who want to include catalog names, the problem
> will remain, and we will probably have to do something about it.
>
> -Rui
>
> On Fri, Jul 8, 2022 at 7:24 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> It's better to keep all APIs working. But in this case, I really have no
>> idea how to make these 4 APIs reasonable. For example, tableExists(dbName:
>> String, tableName: String) currently checks if table "dbName.tableName"
>> exists in the Hive metastore, and does not work with v2 catalogs at all.
>> It's not only a "not needed" API, but also a confusing API. We need a
>> mechanism to move users away from confusing APIs.
>>
>> I agree that we should not abuse deprecation. I think a general
>> principle for deprecation is that you intend to remove the API
>> eventually, which is exactly the case here. We should remove these 4
>> APIs once most users have moved away.
>>
>> Thanks,
>> Wenchen
>>
>> On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Thank you for starting the official discussion, Rui.
>>>
>>> 'Unneeded API' doesn't sound like a good frame for this discussion
>>> because it completely ignores existing users and code.
>>> Technically, the above-mentioned reasons look unrelated to any
>>> specific existing bugs or future maintenance-cost savings. Instead,
>>> the deprecation already imposes costs on the community (your PR, the
>>> future migration guide, and the communication with the customers,
>>> e.g. Q&A) and on users, who must actually migrate to the new APIs
>>> and validate the change. Given that, for now, the goal of this
>>> proposal looks purely educational: advertising the new APIs to
>>> Apache Spark 3.4+ users.
>>>
>>> Can we be more conservative with Apache Spark deprecations and allow
>>> users to use both APIs freely, without any concern about uncertain
>>> future support? I simply want to avoid a situation where the purely
>>> educational deprecation itself becomes an `Unneeded Deprecation` in
>>> the community.
>>>
>>> Dongjoon.
>>>
>>> On Thu, Jul 7, 2022 at 2:26 PM Rui Wang <amaliu...@apache.org> wrote:
>>> >
>>> > I want to highlight in case I missed this in the original email:
>>> >
>>> > The 4 APIs will not be deleted. They will just be marked with
>>> > deprecation annotations, and we will encourage users to use their
>>> > alternatives.
>>> >
>>> >
>>> > -Rui
>>> >
>>> > On Thu, Jul 7, 2022 at 2:23 PM Rui Wang <amaliu...@apache.org> wrote:
>>> >>
>>> >> Hi Community,
>>> >>
>>> >> Proposal:
>>> >> I want to discuss a proposal to deprecate the following Catalog APIs:
>>> >> def listColumns(dbName: String, tableName: String): Dataset[Column]
>>> >> def getTable(dbName: String, tableName: String): Table
>>> >> def getFunction(dbName: String, functionName: String): Function
>>> >> def tableExists(dbName: String, tableName: String): Boolean
>>> >>
>>> >>
>>> >> Context:
>>> >> We have been adding support for table identifiers with catalog names
>>> >> (aka the 3-layer namespace) to the Catalog API in
>>> >> https://issues.apache.org/jira/browse/SPARK-39235.
>>> >> The basic idea is: if an API accepts
>>> >> 1. only tableName: String, we allow it to accept "a.b.c", which goes
>>> >> through the analyzer, treating a as the catalog name, b as the
>>> >> namespace name, and c as the table name;
>>> >> 2. only dbName: String, we allow it to accept "a.b", which goes
>>> >> through the analyzer, treating a as the catalog name and b as the
>>> >> namespace name.
>>> >> Meanwhile, we still maintain backwards compatibility for these APIs
>>> >> so that past behavior remains the same. E.g., if you use only a bare
>>> >> tableName, it is still resolved by the session catalog.
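[Editorial note: the resolution rule described above could be sketched roughly as follows. This is a toy Python model for illustration only, not Spark's implementation; the default catalog name and the naive dot-splitting (which ignores quoted identifiers containing dots) are simplifying assumptions.]

```python
SESSION_CATALOG = "spark_catalog"  # assumed default; illustrative only

def resolve_table_name(table_name: str):
    """Split a possibly multi-part identifier into (catalog, namespace, table).

    Naive sketch: splits on ".", so quoted parts containing dots are not handled.
    """
    parts = table_name.split(".")
    if len(parts) == 1:
        # Past behavior preserved: a bare name resolves in the session catalog.
        return (SESSION_CATALOG, None, parts[0])
    if len(parts) == 2:
        # "db.table": a database within the session catalog.
        return (SESSION_CATALOG, parts[0], parts[1])
    if len(parts) == 3:
        # "catalog.db.table": the 3-layer namespace.
        return (parts[0], parts[1], parts[2])
    raise ValueError(f"too many name parts: {table_name!r}")
```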
>>> >>
>>> >> With this effort ongoing, the above 4 APIs are not fully compatible
>>> >> with the 3-layer namespace.
>>> >>
>>> >> Take tableExists(dbName: String, tableName: String) as an example:
>>> >> it takes two parameters but leaves no room for an extra catalog
>>> >> name. And if we wanted to reuse the two parameters, which one would
>>> >> take more than one name part?
>>> >>
>>> >>
>>> >> How?
>>> >> So how can we improve the above 4 APIs? There are two options:
>>> >> a. Expand the four APIs to accept catalog names. For example,
>>> >> tableExists(catalogName: String, dbName: String, tableName:
>>> >> String).
>>> >> b. Mark the APIs as `deprecated`.
>>> >>
>>> >> I am proposing to follow option B, API deprecation.
>>> >>
>>> >> Why?
>>> >> 1. Reduce unneeded APIs. The existing APIs can support the same
>>> >> behavior given SPARK-39235. For example, tableExists(dbName,
>>> >> tableName) can be replaced with tableExists("dbName.tableName").
>>> >> 2. Reduce incomplete APIs. The APIs proposed for deprecation do not
>>> >> support the 3-layer namespace today, and it is hard to make them do
>>> >> so (where would the three name parts go?).
>>> >> 3. Deprecation encourages users to migrate their usage to the new
>>> >> APIs.
>>> >> 4. There is existing precedent: we deprecated the CreateExternalTable
>>> >> API when adding the CreateTable API:
>>> https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220
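[Editorial note: the replacement in point 1, with a compatibility shim that warns but keeps working, could be sketched like this. This is a toy Python model, not Spark's Catalog implementation; the table set and function names are invented for illustration.]

```python
import warnings

# Toy "metastore": a set of fully qualified table names (illustrative only).
_TABLES = {"db1.t1", "spark_catalog.db2.t2"}

def table_exists(table_name: str) -> bool:
    """New single-string form: accepts "table", "db.table", or "catalog.db.table"."""
    return table_name in _TABLES

def table_exists_legacy(db_name: str, table_name: str) -> bool:
    """Deprecated two-argument form: kept working, but warns and delegates."""
    warnings.warn(
        'tableExists(dbName, tableName) is deprecated; '
        'use tableExists("dbName.tableName") instead',
        DeprecationWarning,
        stacklevel=2,
    )
    return table_exists(f"{db_name}.{table_name}")
```

The key property is that existing callers keep their past behavior; only the warning nudges them toward the single-string form.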
>>> >>
>>> >>
>>> >> What do you think?
>>> >>
>>> >> Thanks,
>>> >> Rui Wang
>>> >>
>>> >>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
