Re: [DISCUSS] PyIceberg: Remove optional support for instance-level identifier in Catalog and Table APIs

Kevin Liu Tue, 06 Aug 2024 12:37:45 -0700

Thanks, Sung for the great writeup and explaining the context.


I am +1 to remove the catalog name as part of the table identifier. In
PyIceberg, each instance of a catalog has an associated name. So any
functions of the catalog instance must be related to that catalog name. In
the case where a single REST endpoint can represent many different catalog
instances, we can use multiple instances of the catalog object. This is
similar to Trino/Spark’s `USE <catalog>` feature.


The only potential issue I can see is if, for some reason, we would want to
refer to multiple catalogs in the same statement, i.e. “SELECT * FROM
catalog1.foo.bar join catalog2.foo.bar USING col_a”. But I don’t see this
being used in any of the current functions.


Regarding the deprecation, I think it’ll be good to do it in multiple
steps. Perhaps we can start with a WARNING log whenever a catalog name is
used while allowing table identifiers both with and without the catalog
name. Then, in a future release, we can completely remove the catalog name.


Thanks,

Kevin Liu

On Wed, Jul 31, 2024 at 2:56 PM Sung Yun <sungwy...@gmail.com> wrote:

> Today in PyIceberg, we have support for identifier parsing in public APIs
> belonging to two different classes:
>
>
>    - Catalog class: load_table, purge_table, drop_table
>    - Table class: scan
>
>
> These APIs currently have optional support for the identifier that the
> instance itself belongs to.
>
> For example, the catalog class APIs support:
>
>
> *catalog = load_catalog(“animals”,
> **properties)catalog.load_table(“cats.whiskers”)*
>
> But it also supports:
>
> *catalog.load_table(“animals.cats.whiskers”)*
>
> Which is redundant, because the catalog.name is already “animals”.
>
> Similarly, row_filter in the Table scan API supports:
>
>
> *table = catalog.load_table(“cats.whiskers”)table.scan(row_filter=”n_legs
> == 4”)*
>
> But we also support
>
>
> *table.scan(row_filter=”whiskers.n_legs == 4”)*
> Which is also redundant, because the table name is already “whiskers” (or
> cats.whiskers)
>
> While it sounds like a good change, I’d still like to open this thread to
> discuss the possibility of removing this optional support for the
> instance-level identifier as it will result in a backward incompatible API
> behavior change.
>
> The benefits of this change are as follows:
>
>    - As observed above, specifying instance-level identifier in these
>    APIs is redundant
>    - This optional support adds a lot of complexity to the code base and
>    leads to issues like: #742
>    <https://github.com/apache/iceberg-python/issues/742> It would be
>    really great to clean this up before as we prepare for a 1.0.0 later this
>    year
>    - The optional support opens up the possibility of resulting in
>    correctness issues if there exists a name in the level below as the
>    instance-level identifier.
>       - For example, if in the above catalog, we have a table namespace
>       named “animals.lower” catalog.load(“animals.lower.cats”) can be 
> construed
>       as table name “cats” in the namespace “animals.lower” but it will be
>       interpreted as table name “cats” in the namespace “lower” which is
>       erroneous.
>       - We would see a similar issue with tables and field names as well.
>       Field name parsing is already complicated because we have to represent
>       nested fields as flat representations. So it would be great to remove 
> one
>       unnecessary level of complication here
>
>
> I'd love to hear from the community on their thoughts on this topic. If
> there are any folks in the community using the optional feature, it would
> be especially great to hear from you as well, on what this change will mean
> for your applications.
>
> Related PR: #963 <https://github.com/apache/iceberg-python/pull/963>
>
> Sung
>

Re: [DISCUSS] PyIceberg: Remove optional support for instance-level identifier in Catalog and Table APIs

Reply via email to