Thanks, Sung for the great writeup and explaining the context.
I am +1 to remove the catalog name as part of the table identifier. In PyIceberg, each instance of a catalog has an associated name. So any functions of the catalog instance must be related to that catalog name. In the case where a single REST endpoint can represent many different catalog instances, we can use multiple instances of the catalog object. This is similar to Trino/Spark’s `USE <catalog>` feature. The only potential issue I can see is if, for some reason, we would want to refer to multiple catalogs in the same statement, i.e. “SELECT * FROM catalog1.foo.bar join catalog2.foo.bar USING col_a”. But I don’t see this being used in any of the current functions. Regarding the deprecation, I think it’ll be good to do it in multiple steps. Perhaps we can start with a WARNING log whenever a catalog name is used while allowing table identifiers both with and without the catalog name. Then, in a future release, we can completely remove the catalog name. Thanks, Kevin Liu On Wed, Jul 31, 2024 at 2:56 PM Sung Yun <sungwy...@gmail.com> wrote: > Today in PyIceberg, we have support for identifier parsing in public APIs > belonging to two different classes: > > > - Catalog class: load_table, purge_table, drop_table > - Table class: scan > > > These APIs currently have optional support for the identifier that the > instance itself belongs to. > > For example, the catalog class APIs support: > > > *catalog = load_catalog(“animals”, > **properties)catalog.load_table(“cats.whiskers”)* > > But it also supports: > > *catalog.load_table(“animals.cats.whiskers”)* > > Which is redundant, because the catalog.name is already “animals”. > > Similarly, row_filter in the Table scan API supports: > > > *table = catalog.load_table(“cats.whiskers”)table.scan(row_filter=”n_legs > == 4”)* > > But we also support > > > *table.scan(row_filter=”whiskers.n_legs == 4”)* > Which is also redundant, because the table name is already “whiskers” (or > cats.whiskers) > > While it sounds like a good change, I’d still like to open this thread to > discuss the possibility of removing this optional support for the > instance-level identifier as it will result in a backward incompatible API > behavior change. > > The benefits of this change are as follows: > > - As observed above, specifying instance-level identifier in these > APIs is redundant > - This optional support adds a lot of complexity to the code base and > leads to issues like: #742 > <https://github.com/apache/iceberg-python/issues/742> It would be > really great to clean this up before as we prepare for a 1.0.0 later this > year > - The optional support opens up the possibility of resulting in > correctness issues if there exists a name in the level below as the > instance-level identifier. > - For example, if in the above catalog, we have a table namespace > named “animals.lower” catalog.load(“animals.lower.cats”) can be > construed > as table name “cats” in the namespace “animals.lower” but it will be > interpreted as table name “cats” in the namespace “lower” which is > erroneous. > - We would see a similar issue with tables and field names as well. > Field name parsing is already complicated because we have to represent > nested fields as flat representations. So it would be great to remove > one > unnecessary level of complication here > > > I'd love to hear from the community on their thoughts on this topic. If > there are any folks in the community using the optional feature, it would > be especially great to hear from you as well, on what this change will mean > for your applications. > > Related PR: #963 <https://github.com/apache/iceberg-python/pull/963> > > Sung >