Hey everyone,
Building on the previous thread
<https://lists.apache.org/thread/tv3405nx6zpbm6cxbo71yygf8s9sbj6m>
regarding Catalogs in Beam, @Talat Uyarer <[email protected]> and I noticed
several areas where Beam SQL's usability could be significantly improved,
particularly concerning its interaction with existing tables and its
metadata management.
Some gaps we see currently:
- Lack of a DATABASE concept (analogous to BigQuery datasets or Iceberg
namespaces)
- Users are required to execute a redundant CREATE TABLE statement when
reading from a table that already exists
- Beam requires the table name/path to be specified in the LOCATION
property, when it could be inferred from the reference name in CREATE
TABLE <name>. For example, a user currently has to write something like
CREATE TABLE foo.bar(...) LOCATION 'foo.bar'. LOCATION may be necessary
for some IOs like Kafka or Pubsub, but is redundant for others.
- Missing support for SHOW statements, which are crucial for
discoverability, e.g.:
  - SHOW CATALOGS
  - SHOW CURRENT CATALOG
  - SHOW DATABASES FROM catalog_name LIKE 'pay*'
  - SHOW CURRENT DATABASE
  - SHOW TABLES FROM catalog_name.database_name NOT LIKE '*foo'
- Missing support for ALTER statements, which are important for table
schema manipulation and catalog modification, e.g.:
  - ALTER CATALOG my_catalog SET ('foo_property' = 'bar')
  - ALTER TABLE my_table ADD (col1 INTEGER, col2 TIMESTAMP)
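
One way to picture the LOCATION point above is the rough Python sketch
below. It is not Beam code: resolve_location and LOCATION_REQUIRED_TYPES are
hypothetical names made up for illustration, and the set of table types that
need an explicit LOCATION is an assumption based only on the Kafka and Pubsub
examples mentioned above.

```python
from typing import Optional

# Hypothetical: table types whose LOCATION cannot be derived from the table
# name. Kafka and Pubsub are the examples given in this email; the real set
# would depend on each IO.
LOCATION_REQUIRED_TYPES = {"kafka", "pubsub"}


def resolve_location(table_name: str, table_type: str,
                     location: Optional[str] = None) -> str:
    """Return the effective LOCATION for a CREATE TABLE statement."""
    if location is not None:
        # An explicit LOCATION always takes precedence.
        return location
    if table_type.lower() in LOCATION_REQUIRED_TYPES:
        # These IOs have no sensible default, so the user must supply one.
        raise ValueError(
            f"LOCATION is required for table type '{table_type}'")
    # Otherwise fall back to the reference name, so that
    # CREATE TABLE foo.bar(...) behaves as if LOCATION 'foo.bar' were given.
    return table_name
```

With this fallback, CREATE TABLE foo.bar(...) on an Iceberg table would
resolve to 'foo.bar' without a LOCATION clause, while a Kafka table would
still have to specify one.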
I've created a GitHub issue to track these points: #35637
<https://github.com/apache/beam/issues/35637>. Our initial focus is on
enhancing the experience for Iceberg users within Beam SQL, but this should
benefit broader Beam SQL usage as well. Please take a look, and if you
identify any other crucial gaps or have suggestions, feel free to comment
there or reply to this thread.
Thanks,
Ahmed