kosiew commented on issue #1142: URL: https://github.com/apache/datafusion-python/issues/1142#issuecomment-2969114420
## Context & Problem Statement

- **Current state**
  - **DataFusion core repo:** uses `catalog/schema/table`
  - **DataFusion Python repo:** uses `catalog/database/table`
  - Other usages: `catalog/namespace/table`
- **Problem**
  - The inconsistent terminology confuses users and contributors, especially as the API matures and more complex catalog operations become common.
  - A consistent 3-level hierarchical naming scheme is needed across all interfaces and repositories.
- **Why does this matter?**
  - Clear, consistent naming improves user understanding and reduces errors.
  - Aligning the semantics improves interoperability and makes the documentation clearer.
  - Many users may currently have only one schema/database/namespace, so now is a good time to make the change, before adoption widens.

---

## Key Terms and Their Semantic Meanings

| Term | Typical Meaning in Databases |
|---------------|------------------------------|
| **Catalog** | The highest-level grouping; often corresponds to a data source or cluster. |
| **Schema** | A logical grouping/container within a catalog, often acting as the namespace for tables (e.g., a Postgres schema). |
| **Database** | Sometimes synonymous with catalog (e.g., MySQL: the database is a catalog), other times a separate level beneath the catalog. |
| **Namespace** | A more generic term for a logical scope that contains tables; interchangeable with schema or database, depending on the system. |
| **Table** | The actual table or data object. |

---

## Exploration of the Three Naming Variants

### 1. `catalog/schema/table`

- **Pros**
  - Matches the DataFusion core repo convention, keeping the project consistent with its core.
  - Matches widespread SQL usage, e.g., Oracle/Postgres, where schema = namespace under a catalog.
  - Clear semantic distinction: catalog as the source, schema as the logical grouping.
- **Cons**
  - The term `schema` might confuse users coming from MySQL or other systems where schema = database.
  - In some systems, "database" is the term used instead.

### 2. `catalog/database/table`

- **Pros**
  - Familiar to users of MySQL and other systems that treat "database" as the layer directly above tables.
  - More intuitive for newcomers who think in terms of databases rather than schemas.
- **Cons**
  - Conflicts with DataFusion core (which prefers "schema").
  - The meanings of "database" and "catalog" overlap across systems, risking ambiguity.

### 3. `catalog/namespace/table`

- **Pros**
  - "Namespace" is generic and can adapt to any system (equivalent to schema or database).
  - Avoids confusion by not tying the API to the terminology of a concrete DBMS.
  - Aligns with the abstractions used in distributed systems and catalog APIs.
- **Cons**
  - Less immediately familiar to SQL users.
  - Could add cognitive overhead if users expect more standard terms.

---

## Diagram: How these map onto a conceptual hierarchy

```plaintext
+---------------------------+
|          Catalog          |
| +-----------------------+ |
| | Schema / Database /   | |
| | Namespace             | |
| | +-------------------+ | |
| | |       Table       | | |
| | +-------------------+ | |
| +-----------------------+ |
+---------------------------+
```

- The middle layer is the point of ambiguity: schema / database / namespace.

---

## Recommendations

### Align with DataFusion Core

- Since the **core repo uses `catalog/schema/table`** and DataFusion core is the source of truth, **standardizing on `catalog/schema/table`** is recommended to reduce cognitive dissonance.

### Provide Alias or Flexibility in Python Bindings

- For the Python API and other language bindings, consider exposing aliases or conversion helpers so that users can think in terms of databases or namespaces if they prefer.
- Documentation should clearly state what "schema" means in DataFusion parlance.

### Consider Long-Term Evolution

- If the ecosystem grows to support multiple systems with conflicting terminologies, *consider introducing a configurable abstraction layer* to map terms more flexibly. For now, keep it simple.
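To make the alias recommendation concrete, here is a minimal Python sketch of how a binding could treat `schema` as the canonical accessor while keeping `database` as a thin, warning-emitting alias. The `Catalog`, `Schema`, and `Table` classes and their methods are hypothetical stand-ins written for this illustration; they are not the actual datafusion-python API, and a real change would need to follow that project's deprecation policy.

```python
import warnings


class Table:
    """Hypothetical stand-in for a table handle (illustration only)."""

    def __init__(self, name):
        self.name = name


class Schema:
    """Hypothetical middle layer: a named collection of tables."""

    def __init__(self, name):
        self.name = name
        self._tables = {}

    def register_table(self, table):
        self._tables[table.name] = table

    def table(self, name):
        return self._tables[name]


class Catalog:
    """Hypothetical top layer; `schema` is the canonical accessor."""

    def __init__(self, name):
        self.name = name
        self._schemas = {}

    def register_schema(self, schema):
        self._schemas[schema.name] = schema

    def schema(self, name):
        # Canonical name, matching the catalog/schema/table convention.
        return self._schemas[name]

    def database(self, name):
        # Alias for users who think in terms of databases. It delegates
        # to schema() and nudges callers toward the canonical term.
        warnings.warn(
            "database() is an alias for schema(); prefer schema()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.schema(name)


# Usage: both spellings resolve to the same middle-layer object.
catalog = Catalog("my_catalog")
public = Schema("public")
public.register_table(Table("orders"))
catalog.register_schema(public)

orders = catalog.schema("public").table("orders")
```

Delegating the alias to the canonical method keeps a single code path, so the two names cannot drift apart; whether the alias should warn at all is a design choice left open here.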
---

## Summary of User Impact

| User Scenario | Impact of Change to `catalog/schema/table` |
|--------------------------------|--------------------------------------------|
| Single-schema users | Minimal to no impact; mostly transparent |
| Multi-schema advanced users | Gains clarity and consistency |
| Users from MySQL-style systems | Need to adapt terminology slightly, but this is common in cross-platform tools |
| Documentation and tooling | Greater consistency and clarity |