kosiew commented on issue #1142:
URL: 
https://github.com/apache/datafusion-python/issues/1142#issuecomment-2969114420

   
   
   ## Context & Problem Statement
   
   - **Current state**  
     - **DataFusion core repo:** uses `catalog/schema/table`  
     - **DataFusion Python repo:** uses `catalog/database/table`  
     - **Other usages:** `catalog/namespace/table`
   
   - **Problem**  
     - The inconsistent terminology confuses users and contributors, especially as the API matures and more complex catalog operations become common.  
     - All interfaces and repositories need one consistent three-level hierarchical naming scheme.
   
   - **Why does this matter?**  
     - Clear, consistent naming improves user understanding and reduces errors.  
     - Aligned semantics improve interoperability and make the documentation clearer.  
     - Many users currently have only one schema/database/namespace, so now is a good time to change, before adoption widens.
   
   ---
   
   ## Key Terms and Their Semantic Meanings
   
   | Term          | Typical Meaning in Databases |
   |---------------|------------------------------|
   | **Catalog**   | The highest-level grouping, often corresponding to a data source or cluster. |
   | **Schema**    | A logical grouping/container within a catalog, often corresponding to a namespace for tables (e.g., Postgres schema). |
   | **Database**  | Sometimes synonymous with catalog (e.g., MySQL: the database is a catalog), other times a separate level beneath it. |
   | **Namespace** | A more generic term for a logical scope that contains tables; can be interchangeable with schema or database depending on the system. |
   | **Table**     | The actual table or data object. |
   
   ---
   
   ## Exploration of the Three Naming Variants
   
   ### 1. `catalog/schema/table`
   
   - **Pros**  
     - Matches the DataFusion core repo convention, keeping the core project consistent.  
     - Matches widespread SQL usage, e.g., Oracle/Postgres, where schema = namespace under catalog.  
     - Clear semantic distinction: catalog as the source, schema as the logical grouping.
   
   - **Cons**  
     - The term `schema` might confuse users coming from MySQL or other systems where schema = database.  
     - Some systems use "database" for this level instead.
   
   ### 2. `catalog/database/table`
   
   - **Pros**  
     - Familiar to users from MySQL, BigQuery, and others that treat "database" 
as the middle layer.  
     - More intuitive for newcomers who think in terms of databases rather than 
schemas.
   
   - **Cons**  
     - Conflicts with DataFusion core (which prefers "schema").  
     - "Database" and "catalog" meanings overlap in different systems, risking ambiguity.
   
   ### 3. `catalog/namespace/table`
   
   - **Pros**  
     - Namespace is generic and can adapt to any system (equivalent to schema 
or database).  
     - Avoids confusion by not tying to concrete DBMS terminology.  
     - Aligns with abstraction in distributed systems and catalog APIs.
   
   - **Cons**  
     - Less immediately familiar to SQL users.  
     - Could add cognitive overhead if users expect more standard terms.
   
   ---
   
   ## Diagram: How these map onto a conceptual hierarchy
   
   ```plaintext
   +---------------------------+
   |        Catalog            |
   | +-----------------------+ |
   | | Schema / Database /   | |
   | | Namespace             | |
   | | +-------------------+ | |
   | | |       Table       | | |
   | | +-------------------+ | |
   | +-----------------------+ |
   +---------------------------+
   ```
   
   - The middle layer is the point of ambiguity: schema / database / namespace.
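
To make the hierarchy concrete, here is a minimal sketch of resolving a dotted name into its three levels. `TableRef`, `resolve`, and the two default names are hypothetical and exist only for this illustration; the defaults are assumptions and should be checked against the session configuration actually in use.

```python
from dataclasses import dataclass

# Assumed defaults, for illustration only.
DEFAULT_CATALOG = "datafusion"
DEFAULT_SCHEMA = "public"


@dataclass(frozen=True)
class TableRef:
    """Hypothetical fully qualified reference: catalog/schema/table."""
    catalog: str
    schema: str
    table: str


def resolve(name: str) -> TableRef:
    """Resolve a dotted name with one to three parts into a TableRef."""
    parts = name.split(".")
    if not 1 <= len(parts) <= 3 or any(p == "" for p in parts):
        raise ValueError(f"expected 1 to 3 non-empty parts, got {name!r}")
    # Fill in the missing leading levels (catalog, then schema) with defaults.
    defaults = [DEFAULT_CATALOG, DEFAULT_SCHEMA]
    return TableRef(*(defaults[: 3 - len(parts)] + parts))
```

Whatever the middle level ends up being called, it occupies the second slot of the reference; only the label changes.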
   
   ---
   
   ## Recommendations
   
   ### Align with DataFusion Core
   
   - Since the **core repo uses `catalog/schema/table`**, and DataFusion core is the source of truth, **standardizing on `catalog/schema/table`** is recommended so that users see the same terms in every repository.
   
   ### Provide Alias or Flexibility in Python Bindings
   
   - For the Python API and other language bindings, consider exposing aliases or conversion helpers so that users can keep thinking in terms of databases or namespaces if they prefer.  
   - Documentation should clearly state what "schema" means in DataFusion parlance.
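
One way the alias idea could look is sketched below. This is not the actual datafusion-python API: the `Catalog` and `Schema` classes, the `register_schema` helper, and the default name `"public"` are all assumptions made for the sketch. The canonical accessor is `schema()`, and `database()` remains as a thin alias that emits a `DeprecationWarning`.

```python
import warnings


class Schema:
    """Hypothetical middle-level container holding tables by name."""

    def __init__(self, name):
        self.name = name
        self.tables = {}


class Catalog:
    """Hypothetical top-level container holding schemas by name."""

    def __init__(self, name):
        self.name = name
        self._schemas = {}

    def register_schema(self, name):
        # Create the schema on first use; return the existing one afterwards.
        if name not in self._schemas:
            self._schemas[name] = Schema(name)
        return self._schemas[name]

    def schema(self, name="public"):
        """Canonical accessor for the middle level."""
        return self._schemas[name]

    def database(self, name="public"):
        """Alias kept for users who think in terms of databases."""
        warnings.warn(
            "database() is deprecated; use schema() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.schema(name)
```

Keeping the alias as a one-line delegate means both names always return the same object, so existing code keeps working while the warning points users toward the canonical term.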
   
   ### Consider Long-Term Evolution
   
   - If the ecosystem grows to support multiple systems with conflicting 
terminologies, *consider introducing a configurable abstraction layer* to map 
terms more flexibly. For now, keep it simple.
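
If such a mapping layer were ever needed, it could start as small as the sketch below. The names `ALIASES`, `CANONICAL_LEVELS`, and `normalize_levels` are hypothetical; the sketch only shows how "database" and "namespace" could be folded into the canonical "schema" level.

```python
# Canonical level names, following the catalog/schema/table convention.
CANONICAL_LEVELS = ("catalog", "schema", "table")

# Alternative names for the middle level, mapped to the canonical term.
ALIASES = {"database": "schema", "namespace": "schema"}


def normalize_levels(levels: dict) -> dict:
    """Return a copy of `levels` keyed by canonical level names."""
    out = {}
    for key, value in levels.items():
        canonical = ALIASES.get(key, key)
        if canonical not in CANONICAL_LEVELS:
            raise KeyError(f"unknown level: {key!r}")
        if canonical in out:
            # e.g. both "schema" and "database" were supplied
            raise ValueError(f"level {canonical!r} given more than once")
        out[canonical] = value
    return out
```

For example, `normalize_levels({"catalog": "c", "namespace": "n", "table": "t"})` yields a dictionary keyed by `catalog`, `schema`, and `table`.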
   
   ---
   
   ## Summary of User Impact
   
   | User Scenario                  | Impact of Change to `catalog/schema/table` |
   |--------------------------------|--------------------------------------------|
   | Single-schema users            | Minimal to no impact; mostly transparent |
   | Multi-schema advanced users    | Gain clarity and consistency |
   | Users from MySQL-style systems | Need to adapt terminology slightly, but this is common in cross-platform tools |
   | Documentation and tooling      | Greater consistency and clarity |
   
   ---
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

