iconara commented on issue #1704:
URL: https://github.com/apache/arrow-adbc/issues/1704#issuecomment-4743374190

   I've been looking at `GetObjects` from the point of view of implementing it 
for Amazon Athena. Athena does not have a catalog of its own, but uses the AWS 
Glue Data Catalog. There are many mismatches between ADBC's `GetObjects` and 
the reality of this catalog unfortunately:
   
   **Data lake catalogs can be huge.** I've seen catalogs with hundreds of 
thousands of tables. `GetObjects` require the full hierarchy of a catalog to be 
materialized in memory before anything can be returned to the client. In JDBC, 
the tabular `ResultSet` returned from `DatabaseMetaData#getTables` can be 
backed by iterators that don't materialze more than one page each from the 
catalog, schema, and table APIs.
   
   **Data lake catalogs are slow.** It can take minutes to list all the tables 
in a schema. JDBC is able to provide liveness: the client can start showing the 
first table quickly, and continue to load tables in the background.
   
   The resource usage and slowness become a problem in BI tools. Many of them 
assume that listing all tables that a user has access to is a cheap and quick 
operation. We have seen examples where they call synchronously call JDBC 
`DatabaseMetaData#getTables` from a UI thread, passing null for all parameters 
(meaning "give me everything"). While we can't do much about the time it takes 
to fulfill this request, in JDBC we can at least minimize the amount of memory 
used by not materializing more than strictly necessary, and in theory we can 
provide liveness.
   
   **Modern catalogs have arbitrary depth.** In common with JDBC and ODBC, ADBC 
assumes that there is a three level hierarchy: catalog, schema, table. The Glue 
Data Catalog now supports nested catalogs. Since this causes problems almost 
everywhere, there is a flat view of the catalogs, but this is something that 
should influence the design of the ADBC metadata API.
   
   I don't know if a modern metadata API should support arbitrary nesting, but 
maybe? A catalog is more like a graph than a table, and though `GetCatalogs`, 
`GetSchemas`, `GetTables` isn't perfect, it matches the graph nature slightly 
better. Arrow doesn't do graphs very well and I found the list-based approach 
of `GetObject` difficult to work with. In many ways, the tabular style of JDBC 
and OBDC feels like it would fit Arrow much better.
   
   #4400 proposes a generalized API, but I must admit I don't entirely follow 
the proposal since I'm not familiar with ADO.NET – I should read up.
   
   *The API is vague when it comes to filtering.* It's an easy to miss detail 
that the catalog, schema, table, and column parameters to `GetObjects` are 
patterns. Filtering with `LIKE` patterns enables filter pushdown in some 
implementations (if you dare passing the strings down into SQL). It's an 
ODBC/JDBC feature I wish wasn't replicated. Catalog objects tend to be named 
using snake case, and it's very easy to miss that the parameters are patterns, 
not names, and that underscore matches any character. Implementations that 
don't support `LIKE` patterns have to translate these to native filters, and/or 
implementing filtering client side.
   
   I don't have a solution to this, I just wish it was more explicit – it 
should be obvious to client code when a parameter is a pattern and when it is 
not, and for implementations when filtering is expected and when it's not. I 
think ODBC has a feature to turn wildcard handling on and off.
   
   An concrete example: when you use `GetObjects` to list tables, the catalog 
and schema are patterns. This means that the implementation needs to list all 
catalogs, find matches, then list schemas, find matches, and finally list 
tables. Most of the time the catalog and schema will be a specific catalog and 
schema, and listing and filtering is unnecessary overhead. The implementation 
could be optimized by looking at the patterns and if they don't contain 
wildcards, do a direct lookup for the catalog and schema instead of listing and 
filtering (maybe the lookups can be skipped entirely since the metadata for the 
catalog and schema levels are just the names) – but this assumes the client has 
understood the API and is not passing unescaped underscores!
   
   It would help clients if they didn't have to know that underscores have to 
be escaped to avoid the overhead of listing catalogs and schemas, and it would 
help implementations if they knew whether a pattern is a pattern or a concrete 
value
   
   I don't know if JDBC/ODBC style `GetCatalogs`, `GetSchemas`, `GetTables`, 
`GetColums` is the answer. My observation from having implemented support for 
the Glue Data Catalog is that 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to