iconara commented on issue #1704: URL: https://github.com/apache/arrow-adbc/issues/1704#issuecomment-4743374190
I've been looking at `GetObjects` from the point of view of implementing it for Amazon Athena. Athena does not have a catalog of its own, but uses the AWS Glue Data Catalog. There are many mismatches between ADBC's `GetObjects` and the reality of this catalog unfortunately: **Data lake catalogs can be huge.** I've seen catalogs with hundreds of thousands of tables. `GetObjects` require the full hierarchy of a catalog to be materialized in memory before anything can be returned to the client. In JDBC, the tabular `ResultSet` returned from `DatabaseMetaData#getTables` can be backed by iterators that don't materialze more than one page each from the catalog, schema, and table APIs. **Data lake catalogs are slow.** It can take minutes to list all the tables in a schema. JDBC is able to provide liveness: the client can start showing the first table quickly, and continue to load tables in the background. The resource usage and slowness become a problem in BI tools. Many of them assume that listing all tables that a user has access to is a cheap and quick operation. We have seen examples where they call synchronously call JDBC `DatabaseMetaData#getTables` from a UI thread, passing null for all parameters (meaning "give me everything"). While we can't do much about the time it takes to fulfill this request, in JDBC we can at least minimize the amount of memory used by not materializing more than strictly necessary, and in theory we can provide liveness. **Modern catalogs have arbitrary depth.** In common with JDBC and ODBC, ADBC assumes that there is a three level hierarchy: catalog, schema, table. The Glue Data Catalog now supports nested catalogs. Since this causes problems almost everywhere, there is a flat view of the catalogs, but this is something that should influence the design of the ADBC metadata API. I don't know if a modern metadata API should support arbitrary nesting, but maybe? A catalog is more like a graph than a table, and though `GetCatalogs`, `GetSchemas`, `GetTables` isn't perfect, it matches the graph nature slightly better. Arrow doesn't do graphs very well and I found the list-based approach of `GetObject` difficult to work with. In many ways, the tabular style of JDBC and OBDC feels like it would fit Arrow much better. #4400 proposes a generalized API, but I must admit I don't entirely follow the proposal since I'm not familiar with ADO.NET – I should read up. *The API is vague when it comes to filtering.* It's an easy to miss detail that the catalog, schema, table, and column parameters to `GetObjects` are patterns. Filtering with `LIKE` patterns enables filter pushdown in some implementations (if you dare passing the strings down into SQL). It's an ODBC/JDBC feature I wish wasn't replicated. Catalog objects tend to be named using snake case, and it's very easy to miss that the parameters are patterns, not names, and that underscore matches any character. Implementations that don't support `LIKE` patterns have to translate these to native filters, and/or implementing filtering client side. I don't have a solution to this, I just wish it was more explicit – it should be obvious to client code when a parameter is a pattern and when it is not, and for implementations when filtering is expected and when it's not. I think ODBC has a feature to turn wildcard handling on and off. An concrete example: when you use `GetObjects` to list tables, the catalog and schema are patterns. This means that the implementation needs to list all catalogs, find matches, then list schemas, find matches, and finally list tables. Most of the time the catalog and schema will be a specific catalog and schema, and listing and filtering is unnecessary overhead. The implementation could be optimized by looking at the patterns and if they don't contain wildcards, do a direct lookup for the catalog and schema instead of listing and filtering (maybe the lookups can be skipped entirely since the metadata for the catalog and schema levels are just the names) – but this assumes the client has understood the API and is not passing unescaped underscores! It would help clients if they didn't have to know that underscores have to be escaped to avoid the overhead of listing catalogs and schemas, and it would help implementations if they knew whether a pattern is a pattern or a concrete value I don't know if JDBC/ODBC style `GetCatalogs`, `GetSchemas`, `GetTables`, `GetColums` is the answer. My observation from having implemented support for the Glue Data Catalog is that -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
