kaxil opened a new pull request, #68487:
URL: https://github.com/apache/airflow/pull/68487
`SQLToolset(allowed_tables=[...])` previously restricted only metadata
discovery (`list_tables`/`get_schema`); the `query` and `check_query` tools
never checked it, so an agent could read any table by name (`SELECT * FROM
secret`). This makes `allowed_tables` an enforced boundary on the query tools
too: the SQL is parsed with sqlglot and rejected before execution if it reaches
any table not on the list.
## What it catches
Every table a query reaches must be on the list, resolved with its
database/catalog: direct, subquery, CTE body, JOIN, set operations, `DESCRIBE`,
`information_schema`/`pg_catalog`, and DML (with `allow_writes=True`). CTE
references are excluded by lexical scope, so a same-named CTE in another scope
cannot hide a real table.
Constructs a `schema.table` list cannot describe are rejected while the list
is active: table-valued functions (`dblink`), `TABLE('name')` row sources, the
`TABLE <name>` shorthand, `SHOW`, dynamic SQL (`EXEC`), quoted identifiers
(case-sensitive on the engine), cross-database references
(`otherdb.public.orders`), and inline comments.
## Design rationale
- Reject comments: `SELECT * FROM orders/*!UNION SELECT * FROM secret*/`
runs the UNION arm on MySQL/MariaDB but is inert to sqlglot and other engines
(a parser/engine differential). Rejecting comments closes the class instead of
chasing variants.
- Catalog-aware: the toolset targets one database, so a catalog-qualified
reference to another database is refused.
- Scope-aware CTE handling: excluding CTE names globally let an inner
same-named CTE hide a real top-level table; exclusion now follows lexical scope
and CTE ordering.
## Tradeoffs and limitations
Application-level guardrail (parse-then-check): strong defense-in-depth, not
a substitute for database permissions. It cannot police data reached through a
function whose argument is itself SQL or a path: `pg_read_file('...')` (a file)
or `query_to_xml('SELECT ... FROM other', ...)` / scalar `dblink` (a table, via
a string the parser cannot read). For a hard boundary, run the connection as a
least-privilege role with `SELECT` limited to the same tables. Quoted
identifiers and comments are rejected while a list is active, so agents should
send unquoted, comment-free SQL on restricted connections.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]