kaxil opened a new pull request, #68487:
URL: https://github.com/apache/airflow/pull/68487

   `SQLToolset(allowed_tables=[...])` previously restricted only metadata 
discovery (`list_tables`/`get_schema`); the `query` and `check_query` tools 
never checked it, so an agent could read any table by name (`SELECT * FROM 
secret`). This makes `allowed_tables` an enforced boundary on the query tools 
too: the SQL is parsed with sqlglot and rejected before execution if it reaches 
any table not on the list.
   
   ## What it catches
   
   Every table a query reaches must be on the list, resolved with its 
database/catalog: direct, subquery, CTE body, JOIN, set operations, `DESCRIBE`, 
`information_schema`/`pg_catalog`, and DML (with `allow_writes=True`). CTE 
references are excluded by lexical scope, so a same-named CTE in another scope 
cannot hide a real table.
   
   Constructs a `schema.table` list cannot describe are rejected while the list 
is active: table-valued functions (`dblink`), `TABLE('name')` row sources, the 
`TABLE <name>` shorthand, `SHOW`, dynamic SQL (`EXEC`), quoted identifiers 
(case-sensitive on the engine), cross-database references 
(`otherdb.public.orders`), and inline comments.
   
   ## Design rationale
   
   - Reject comments: `SELECT * FROM orders/*!UNION SELECT * FROM secret*/` 
runs the UNION arm on MySQL/MariaDB but is inert to sqlglot and other engines 
(a parser/engine differential). Rejecting comments closes the class instead of 
chasing variants.
   - Catalog-aware: the toolset targets one database, so a catalog-qualified 
reference to another database is refused.
   - Scope-aware CTE handling: excluding CTE names globally let an inner 
same-named CTE hide a real top-level table; exclusion now follows lexical scope 
and CTE ordering.
   
   ## Tradeoffs and limitations
   
   Application-level guardrail (parse-then-check): strong defense-in-depth, not 
a substitute for database permissions. It cannot police data reached through a 
function whose argument is itself SQL or a path: `pg_read_file('...')` (a file) 
or `query_to_xml('SELECT ... FROM other', ...)` / scalar `dblink` (a table, via 
a string the parser cannot read). For a hard boundary, run the connection as a 
least-privilege role with `SELECT` limited to the same tables. Quoted 
identifiers and comments are rejected while a list is active, so agents should 
send unquoted, comment-free SQL on restricted connections.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to