kaxil opened a new pull request, #62599:
URL: https://github.com/apache/airflow/pull/62599
Add `LLMSQLQueryOperator` and `@task.llm_sql` decorator — a SQL-specific
extension of
`LLMOperator` (#62598) that generates, validates, and returns SQL from
natural language prompts.
**Depends on #62598** (`LLMOperator` / `@task.llm`).
## Summary
- **`LLMSQLQueryOperator(LLMOperator)`** inherits prompt handling, hook
creation, and `agent_params`
from the base `LLMOperator`, adding SQL-specific behavior:
- Built-in system prompt enforcing SELECT-only safety rules
- `system_prompt` from user is *appended* as "Additional instructions"
(doesn't replace safety rules)
- Schema introspection via `db_conn_id` + `table_names` (uses
`DbApiHook.get_table_schema()`)
- SQL validation via `sqlglot` — blocks DDL/DML by default, configurable
via `allowed_sql_types`
- Markdown code-fence stripping from LLM output
- Dialect auto-detection from `DbApiHook.dialect_name`
- **`@task.llm_sql`** decorator — callable returns the prompt string,
operator handles the rest
- **`DbApiHook.get_table_schema()`** added to `common.sql` — returns column
metadata for a table
- **SQL validation utility** in `common.ai.utils.sql_validation` — parses
with sqlglot, checks statement types
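Of the features above, the Markdown code-fence stripping is simple enough to sketch inline. This is an illustrative standalone version, not the PR's actual implementation (the helper name `strip_code_fences` and the regex are assumptions); it removes a surrounding ```` ```sql ```` fence that LLMs often wrap around generated queries:

```python
from __future__ import annotations

import re

# Matches an optional language tag after the opening fence, then captures the body.
_FENCE_RE = re.compile(r"^```[a-zA-Z]*\n(.*?)\n?```$", re.DOTALL)

def strip_code_fences(text: str) -> str:
    """Remove a surrounding Markdown code fence from LLM output, if present."""
    stripped = text.strip()
    match = _FENCE_RE.match(stripped)
    return match.group(1).strip() if match else stripped
```

Unfenced output passes through unchanged, so the helper is safe to apply unconditionally to every LLM response.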
## Design decisions
**Why inherit from `LLMOperator`?** Both operators share `prompt`,
`llm_conn_id`, `model_id`,
`system_prompt`, `agent_params`, and the `llm_hook` property. Inheritance
eliminates ~40 lines of
duplication and gives SQL users `agent_params` (retries, temperature) for
free.
**Why append `system_prompt` instead of replacing?** The built-in SQL safety
prompt ("Never generate
DDL/DML...") is a guardrail. User `system_prompt` adds domain-specific
guidance on top
("prefer CTEs", "use LEFT JOINs") without accidentally removing safety
instructions.
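The composition can be sketched as follows; the constant and helper name are illustrative (the PR's real guardrail text and function live in the operator itself):

```python
from __future__ import annotations

# Illustrative stand-in for the operator's built-in SQL safety prompt.
BASE_SQL_SYSTEM_PROMPT = (
    "You are a SQL generator. Produce a single SELECT statement only. "
    "Never generate DDL or DML (CREATE, DROP, INSERT, UPDATE, DELETE)."
)

def build_system_prompt(user_system_prompt: str | None) -> str:
    """Append user guidance under the guardrail rather than replacing it."""
    if not user_system_prompt:
        return BASE_SQL_SYSTEM_PROMPT
    return (
        f"{BASE_SQL_SYSTEM_PROMPT}\n\n"
        f"Additional instructions:\n{user_system_prompt}"
    )
```

Because the user text is always appended after the guardrail, a `system_prompt` like "Prefer CTEs over subqueries." can never silently drop the SELECT-only rule.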
**Why `sqlglot` for validation?** It parses SQL across dialects without a
database connection. The
operator validates *before* execution, catching `DROP TABLE` or `INSERT`
statements from the LLM
before they touch any database.
**Why `schema_context` overrides introspection?** Users may want to provide
curated schema info
(e.g., only relevant columns, with business descriptions) rather than raw
introspected metadata.
When `schema_context` is set, `db_conn_id`/`table_names` introspection is
skipped.
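The precedence is a simple short-circuit; this sketch uses an injected `introspect` callable as a stand-in for the `DbApiHook.get_table_schema()` lookups (all names here are illustrative):

```python
def resolve_schema_context(schema_context, db_conn_id, table_names, introspect):
    """Curated schema_context wins; introspection runs only when it is absent."""
    if schema_context is not None:
        # User-provided context: skip introspection entirely.
        return schema_context
    if db_conn_id and table_names:
        # introspect stands in for a DbApiHook.get_table_schema() call per table.
        return "\n".join(introspect(db_conn_id, table) for table in table_names)
    return None  # no schema information is added to the prompt
```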
## Usage
```python
# Minimal — just a prompt and connection
llm_sql = LLMSQLQueryOperator(
    task_id="generate_sql",
    prompt="Get active users created this month",
    llm_conn_id="openai_default",
)

# With schema introspection and validation
llm_sql = LLMSQLQueryOperator(
    task_id="generate_sql",
    prompt="Get active users created this month",
    llm_conn_id="openai_default",
    db_conn_id="postgres_default",
    table_names=["users", "orders"],
    system_prompt="Prefer CTEs over subqueries.",
    agent_params={"model_settings": {"temperature": 0}},
)

# Decorator
@task.llm_sql(llm_conn_id="openai_default", db_conn_id="postgres_default")
def user_query(region: str):
    return f"Get top 10 users by order count in {region}"
```
Co-authored-by: GPK <[email protected]>
---
##### Was generative AI tooling used to co-author this PR?
- [ ] Yes
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]