kaxil opened a new pull request, #62599:
URL: https://github.com/apache/airflow/pull/62599

   Add `LLMSQLQueryOperator` and `@task.llm_sql` decorator — a SQL-specific 
extension of
   `LLMOperator` (#62598) that generates, validates, and returns SQL from 
natural language prompts.
   
   **Depends on #62598** (`LLMOperator` / `@task.llm`).
   
   ## Summary
   
   - **`LLMSQLQueryOperator(LLMOperator)`** inherits prompt handling, hook 
creation, and `agent_params`
     from the base `LLMOperator`, adding SQL-specific behavior:
     - Built-in system prompt enforcing SELECT-only safety rules
     - User-supplied `system_prompt` is *appended* as "Additional instructions" 
rather than replacing the safety rules
     - Schema introspection via `db_conn_id` + `table_names` (uses 
`DbApiHook.get_table_schema()`)
     - SQL validation via `sqlglot` — blocks DDL/DML by default, configurable 
via `allowed_sql_types`
     - Markdown code-fence stripping from LLM output
     - Dialect auto-detection from `DbApiHook.dialect_name`
   - **`@task.llm_sql`** decorator — the decorated callable returns the prompt 
string; the operator handles the rest
   - **`DbApiHook.get_table_schema()`** added to `common.sql` — returns column 
metadata for a table
   - **SQL validation utility** in `common.ai.utils.sql_validation` — parses 
with sqlglot, checks statement types
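   For illustration, the code-fence stripping step could look roughly like this (hypothetical helper name and regex; the PR's actual implementation may differ):

   ```python
   import re

   def strip_code_fences(text: str) -> str:
       """Strip a surrounding Markdown fence (e.g. ```sql ... ```) from LLM output."""
       match = re.match(r"^\s*```(?:\w+)?\s*\n(.*?)\n?\s*```\s*$", text, re.DOTALL)
       return match.group(1).strip() if match else text.strip()
   ```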
   
   ## Design decisions
   
   **Why inherit from `LLMOperator`?** Both operators share `prompt`, 
`llm_conn_id`, `model_id`,
   `system_prompt`, `agent_params`, and the `llm_hook` property. Inheritance 
eliminates ~40 lines of
   duplication and gives SQL users `agent_params` (retries, temperature) for 
free.
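   As a rough sketch of the shape (simplified stand-ins, not the real Airflow classes; constructor signatures are assumed):

   ```python
   # Simplified stand-ins -- the real classes live in Airflow's providers.
   class LLMOperator:
       def __init__(self, *, prompt, llm_conn_id, model_id=None,
                    system_prompt=None, agent_params=None):
           # Shared parameters handled once in the base class.
           self.prompt = prompt
           self.llm_conn_id = llm_conn_id
           self.model_id = model_id
           self.system_prompt = system_prompt
           self.agent_params = agent_params or {}

   class LLMSQLQueryOperator(LLMOperator):
       def __init__(self, *, db_conn_id=None, table_names=None,
                    allowed_sql_types=("SELECT",), **kwargs):
           # SQL-specific parameters layered on top of the shared ones.
           super().__init__(**kwargs)
           self.db_conn_id = db_conn_id
           self.table_names = table_names or []
           self.allowed_sql_types = allowed_sql_types
   ```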
   
   **Why append `system_prompt` instead of replacing?** The built-in SQL safety 
prompt ("Never generate
   DDL/DML...") is a guardrail. User `system_prompt` adds domain-specific 
guidance on top
   ("prefer CTEs", "use LEFT JOINs") without accidentally removing safety 
instructions.
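   The append behavior can be sketched like this (illustrative guardrail wording and helper name, not the PR's exact prompt):

   ```python
   from typing import Optional

   # Illustrative wording, not the PR's exact built-in prompt.
   BUILTIN_SQL_SAFETY_PROMPT = (
       "You are a SQL generator. Only produce SELECT statements. "
       "Never generate DDL/DML such as DROP, DELETE, INSERT, or UPDATE."
   )

   def build_system_prompt(user_system_prompt: Optional[str]) -> str:
       """Append the user's prompt under the guardrail instead of replacing it."""
       if not user_system_prompt:
           return BUILTIN_SQL_SAFETY_PROMPT
       return (
           f"{BUILTIN_SQL_SAFETY_PROMPT}\n\n"
           f"Additional instructions:\n{user_system_prompt}"
       )
   ```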
   
   **Why `sqlglot` for validation?** It parses SQL across dialects without a 
database connection. The
   operator validates *before* execution, catching `DROP TABLE` or `INSERT` 
statements from the LLM
   before they touch any database.
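   A minimal version of such a check might look like the following (the `allowed_sql_types` shape and the mapping to sqlglot expression types are assumptions):

   ```python
   import sqlglot
   from sqlglot import exp

   # Assumed mapping from statement names to sqlglot expression types.
   ALLOWED = {"SELECT": exp.Select}

   def validate_sql(sql: str, allowed_sql_types=("SELECT",)) -> str:
       """Parse the generated SQL and reject any disallowed statement type."""
       allowed = tuple(ALLOWED[t] for t in allowed_sql_types)
       for statement in sqlglot.parse(sql):
           if statement is None or not isinstance(statement, allowed):
               kind = type(statement).__name__ if statement else "unparseable"
               raise ValueError(f"Disallowed SQL statement: {kind}")
       return sql
   ```

   In the operator, `sqlglot.parse` would presumably be passed the dialect auto-detected from `DbApiHook.dialect_name`.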
   
   **Why `schema_context` overrides introspection?** Users may want to provide 
curated schema info
   (e.g., only relevant columns, with business descriptions) rather than raw 
introspected metadata.
   When `schema_context` is set, `db_conn_id`/`table_names` introspection is 
skipped.
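   The precedence can be sketched as follows (hypothetical helper; `introspect` stands in for `DbApiHook.get_table_schema()`):

   ```python
   def resolve_schema_context(schema_context, db_conn_id=None,
                              table_names=None, introspect=None):
       """Prefer user-curated schema_context; fall back to introspection."""
       if schema_context:
           return schema_context  # curated context wins; introspection skipped
       if db_conn_id and table_names and introspect:
           return "\n".join(introspect(db_conn_id, table) for table in table_names)
       return ""
   ```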
   
   ## Usage
   
   ```python
   # Minimal — just a prompt and connection
   llm_sql = LLMSQLQueryOperator(
       task_id="generate_sql",
       prompt="Get active users created this month",
       llm_conn_id="openai_default",
   )
   
   # With schema introspection and validation
   llm_sql = LLMSQLQueryOperator(
       task_id="generate_sql",
       prompt="Get active users created this month",
       llm_conn_id="openai_default",
       db_conn_id="postgres_default",
       table_names=["users", "orders"],
       system_prompt="Prefer CTEs over subqueries.",
       agent_params={"model_settings": {"temperature": 0}},
   )
   
   # Decorator
   @task.llm_sql(llm_conn_id="openai_default", db_conn_id="postgres_default")
   def user_query(region: str):
       return f"Get top 10 users by order count in {region}"
   ```
   
   Co-authored-by: GPK <[email protected]>
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [ ] Yes

