rahul-madaan commented on code in PR #66844:
URL: https://github.com/apache/airflow/pull/66844#discussion_r3240176523
##########
providers/amazon/src/airflow/providers/amazon/aws/hooks/athena_sql.py:
##########
@@ -177,6 +177,36 @@ def _get_conn_params(self) -> dict[str, str | None]:
aws_domain=self.conn.extra_dejson.get("aws_domain", "amazonaws.com"),
)
+ def get_openlineage_database_info(self, connection):
+ """Return Amazon Athena specific information for OpenLineage."""
+ from airflow.providers.openlineage.sqlparser import DatabaseInfo
+
+ region_name = connection.extra_dejson.get("region_name") or self.region_name
+ authority = f"athena.{region_name}.amazonaws.com" if region_name else "athena.amazonaws.com"
+
+ return DatabaseInfo(
+ scheme="awsathena",
+ authority=authority,
+ information_schema_columns=[
+ "table_schema",
+ "table_name",
+ "column_name",
+ "ordinal_position",
+ "data_type",
+ "table_catalog",
+ ],
+ database=connection.extra_dejson.get("catalog", "AwsDataCatalog"),
+ is_information_schema_cross_db=True,
Review Comment:
Hi @kacpermuda, yes — validated against a real Athena instance before
opening the PR.
## Full evidence
### Engine version is Trino (v3) — confirms dialect choice
```bash
$ aws athena list-work-groups --region us-east-1
{
"WorkGroups": [
{
"Name": "primary",
"EngineVersion": { "EffectiveEngineVersion": "Athena engine version 3" }
}
]
}
```
### `AwsDataCatalog` is the live default catalog
```bash
$ aws athena list-data-catalogs --region us-east-1
{
"DataCatalogsSummary": [
{
"CatalogName": "AwsDataCatalog",
"Type": "GLUE",
"Status": "CREATE_COMPLETE"
}
]
}
```
### Real query against `information_schema.columns` with the exact 6 columns declared by the hook — confirms `information_schema_columns` is correct
```bash
$ aws athena start-query-execution \
--query-string "
SELECT table_schema, table_name, column_name, ordinal_position,
data_type, table_catalog
FROM information_schema.columns
WHERE table_schema='information_schema'
LIMIT 3
" \
--query-execution-context "Database=information_schema,Catalog=AwsDataCatalog" ...
```
**Result:** `State=SUCCEEDED`, `DataScanned=3401 bytes`,
`EngineVersion="Athena engine version 3"`
| table_schema       | table_name       | column_name  | ordinal_position | data_type | table_catalog  |
| ------------------ | ---------------- | ------------ | ---------------- | --------- | -------------- |
| information_schema | applicable_roles | grantee      | 1                | varchar   | awsdatacatalog |
| information_schema | applicable_roles | grantee_type | 2                | varchar   | awsdatacatalog |
| information_schema | applicable_roles | role_name    | 3                | varchar   | awsdatacatalog |
All six column names project correctly with the expected types — same as
`TrinoHook`.
### Real cross-DB query against `information_schema.tables` succeeds — confirms `is_information_schema_cross_db=True`
```bash
$ aws athena start-query-execution \
--query-string "
SELECT table_catalog, table_schema, table_name
FROM information_schema.tables
WHERE table_schema='information_schema'
LIMIT 3
" ...
```
**Result:** `State=SUCCEEDED`, `DataScanned=452 bytes`
## On `use_flat_cross_db_query`
Good catch on the Redshift comparison. I deliberately left it as the default
`False` because Athena and Redshift have fundamentally different metadata
models:
- **Redshift** uses `SVV_REDSHIFT_COLUMNS`, a single global system view
spanning all databases. That's why it needs `use_flat_cross_db_query=True` — to
query the one view with `WHERE`-clause database filters.
- **Athena/Trino** uses the standard per-catalog `information_schema` (the
result above shows `table_catalog = awsdatacatalog` populated correctly).
There's no single global view; cross-DB queries work natively via Trino's
3-part naming, which is what `use_flat_cross_db_query=False` +
`is_information_schema_cross_db=True` generates: per-database queries combined
with `UNION ALL`.
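To make the difference between the two shapes concrete, here is a rough sketch of the SQL each mode would roughly correspond to. This is illustrative only and is not the OpenLineage provider's actual query builder: the helper names `per_database_query` and `flat_query`, the `other_catalog` name, and the `database_name` filter column are assumptions made for the example.

```python
from __future__ import annotations

# Illustrative only: hypothetical helpers, not the OpenLineage provider's code.
COLUMNS = [
    "table_schema",
    "table_name",
    "column_name",
    "ordinal_position",
    "data_type",
    "table_catalog",
]


def per_database_query(tables_by_database: dict[str, list[str]]) -> str:
    """Build one SELECT per database against its own information_schema,
    using 3-part naming, and combine the SELECTs with UNION ALL."""
    selects = []
    for database, tables in tables_by_database.items():
        table_list = ", ".join(f"'{t}'" for t in tables)
        selects.append(
            f"SELECT {', '.join(COLUMNS)} "
            f"FROM {database}.information_schema.columns "
            f"WHERE table_name IN ({table_list})"
        )
    return "\nUNION ALL\n".join(selects)


def flat_query(view: str, tables_by_database: dict[str, list[str]]) -> str:
    """Build a single SELECT against one global view, filtering by database
    in the WHERE clause (the column name database_name is an assumption)."""
    filters = []
    for database, tables in tables_by_database.items():
        table_list = ", ".join(f"'{t}'" for t in tables)
        filters.append(
            f"(database_name = '{database}' AND table_name IN ({table_list}))"
        )
    return f"SELECT * FROM {view} WHERE " + " OR ".join(filters)


# Hypothetical input: two catalogs with one table each.
tables = {"awsdatacatalog": ["orders"], "other_catalog": ["customers"]}
print(per_database_query(tables))  # Athena/Trino-style shape
print(flat_query("SVV_REDSHIFT_COLUMNS", tables))  # Redshift-style shape
```

The first function produces one `information_schema.columns` query per catalog joined by `UNION ALL`, while the second produces a single statement over one view, which only makes sense when such a global view exists.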
This matches `TrinoHook.get_openlineage_database_info()` 1:1 — `TrinoHook`
also doesn't set `use_flat_cross_db_query`, and its
`information_schema_columns` list is identical. Athena engine v3 is Trino under
the hood, so the same OL parameters apply.