fl0-m opened a new issue, #40465:
URL: https://github.com/apache/superset/issues/40465

   ### Bug description
   
   In Superset 6.1.0, the new streaming CSV export pipeline introduced by 
#35478 (*"feat(streaming): Streaming CSV uploads for over 100k records for 
constant memory usage"*) bypasses Superset's standard query-preparation 
pipeline. This produces two distinct regressions, both reproducible against 
Trino.
   
   **Bug 1 — CSV exports crash on Trino with `__STREAM_ERROR__`**
   
   The streaming path in 
`superset/commands/streaming_export/base.py::_execute_query_and_stream` sends 
raw chart SQL directly to `engine.execute(text(sql))` without running it 
through `database.mutate_sql_based_on_config()` first. The SQL Superset 
generates for a chart ends with a `LIMIT N;` line — and Trino's HTTP statement 
endpoint rejects trailing semicolons as `mismatched input ';'. Expecting: 
<EOF>`.
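
The trailing-semicolon failure can be illustrated without a database. The sketch below is not Superset's code: `strip_trailing_semicolon` is a hypothetical helper showing the kind of normalization the streaming path would need before handing SQL to a driver that rejects a trailing `;`. It is deliberately naive and assumes a single statement that does not end in a comment.

```python
def strip_trailing_semicolon(sql: str) -> str:
    """Hypothetical helper: drop trailing whitespace and semicolons."""
    # rstrip with a character set removes any mix of these characters
    # from the end of the string, e.g. ";\n" or " ; ".
    return sql.rstrip("; \t\r\n")


raw_sql = "SELECT category FROM orders\nLIMIT 500000;\n"
clean_sql = strip_trailing_semicolon(raw_sql)

# Print the last line of the statement after normalization.
print(clean_sql.splitlines()[-1])
```

In Superset itself the fix would presumably route the SQL through the existing preparation pipeline rather than add a one-off helper like this.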
   
   Because the streaming response has already flushed headers by the time the 
exception fires, Flask cannot change the status code. The generator instead 
writes the sentinel string `__STREAM_ERROR__: Export failed. Please try again 
in some time.` (63 bytes) into the response body and closes the stream. The 
user receives an HTTP 200 with that text inside what should have been their CSV 
file. The frontend has no way to distinguish this from a successful download.
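
Why the client sees a "successful" download can be shown with a plain Python generator, independent of Flask. In this sketch `csv_generator`, `failing_query`, and `looks_like_stream_error` are illustrative names, not Superset's implementation; only the sentinel text is taken from this report. The last function is one possible client-side check, assuming the sentinel prefix stays stable.

```python
from typing import Callable, Iterable, Iterator

SENTINEL = "__STREAM_ERROR__: Export failed. Please try again in some time."


def csv_generator(fetch_rows: Callable[[], Iterable[str]]) -> Iterator[str]:
    """Stream rows; on failure, fall back to writing the sentinel."""
    try:
        yield from fetch_rows()
    except Exception:
        # In a real streamed response the status line and headers have
        # already been sent, so the body is the only place left to
        # report the error.
        yield SENTINEL


def failing_query() -> Iterable[str]:
    """Stand-in for a query the database rejects before returning rows."""
    raise RuntimeError("mismatched input ';'. Expecting: <EOF>")


def looks_like_stream_error(body: bytes) -> bool:
    """Client-side check: does the downloaded 'CSV' start with the sentinel?"""
    return body.startswith(b"__STREAM_ERROR__")


body = "".join(csv_generator(failing_query)).encode("utf-8")
print(len(body), looks_like_stream_error(body))
```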
   
   **Bug 2 — User impersonation is bypassed**
   
   On databases configured with `impersonate_user: true` (Trino, Presto, etc.), 
every other Superset execution site acquires the engine via 
`database.get_sqla_engine_with_context(user_name=…)` so the end user's identity 
is forwarded as the `X-Trino-User` header. The streaming export path acquires 
its engine without this context and runs every query as the service principal.
   
   Consequences:
   - **Audit trail broken** — every CSV export, from every user, shows up in 
the Trino query log as the service account.
   - **Resource-group routing broken** — exports no longer route to the user's 
configured Trino resource group.
   - **Possible authorization bypass** — engines that key per-user authz off 
`X-Trino-User` (Ranger, OPA, file-based ACLs, row/column-level security via 
session-aware views) will see the service account on the streaming path. A 
Superset user may be able to export data via "Download CSV" that they are not 
permitted to read via SQL Lab.
   
   Bug 1 is the visible crash. Bug 2 is independently reproducible: even with 
Bug 1 patched, every CSV export in the Trino query log is attributed to the 
service account.
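
A deployment can check for this misattribution by scanning its Trino query history. The sketch below assumes each record has already been loaded into a dict with `query_id`, `user`, and `source` keys; those key names, the `SERVICE_ACCOUNT` value, and the helper name are assumptions for illustration, not a Trino or Superset API.

```python
SERVICE_ACCOUNT = "superset"         # assumed service principal name
SUPERSET_SOURCE = "Apache Superset"  # Source value shown in this report


def misattributed_queries(records: list[dict]) -> list[dict]:
    """Return Superset-sourced queries that ran as the service account."""
    return [
        r for r in records
        if r.get("source") == SUPERSET_SOURCE
        and r.get("user") == SERVICE_ACCOUNT
    ]


# Hypothetical query-history sample.
history = [
    {"query_id": "q1", "user": "superset", "source": "Apache Superset"},
    {"query_id": "q2", "user": "analyst", "source": "Apache Superset"},
    {"query_id": "q3", "user": "superset", "source": "trino-cli"},
]
print([r["query_id"] for r in misattributed_queries(history)])
```

Queries that Superset legitimately runs as the service account would also match, so the result is a starting point for review rather than proof of a bypass.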
   
   The non-streaming paths (Excel export, SQL Lab, `/api/v1/chart/data` JSON 
renders) are unaffected because they go through the standard query-preparation 
pipeline.
   
   ### How to reproduce the bug
   
   1. Connect Superset 6.1.0 to a Trino cluster with `impersonate_user: true`.
   2. Create a dashboard tile or standalone chart backed by a Trino dataset.
   3. As any logged-in OAuth user (not the service principal), click `…` → 
`Download` → `Export to CSV`.
   4. Open the downloaded file.
   5. Open the Trino UI / query history and locate the corresponding query.
   
   **Expected**
   
   - The CSV contains the chart's data.
   - The Trino query record shows `User: <logged-in user>`, the user's normal 
resource group, and the database's default schema.
   
   **Actual**
   
   - The downloaded file is 63 bytes and contains only:
     ```
     __STREAM_ERROR__: Export failed. Please try again in some time.
     ```
   - The Trino query record shows:
     - `Error Type: USER_ERROR`
     - `Error Code: SYNTAX_ERROR (1)`
     - `Message: line N:13: mismatched input ';'. Expecting: <EOF>`
     - `User: <service principal>` (not the end user)
     - `Resource Group: n/a`
     - `Schema: <empty>`
   
   Performing the same action with `Export to Excel` instead of `Export to CSV` 
works correctly and shows the end user, the right resource group, the default 
schema, and a sqlglot-reformatted SQL body.
   
   ### Side-by-side evidence
   
   Same chart, same user, two consecutive export attempts seconds apart.
   
   **Failing CSV export — streaming path**
   ```
   User:            superset
   Principal:       superset
   Source:          Apache Superset
   Catalog:         my_catalog
   Schema:          (empty)
   Resource Group:  n/a
   Status:          USER_ERROR / SYNTAX_ERROR
   SQL (last line): LIMIT 500000;
   SQL form:        raw, lowercase keywords, DATE '2026-05-20'
   ```
   
   **Succeeding Excel export — non-streaming path**
   ```
   User:            [email protected]               <-- end user via X-Trino-User
   Principal:       superset
   Source:          Apache Superset
   Catalog:         my_catalog
   Schema:          my_schema
   Resource Group:  analysts
   Status:          FINISHED
   SQL (last line): LIMIT 500000
   SQL form:        uppercased keywords, CAST('2026-05-20' AS DATE)
   ```
   
   Both SQL strings are derived from the same chart definition. The differences 
(trailing `;`, missing sqlglot reformat, missing schema context, missing user 
impersonation) are all consequences of the streaming path skipping 
`mutate_sql_based_on_config()` and `get_sqla_engine_with_context(user_name=…)`.
   
   ### Minimal SQL illustrating the difference
   
   What the streaming CSV path sends to Trino (fails):
   ```sql
   SELECT category AS category, region AS region, sum(amount) AS "SUM(amount)"
   FROM (select date, order_id, region, amount, category
         from my_catalog.my_schema.orders) AS virtual_table
   WHERE date >= DATE '2026-05-20' AND date < DATE '2026-05-27'
     AND amount > 100 AND region IS NOT NULL
   GROUP BY category, region
   ORDER BY "SUM(amount)" DESC
   LIMIT 500000;
   ```
   
   What the non-streaming Excel path sends to Trino (works):
   ```sql
   SELECT
     category AS category,
     region AS region,
     SUM(amount) AS "SUM(amount)"
   FROM (
     SELECT date, order_id, region, amount, category
     FROM my_catalog.my_schema.orders
   ) AS virtual_table
   WHERE
     date >= CAST('2026-05-20' AS DATE)
     AND date < CAST('2026-05-27' AS DATE)
     AND amount > 100
     AND NOT region IS NULL
   GROUP BY category, region
   ORDER BY "SUM(amount)" DESC
   LIMIT 500000
   ```
   
   ### Stack trace
   
   ```
   ERROR:superset.commands.streaming_export.base:Traceback: Traceback (most recent call last):
     File ".../sqlalchemy/engine/base.py", line 1910, in _execute_context
       self.dialect.do_execute(
     File ".../trino/sqlalchemy/dialect.py", line 442, in do_execute
       cursor.execute(statement, parameters)
     File ".../trino/dbapi.py", line 640, in execute
       self._iterator = iter(self._query.execute())
     File ".../trino/client.py", line 938, in execute
       self._result.rows += self.fetch()
     File ".../trino/client.py", line 958, in fetch
       status = self._request.process(response)
     File ".../trino/client.py", line 727, in process
       raise self._process_error(response["error"], response.get("id"))
   trino.exceptions.TrinoUserError: TrinoUserError(type=USER_ERROR, name=SYNTAX_ERROR,
       message="line 24:13: mismatched input ';'. Expecting: <EOF>", query_id=...)
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/app/superset/commands/streaming_export/base.py", line 225, in csv_generator
       yield from self._execute_query_and_stream(sql, database, limit)
     File "/app/superset/commands/streaming_export/base.py", line 168, in _execute_query_and_stream
       ).execute(text(sql))
     ...
   sqlalchemy.exc.ProgrammingError: (trino.exceptions.TrinoUserError) TrinoUserError(
       type=USER_ERROR, name=SYNTAX_ERROR,
       message="line 24:13: mismatched input ';'. Expecting: <EOF>", query_id=...)
   ```
   
   Trino-side parser stack (from the corresponding query in the Trino UI):
   ```
   io.trino.sql.parser.ParsingException: line 24:13: mismatched input ';'. Expecting: <EOF>
       at io.trino.sql.parser.ErrorHandler.syntaxError(ErrorHandler.java:108)
       ...
       at io.trino.dispatcher.DispatchManager.createQueryInternal(DispatchManager.java:225)
   ```
   
   ### Environment
   
   - **Superset version:** 6.1.0
   - **Database engine:** Trino 480 (`trino-python-client` via SQLAlchemy)
   - **DB connection setting:** `impersonate_user: true`
   - **Python:** 3.10
   - **Deployment:** Helm chart on Kubernetes
   - **Auth:** OAuth2
   
   ### Severity
   
   I'd argue this is release-blocker class for two reasons:
   
   1. **Functional:** every dashboard/chart CSV export against Trino or Presto 
in 6.1.0 is broken, with no in-UI signal of failure (HTTP 200 + sentinel text 
inside the file).
   2. **Security:** missing impersonation may silently bypass per-user 
authorization on deployments that key Trino authz off `X-Trino-User`. Any 
deployment using Ranger / OPA / file-based ACLs / RLS views with Superset + 
Trino should validate before upgrading.
   
   ### Screenshots/recordings
   
   _No response_
   
   ### Superset version
   
   master / latest-dev
   
   ### Python version
   
   3.10
   
   ### Node version
   
   I don't know
   
   ### Browser
   
   Chrome
   
   ### Additional context
   
   _No response_
   
   ### Checklist
   
   - [x] I have searched Superset docs and Slack and didn't find a solution to 
my problem.
   - [x] I have searched the GitHub issue tracker and didn't find a similar bug 
report.
   - [x] I have checked Superset's logs for errors and if I found a relevant 
Python stacktrace, I included it here as text in the "additional context" 
section.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

