codeant-ai-for-open-source[bot] commented on code in PR #38677:
URL: https://github.com/apache/superset/pull/38677#discussion_r2940234538
##########
superset/views/core.py:
##########
@@ -229,6 +233,18 @@ def _generate_xlsx(viz_obj: BaseViz) -> FlaskResponse:
xlsx_data = df_to_excel(df, index=False)
return XlsxResponse(xlsx_data,
headers=generate_download_headers("xlsx"))
+ @staticmethod
+ def _generate_pdf(viz_obj: BaseViz) -> FlaskResponse:
+ payload = viz_obj.get_df_payload()
+ df = payload.get("df")
+ records = df.to_dict("records") if df is not None else []
+ pdf_data = build_pdf_from_chart_data(records)
+ return Response(
+ pdf_data,
+ headers=generate_download_headers("pdf"),
+ mimetype="application/pdf",
+ )
Review Comment:
**Suggestion:** The new PDF export path ignores query execution errors and
still returns a successful PDF (often with empty data) when `get_df_payload`
reports failures. This masks real query errors and breaks the API contract
compared with other response types that return a 400 payload on failure. Check
`viz_obj.has_error(payload)` before building the PDF and return
`json_error_response` when errors are present. [logic error]
<details>
<summary><b>Severity Level:</b> Critical 🚨</summary>
```mdx
- ❌ Explore PDF export hides query failures as success.
- ⚠️ Clients miss expected 400 error payload contract.
- ⚠️ Troubleshooting broken chart queries becomes significantly harder.
```
</details>
```suggestion
@staticmethod
def _generate_pdf(viz_obj: BaseViz) -> FlaskResponse:
payload = viz_obj.get_df_payload()
if viz_obj.has_error(payload):
return json_error_response(payload=payload, status=400)
df = payload.get("df")
records = df.to_dict("records") if df is not None else []
pdf_data = build_pdf_from_chart_data(records)
return Response(
pdf_data,
headers=generate_download_headers("pdf"),
mimetype="application/pdf",
)
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. In Explore legacy export flow, frontend sets PDF downloads to
`/superset/explore_json/?pdf=true` (verified in
`superset-frontend/src/explore/exploreUtils/index.ts:275-277`, and tests at
`.../exportChart.test.ts:40`).
2. Request reaches `Superset.explore_json()` at
`superset/views/core.py:322-337`, which
recognizes `ChartDataResultFormat.PDF` and routes to `generate_json()`
(`core.py:192-205`), then `_generate_pdf()` (`core.py:237-246`).
3. Trigger a chart query failure (realistic path: bad query object/invalid
column), where
`viz_obj.get_df_payload()` sets `status=FAILED`, populates `errors`, and
leaves `df=None`
(`superset/viz.py:95-117`, payload fields at `viz.py:626-637`).
4. `_generate_pdf()` does not check `viz_obj.has_error(payload)` and still
builds a PDF
from empty rows; `build_pdf_from_chart_data()` converts empty data into `"No
data
available."` content (`superset/utils/pdf.py:195-197`) and returns HTTP 200.
5. This differs from existing error contract in same view:
`get_raw_results()` returns
`json_error_response(..., 400)` when `has_error(payload)` is true
(`superset/views/core.py:170-173`).
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/views/core.py
**Line:** 236:246
**Comment:**
*Logic Error: The new PDF export path ignores query execution errors
and still returns a successful PDF (often with empty data) when
`get_df_payload` reports failures. This masks real query errors and breaks the
API contract compared with other response types that return a 400 payload on
failure. Check `viz_obj.has_error(payload)` before building the PDF and return
`json_error_response` when errors are present.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=01343481e07cde56c2ebfb6f28dfe74d66cf132bae20dfe5dd8c28d1af7bee80&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=01343481e07cde56c2ebfb6f28dfe74d66cf132bae20dfe5dd8c28d1af7bee80&reaction=dislike'>👎</a>
##########
superset/utils/pdf.py:
##########
@@ -46,3 +62,249 @@ def build_pdf_from_screenshots(snapshots: list[bytes]) ->
bytes:
) from ex
return new_pdf.read()
+
+
+def _normalize_chart_row_value(value: Any) -> str:
+ """
+ Convert chart row values into printable text for PDF export.
+ """
+ if value is None:
+ return ""
+ if isinstance(value, (list, tuple, dict, set)):
+ value_str = str(value)
+ else:
+ value_str = str(value)
+ value_str = NEWLINE_PATTERN.sub(" ", value_str).replace("\t", " ")
+ return value_str
+
+
+def _load_table_font() -> Any:
+ """
+ Load a monospaced font when available, with a safe fallback.
+ """
+ font_candidates = [
+ "/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf",
+ "/usr/share/fonts/truetype/liberation/LiberationMono-Regular.ttf",
+ "/usr/share/fonts/dejavu/DejaVuSansMono.ttf",
+ ]
+ for font_path in font_candidates:
+ try:
+ return ImageFont.truetype(font_path, 15)
+ except OSError:
+ continue
+ return ImageFont.load_default()
+
+
+def _measure_text_metrics(font: Any) -> tuple[int, int]:
+ """
+ Return approximate single-character width and line height for the font.
+ """
+ probe = Image.new("RGB", (1, 1), "white")
+ draw = ImageDraw.Draw(probe)
+ char_bbox = draw.textbbox((0, 0), "0", font=font)
+ line_bbox = draw.textbbox((0, 0), "Ag", font=font)
+ char_width = max(char_bbox[2] - char_bbox[0], 1)
+ line_height = max((line_bbox[3] - line_bbox[1]) + 2, 12)
+ return char_width, line_height
+
+
+def _extract_columns(rows: list[dict[str, str]]) -> list[str]:
+ columns: list[str] = []
+ for row in rows:
+ for key in row:
+ if key not in columns:
+ columns.append(key)
+ return columns
+
+
+def _estimate_column_width_chars(
+ columns: list[str], rows: list[dict[str, str]]
+) -> list[int]:
+ widths: list[int] = []
+ for column in columns:
+ max_length = len(column)
+ for row in rows:
+ cell_value = row.get(column, "")
+ max_length = max(max_length, len(cell_value))
+ widths.append(min(max(max_length, MIN_COLUMN_CHARS), MAX_COLUMN_CHARS))
+ return widths
+
+
+def _wrap_cell(value: str, width_chars: int) -> list[str]:
+ wrapped = textwrap.wrap(
+ value,
+ width=width_chars,
+ replace_whitespace=False,
+ drop_whitespace=False,
+ break_long_words=True,
+ break_on_hyphens=False,
+ )
+ return wrapped or [""]
+
+
+def _draw_row_chunk(
+ draw: Any,
+ x_origin: int,
+ y_origin: int,
+ row_cells: list[list[str]],
+ chunk_start: int,
+ chunk_line_count: int,
+ column_widths_px: list[int],
+ font: Any,
+ line_height: int,
+ header: bool = False,
+) -> int:
+ row_height = (chunk_line_count * line_height) + (CELL_PADDING_Y * 2)
+ x_position = x_origin
+ for idx, cell_lines in enumerate(row_cells):
+ cell_width = column_widths_px[idx]
+ draw.rectangle(
+ (
+ x_position,
+ y_origin,
+ x_position + cell_width,
+ y_origin + row_height,
+ ),
+ fill=HEADER_BG_COLOR if header else "white",
+ outline=GRID_COLOR,
+ )
+ line_cursor = y_origin + CELL_PADDING_Y
+ for line_index in range(chunk_start, chunk_start + chunk_line_count):
+ text_line = cell_lines[line_index] if line_index < len(cell_lines)
else ""
+ draw.text(
+ (x_position + CELL_PADDING_X, line_cursor),
+ text_line,
+ fill=TEXT_COLOR,
+ font=font,
+ )
+ line_cursor += line_height
+ x_position += cell_width
+ return row_height
+
+
+def build_pdf_from_chart_data(rows: list[dict[str, Any]]) -> bytes: # noqa:
C901
+ """
+ Build a paginated table PDF document from chart row data.
+ """
+ try:
+ normalized_rows = [
+ {str(key): _normalize_chart_row_value(value) for key, value in
row.items()}
+ for row in rows
+ ]
+ columns = _extract_columns(normalized_rows)
+ if not columns:
+ columns = ["message"]
+ normalized_rows = [{"message": "No data available."}]
+
+ font = _load_table_font()
+ char_width, line_height = _measure_text_metrics(font)
+ column_widths_chars = _estimate_column_width_chars(columns,
normalized_rows)
+ column_widths_px = [
+ (width_chars * char_width) + (CELL_PADDING_X * 2)
+ for width_chars in column_widths_chars
+ ]
+ table_width = sum(column_widths_px)
+ page_width = max(DEFAULT_PAGE_WIDTH, (PAGE_MARGIN * 2) + table_width)
+ page_height = DEFAULT_PAGE_HEIGHT
Review Comment:
**Suggestion:** The computed page width grows directly with column count and
can become extremely large, causing high memory allocation when creating PIL
images and potentially crashing workers. Add an upper bound to the allowed page
width and fail with a controlled error when input exceeds that limit. [possible
bug]
<details>
<summary><b>Severity Level:</b> Critical 🚨</summary>
```mdx
- ❌ Wide chart PDFs can exhaust memory.
- ❌ Report workers may crash generating PDF_NEW attachments.
```
</details>
```suggestion
table_width = sum(column_widths_px)
required_page_width = (PAGE_MARGIN * 2) + table_width
max_page_width = DEFAULT_PAGE_WIDTH * 4
if required_page_width > max_page_width:
raise ValueError("Too many columns to render safely as PDF")
page_width = max(DEFAULT_PAGE_WIDTH, required_page_width)
page_height = DEFAULT_PAGE_HEIGHT
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Trigger chart PDF export through `ChartDataRestApi`
(`superset/charts/data/api.py:74`,
PDF handling at `:451` and `:466`) using a result set with many columns.
2. In `build_pdf_from_chart_data()` (`superset/utils/pdf.py:185`), each
column gets pixel
width (`:202-205`), then all widths are summed (`:206`) and directly used
for `page_width`
(`:207`) without limit.
3. PIL allocates the full canvas with `Image.new("RGB", (page_width,
page_height),
"white")` (`superset/utils/pdf.py:221`), so very wide data creates very
large memory
allocations.
4. The same path is used by scheduled chart-data PDFs
(`superset/commands/report/execute.py:446-470` calls chart data URL with
`result_format=PDF` at `:457`), so worker processes can OOM and fail
requests/reports.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/pdf.py
**Line:** 206:208
**Comment:**
*Possible Bug: The computed page width grows directly with column count
and can become extremely large, causing high memory allocation when creating
PIL images and potentially crashing workers. Add an upper bound to the allowed
page width and fail with a controlled error when input exceeds that limit.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=8d8e57567c00dd8deb6c2136dc6f519a6e7da594a20bb889eefdc780f4690e03&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=8d8e57567c00dd8deb6c2136dc6f519a6e7da594a20bb889eefdc780f4690e03&reaction=dislike'>👎</a>
##########
superset/commands/report/execute.py:
##########
@@ -436,6 +443,64 @@ def _get_pdf(self) -> bytes:
return pdf
+ def _get_chart_data_pdf(self) -> bytes:
+ """
+ Get chart-data based pdf for chart reports.
+ :raises: ReportSchedulePdfFailedError
+ """
+ if not self._report_schedule.chart:
+ raise ReportSchedulePdfFailedError(
+ "Chart data pdf export is only supported for chart reports"
+ )
+
+ start_time = datetime.utcnow()
+ url = self._get_url(result_format=ChartDataResultFormat.PDF)
+ _, username = get_executor(
+ executors=app.config["ALERT_REPORTS_EXECUTORS"],
+ model=self._report_schedule,
+ )
+ user = security_manager.find_user(username)
+ auth_cookies =
machine_auth_provider_factory.instance.get_auth_cookies(user)
+
+ if self._report_schedule.chart.query_context is None:
+ logger.warning("No query context found, taking a screenshot to
generate it")
+ self._update_query_context()
+
+ try:
+ chart_data = get_chart_csv_data(chart_url=url,
auth_cookies=auth_cookies)
+ elapsed_seconds = (datetime.utcnow() - start_time).total_seconds()
+ logger.info(
+ "Chart data PDF generation from %s as user %s took %.2fs -
execution_id: %s", # noqa: E501
+ url,
+ username,
+ elapsed_seconds,
+ self._execution_id,
+ )
+ except SoftTimeLimitExceeded as ex:
+ elapsed_seconds = (datetime.utcnow() - start_time).total_seconds()
+ logger.warning(
+ "Chart data PDF generation timeout after %.2fs - execution_id:
%s",
+ elapsed_seconds,
+ self._execution_id,
+ )
+ raise ReportSchedulePdfFailedError(
+ "A timeout occurred while generating a pdf."
+ ) from ex
+ except Exception as ex:
+ elapsed_seconds = (datetime.utcnow() - start_time).total_seconds()
+ logger.error(
+ "Chart data PDF generation failed after %.2fs - execution_id:
%s",
+ elapsed_seconds,
+ self._execution_id,
+ )
+ raise ReportSchedulePdfFailedError(
+ f"Failed generating pdf {str(ex)}"
+ ) from ex
+
+ if not chart_data:
+ raise ReportSchedulePdfFailedError()
+ return chart_data
Review Comment:
**Suggestion:** The chart-data fetch result is returned as PDF bytes without
validating the payload format. For multi-query charts, the chart-data API
returns a ZIP, and this code will attach ZIP bytes as a `.pdf`, producing
corrupted attachments and runtime integration failures in downstream consumers.
[possible bug]
<details>
<summary><b>Severity Level:</b> Critical 🚨</summary>
```mdx
- ❌ Multi-query PDF_NEW emails can attach invalid PDF files.
- ❌ Webhook uploads mislabeled ZIP bytes as application/pdf.
- ⚠️ Report run appears successful despite bad attachment.
```
</details>
```suggestion
chart_data = get_chart_csv_data(chart_url=url,
auth_cookies=auth_cookies)
elapsed_seconds = (datetime.utcnow() -
start_time).total_seconds()
logger.info(
"Chart data PDF generation from %s as user %s took %.2fs -
execution_id: %s", # noqa: E501
url,
username,
elapsed_seconds,
self._execution_id,
)
...
if not chart_data or not chart_data.startswith(b"%PDF-"):
raise ReportSchedulePdfFailedError(
"Chart data export did not return a valid PDF payload."
)
return chart_data
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Trigger scheduled `PDF_NEW` chart report execution via `reports.execute`
(`superset/tasks/scheduler.py:57-75`) so code reaches `_get_chart_data_pdf()`
(`execute.py:446-502`).
2. `_get_chart_data_pdf()` requests chart-data export URL with PDF format
(`execute.py:457`) and fetches raw bytes using `get_chart_csv_data()`
(`superset/utils/csv.py:15-29`), which does no content-type validation.
3. For charts producing multiple query results, chart-data API returns a ZIP
response in
`_send_chart_response()` (`superset/charts/data/api.py:71-88`) even when
`result_format ==
PDF`.
4. Current code only checks non-empty bytes (`execute.py:500-502`) and
passes ZIP bytes as
`NotificationContent.pdf`; email/webhook attach them as `.pdf`
(`superset/reports/notifications/email.py:27-35`, `.../webhook.py:84`),
yielding invalid
PDF attachments.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/report/execute.py
**Line:** 470:502
**Comment:**
*Possible Bug: The chart-data fetch result is returned as PDF bytes
without validating the payload format. For multi-query charts, the chart-data
API returns a ZIP, and this code will attach ZIP bytes as a `.pdf`, producing
corrupted attachments and runtime integration failures in downstream consumers.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=792a40c1e2ada82ad10929a0e71f1bc6dce2b44c2e94c967bd1de22daf79c849&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=792a40c1e2ada82ad10929a0e71f1bc6dce2b44c2e94c967bd1de22daf79c849&reaction=dislike'>👎</a>
##########
superset/commands/report/execute.py:
##########
@@ -611,6 +676,13 @@ def _get_notification_content(self) ->
NotificationContent: # noqa: C901
pdf_data = self._get_pdf()
if not pdf_data:
error_text = "Unexpected missing pdf"
+ elif self._report_schedule.report_format ==
ReportDataFormat.PDF_NEW:
+ if self._report_schedule.chart:
+ pdf_data = self._get_chart_data_pdf()
+ if not pdf_data:
+ error_text = "Unexpected missing pdf"
+ else:
+ error_text = "PDF NEW is only supported for chart reports"
Review Comment:
**Suggestion:** When `PDF_NEW` is configured for a dashboard, the code only
sets `error_text` and returns normal notification content instead of raising an
exception. This makes the scheduler mark the run as successful even though
report generation is invalid, causing silent misreporting of execution state.
[logic error]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ Invalid dashboard PDF_NEW runs recorded as successful.
- ⚠️ Monitoring misses real report configuration failures.
- ⚠️ Operators receive error text, not failure state.
```
</details>
```suggestion
else:
raise ReportSchedulePdfFailedError(
"PDF NEW is only supported for chart reports"
)
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Create or update a dashboard report through API `POST/PUT /api/v1/report`
(`superset/reports/api.py:315-395`) with `report_format=PDF_NEW`; schemas
accept any enum
value (`superset/reports/schemas.py:228-231`, `:367-370`) without
chart/dashboard
compatibility validation.
2. Scheduler executes it through `reports.execute`
(`superset/tasks/scheduler.py:57-75`)
and `send()` (`execute.py:792-799`) calls `_get_notification_content()`
(`execute.py:653+`).
3. In `PDF_NEW` branch for dashboard (`execute.py:679-685`), code sets
`error_text`
instead of raising.
4. Function returns `NotificationContent` with text, `send()` completes, and
state machine
marks success (`execute.py:905-907` or `:1050-1051`) despite invalid report
configuration.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/commands/report/execute.py
**Line:** 684:685
**Comment:**
*Logic Error: When `PDF_NEW` is configured for a dashboard, the code
only sets `error_text` and returns normal notification content instead of
raising an exception. This makes the scheduler mark the run as successful even
though report generation is invalid, causing silent misreporting of execution
state.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=009665b086b44cd9745a02c1a0772e3165c06a455d467ab5b39ed7ada258040a&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=009665b086b44cd9745a02c1a0772e3165c06a455d467ab5b39ed7ada258040a&reaction=dislike'>👎</a>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]