Re: [PR] feat: add PDF export for Superset chart exports and introduce new PDF… [superset]

via GitHub Mon, 16 Mar 2026 06:02:30 -0700


codeant-ai-for-open-source[bot] commented on code in PR #38677:
URL: https://github.com/apache/superset/pull/38677#discussion_r2940234538



##########
superset/views/core.py:
##########
@@ -229,6 +233,18 @@ def _generate_xlsx(viz_obj: BaseViz) -> FlaskResponse:
         xlsx_data = df_to_excel(df, index=False)
         return XlsxResponse(xlsx_data, 
headers=generate_download_headers("xlsx"))
 
+    @staticmethod
+    def _generate_pdf(viz_obj: BaseViz) -> FlaskResponse:
+        payload = viz_obj.get_df_payload()
+        df = payload.get("df")
+        records = df.to_dict("records") if df is not None else []
+        pdf_data = build_pdf_from_chart_data(records)
+        return Response(
+            pdf_data,
+            headers=generate_download_headers("pdf"),
+            mimetype="application/pdf",
+        )

Review Comment:
   **Suggestion:** The new PDF export path ignores query execution errors and 
still returns a successful PDF (often with empty data) when `get_df_payload` 
reports failures. This masks real query errors and breaks the API contract 
compared with other response types that return a 400 payload on failure. Check 
`viz_obj.has_error(payload)` before building the PDF and return 
`json_error_response` when errors are present. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ Explore PDF export hides query failures as success.
   - ⚠️ Clients miss expected 400 error payload contract.
   - ⚠️ Troubleshooting broken chart queries becomes significantly harder.
   ```
   </details>
   
   ```suggestion
       @staticmethod
       def _generate_pdf(viz_obj: BaseViz) -> FlaskResponse:
           payload = viz_obj.get_df_payload()
           if viz_obj.has_error(payload):
               return json_error_response(payload=payload, status=400)
           df = payload.get("df")
           records = df.to_dict("records") if df is not None else []
           pdf_data = build_pdf_from_chart_data(records)
           return Response(
               pdf_data,
               headers=generate_download_headers("pdf"),
               mimetype="application/pdf",
           )
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. In Explore legacy export flow, frontend sets PDF downloads to
   `/superset/explore_json/?pdf=true` (verified in
   `superset-frontend/src/explore/exploreUtils/index.ts:275-277`, and tests at
   `.../exportChart.test.ts:40`).
   
   2. Request reaches `Superset.explore_json()` at 
`superset/views/core.py:322-337`, which
   recognizes `ChartDataResultFormat.PDF` and routes to `generate_json()`
   (`core.py:192-205`), then `_generate_pdf()` (`core.py:237-246`).
   
   3. Trigger a chart query failure (realistic path: bad query object/invalid 
column), where
   `viz_obj.get_df_payload()` sets `status=FAILED`, populates `errors`, and 
leaves `df=None`
   (`superset/viz.py:95-117`, payload fields at `viz.py:626-637`).
   
   4. `_generate_pdf()` does not check `viz_obj.has_error(payload)` and still 
builds a PDF
   from empty rows; `build_pdf_from_chart_data()` converts empty data into `"No 
data
   available."` content (`superset/utils/pdf.py:195-197`) and returns HTTP 200.
   
   5. This differs from existing error contract in same view: 
`get_raw_results()` returns
   `json_error_response(..., 400)` when `has_error(payload)` is true
   (`superset/views/core.py:170-173`).
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/views/core.py
   **Line:** 236:246
   **Comment:**
        *Logic Error: The new PDF export path ignores query execution errors 
and still returns a successful PDF (often with empty data) when 
`get_df_payload` reports failures. This masks real query errors and breaks the 
API contract compared with other response types that return a 400 payload on 
failure. Check `viz_obj.has_error(payload)` before building the PDF and return 
`json_error_response` when errors are present.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=01343481e07cde56c2ebfb6f28dfe74d66cf132bae20dfe5dd8c28d1af7bee80&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=01343481e07cde56c2ebfb6f28dfe74d66cf132bae20dfe5dd8c28d1af7bee80&reaction=dislike'>👎</a>



##########
superset/utils/pdf.py:
##########
@@ -46,3 +62,249 @@ def build_pdf_from_screenshots(snapshots: list[bytes]) -> 
bytes:
         ) from ex
 
     return new_pdf.read()
+
+
+def _normalize_chart_row_value(value: Any) -> str:
+    """
+    Convert chart row values into printable text for PDF export.
+    """
+    if value is None:
+        return ""
+    if isinstance(value, (list, tuple, dict, set)):
+        value_str = str(value)
+    else:
+        value_str = str(value)
+    value_str = NEWLINE_PATTERN.sub(" ", value_str).replace("\t", "    ")
+    return value_str
+
+
+def _load_table_font() -> Any:
+    """
+    Load a monospaced font when available, with a safe fallback.
+    """
+    font_candidates = [
+        "/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf",
+        "/usr/share/fonts/truetype/liberation/LiberationMono-Regular.ttf",
+        "/usr/share/fonts/dejavu/DejaVuSansMono.ttf",
+    ]
+    for font_path in font_candidates:
+        try:
+            return ImageFont.truetype(font_path, 15)
+        except OSError:
+            continue
+    return ImageFont.load_default()
+
+
+def _measure_text_metrics(font: Any) -> tuple[int, int]:
+    """
+    Return approximate single-character width and line height for the font.
+    """
+    probe = Image.new("RGB", (1, 1), "white")
+    draw = ImageDraw.Draw(probe)
+    char_bbox = draw.textbbox((0, 0), "0", font=font)
+    line_bbox = draw.textbbox((0, 0), "Ag", font=font)
+    char_width = max(char_bbox[2] - char_bbox[0], 1)
+    line_height = max((line_bbox[3] - line_bbox[1]) + 2, 12)
+    return char_width, line_height
+
+
+def _extract_columns(rows: list[dict[str, str]]) -> list[str]:
+    columns: list[str] = []
+    for row in rows:
+        for key in row:
+            if key not in columns:
+                columns.append(key)
+    return columns
+
+
+def _estimate_column_width_chars(
+    columns: list[str], rows: list[dict[str, str]]
+) -> list[int]:
+    widths: list[int] = []
+    for column in columns:
+        max_length = len(column)
+        for row in rows:
+            cell_value = row.get(column, "")
+            max_length = max(max_length, len(cell_value))
+        widths.append(min(max(max_length, MIN_COLUMN_CHARS), MAX_COLUMN_CHARS))
+    return widths
+
+
+def _wrap_cell(value: str, width_chars: int) -> list[str]:
+    wrapped = textwrap.wrap(
+        value,
+        width=width_chars,
+        replace_whitespace=False,
+        drop_whitespace=False,
+        break_long_words=True,
+        break_on_hyphens=False,
+    )
+    return wrapped or [""]
+
+
+def _draw_row_chunk(
+    draw: Any,
+    x_origin: int,
+    y_origin: int,
+    row_cells: list[list[str]],
+    chunk_start: int,
+    chunk_line_count: int,
+    column_widths_px: list[int],
+    font: Any,
+    line_height: int,
+    header: bool = False,
+) -> int:
+    row_height = (chunk_line_count * line_height) + (CELL_PADDING_Y * 2)
+    x_position = x_origin
+    for idx, cell_lines in enumerate(row_cells):
+        cell_width = column_widths_px[idx]
+        draw.rectangle(
+            (
+                x_position,
+                y_origin,
+                x_position + cell_width,
+                y_origin + row_height,
+            ),
+            fill=HEADER_BG_COLOR if header else "white",
+            outline=GRID_COLOR,
+        )
+        line_cursor = y_origin + CELL_PADDING_Y
+        for line_index in range(chunk_start, chunk_start + chunk_line_count):
+            text_line = cell_lines[line_index] if line_index < len(cell_lines) 
else ""
+            draw.text(
+                (x_position + CELL_PADDING_X, line_cursor),
+                text_line,
+                fill=TEXT_COLOR,
+                font=font,
+            )
+            line_cursor += line_height
+        x_position += cell_width
+    return row_height
+
+
+def build_pdf_from_chart_data(rows: list[dict[str, Any]]) -> bytes:  # noqa: 
C901
+    """
+    Build a paginated table PDF document from chart row data.
+    """
+    try:
+        normalized_rows = [
+            {str(key): _normalize_chart_row_value(value) for key, value in 
row.items()}
+            for row in rows
+        ]
+        columns = _extract_columns(normalized_rows)
+        if not columns:
+            columns = ["message"]
+            normalized_rows = [{"message": "No data available."}]
+
+        font = _load_table_font()
+        char_width, line_height = _measure_text_metrics(font)
+        column_widths_chars = _estimate_column_width_chars(columns, 
normalized_rows)
+        column_widths_px = [
+            (width_chars * char_width) + (CELL_PADDING_X * 2)
+            for width_chars in column_widths_chars
+        ]
+        table_width = sum(column_widths_px)
+        page_width = max(DEFAULT_PAGE_WIDTH, (PAGE_MARGIN * 2) + table_width)
+        page_height = DEFAULT_PAGE_HEIGHT

Review Comment:
   **Suggestion:** The computed page width grows directly with column count and 
can become extremely large, causing high memory allocation when creating PIL 
images and potentially crashing workers. Add an upper bound to the allowed page 
width and fail with a controlled error when input exceeds that limit. [possible 
bug]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ Wide chart PDFs can exhaust memory.
   - ❌ Report workers may crash generating PDF_NEW attachments.
   ```
   </details>
   
   ```suggestion
           table_width = sum(column_widths_px)
           required_page_width = (PAGE_MARGIN * 2) + table_width
           max_page_width = DEFAULT_PAGE_WIDTH * 4
           if required_page_width > max_page_width:
               raise ValueError("Too many columns to render safely as PDF")
           page_width = max(DEFAULT_PAGE_WIDTH, required_page_width)
           page_height = DEFAULT_PAGE_HEIGHT
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Trigger chart PDF export through `ChartDataRestApi` 
(`superset/charts/data/api.py:74`,
   PDF handling at `:451` and `:466`) using a result set with many columns.
   
   2. In `build_pdf_from_chart_data()` (`superset/utils/pdf.py:185`), each 
column gets pixel
   width (`:202-205`), then all widths are summed (`:206`) and directly used 
for `page_width`
   (`:207`) without limit.
   
   3. PIL allocates the full canvas with `Image.new("RGB", (page_width, 
page_height),
   "white")` (`superset/utils/pdf.py:221`), so very wide data creates very 
large memory
   allocations.
   
   4. The same path is used by scheduled chart-data PDFs
   (`superset/commands/report/execute.py:446-470` calls chart data URL with
   `result_format=PDF` at `:457`), so worker processes can OOM and fail 
requests/reports.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/utils/pdf.py
   **Line:** 206:208
   **Comment:**
        *Possible Bug: The computed page width grows directly with column count 
and can become extremely large, causing high memory allocation when creating 
PIL images and potentially crashing workers. Add an upper bound to the allowed 
page width and fail with a controlled error when input exceeds that limit.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=8d8e57567c00dd8deb6c2136dc6f519a6e7da594a20bb889eefdc780f4690e03&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=8d8e57567c00dd8deb6c2136dc6f519a6e7da594a20bb889eefdc780f4690e03&reaction=dislike'>👎</a>



##########
superset/commands/report/execute.py:
##########
@@ -436,6 +443,64 @@ def _get_pdf(self) -> bytes:
 
         return pdf
 
+    def _get_chart_data_pdf(self) -> bytes:
+        """
+        Get chart-data based pdf for chart reports.
+        :raises: ReportSchedulePdfFailedError
+        """
+        if not self._report_schedule.chart:
+            raise ReportSchedulePdfFailedError(
+                "Chart data pdf export is only supported for chart reports"
+            )
+
+        start_time = datetime.utcnow()
+        url = self._get_url(result_format=ChartDataResultFormat.PDF)
+        _, username = get_executor(
+            executors=app.config["ALERT_REPORTS_EXECUTORS"],
+            model=self._report_schedule,
+        )
+        user = security_manager.find_user(username)
+        auth_cookies = 
machine_auth_provider_factory.instance.get_auth_cookies(user)
+
+        if self._report_schedule.chart.query_context is None:
+            logger.warning("No query context found, taking a screenshot to 
generate it")
+            self._update_query_context()
+
+        try:
+            chart_data = get_chart_csv_data(chart_url=url, 
auth_cookies=auth_cookies)
+            elapsed_seconds = (datetime.utcnow() - start_time).total_seconds()
+            logger.info(
+                "Chart data PDF generation from %s as user %s took %.2fs - 
execution_id: %s",  # noqa: E501
+                url,
+                username,
+                elapsed_seconds,
+                self._execution_id,
+            )
+        except SoftTimeLimitExceeded as ex:
+            elapsed_seconds = (datetime.utcnow() - start_time).total_seconds()
+            logger.warning(
+                "Chart data PDF generation timeout after %.2fs - execution_id: 
%s",
+                elapsed_seconds,
+                self._execution_id,
+            )
+            raise ReportSchedulePdfFailedError(
+                "A timeout occurred while generating a pdf."
+            ) from ex
+        except Exception as ex:
+            elapsed_seconds = (datetime.utcnow() - start_time).total_seconds()
+            logger.error(
+                "Chart data PDF generation failed after %.2fs - execution_id: 
%s",
+                elapsed_seconds,
+                self._execution_id,
+            )
+            raise ReportSchedulePdfFailedError(
+                f"Failed generating pdf {str(ex)}"
+            ) from ex
+
+        if not chart_data:
+            raise ReportSchedulePdfFailedError()
+        return chart_data

Review Comment:
   **Suggestion:** The chart-data fetch result is returned as PDF bytes without 
validating the payload format. For multi-query charts, the chart-data API 
returns a ZIP, and this code will attach ZIP bytes as a `.pdf`, producing 
corrupted attachments and runtime integration failures in downstream consumers. 
[possible bug]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ Multi-query PDF_NEW emails can attach invalid PDF files.
   - ❌ Webhook uploads mislabeled ZIP bytes as application/pdf.
   - ⚠️ Report run appears successful despite bad attachment.
   ```
   </details>
   
   ```suggestion
               chart_data = get_chart_csv_data(chart_url=url, 
auth_cookies=auth_cookies)
               elapsed_seconds = (datetime.utcnow() - 
start_time).total_seconds()
               logger.info(
                   "Chart data PDF generation from %s as user %s took %.2fs - 
execution_id: %s",  # noqa: E501
                   url,
                   username,
                   elapsed_seconds,
                   self._execution_id,
               )
               ...
               if not chart_data or not chart_data.startswith(b"%PDF-"):
                   raise ReportSchedulePdfFailedError(
                       "Chart data export did not return a valid PDF payload."
                   )
               return chart_data
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Trigger scheduled `PDF_NEW` chart report execution via `reports.execute`
   (`superset/tasks/scheduler.py:57-75`) so code reaches `_get_chart_data_pdf()`
   (`execute.py:446-502`).
   
   2. `_get_chart_data_pdf()` requests chart-data export URL with PDF format
   (`execute.py:457`) and fetches raw bytes using `get_chart_csv_data()`
   (`superset/utils/csv.py:15-29`), which does no content-type validation.
   
   3. For charts producing multiple query results, chart-data API returns a ZIP 
response in
   `_send_chart_response()` (`superset/charts/data/api.py:71-88`) even when 
`result_format ==
   PDF`.
   
   4. Current code only checks non-empty bytes (`execute.py:500-502`) and 
passes ZIP bytes as
   `NotificationContent.pdf`; email/webhook attach them as `.pdf`
   (`superset/reports/notifications/email.py:27-35`, `.../webhook.py:84`), 
yielding invalid
   PDF attachments.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/commands/report/execute.py
   **Line:** 470:502
   **Comment:**
        *Possible Bug: The chart-data fetch result is returned as PDF bytes 
without validating the payload format. For multi-query charts, the chart-data 
API returns a ZIP, and this code will attach ZIP bytes as a `.pdf`, producing 
corrupted attachments and runtime integration failures in downstream consumers.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=792a40c1e2ada82ad10929a0e71f1bc6dce2b44c2e94c967bd1de22daf79c849&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=792a40c1e2ada82ad10929a0e71f1bc6dce2b44c2e94c967bd1de22daf79c849&reaction=dislike'>👎</a>



##########
superset/commands/report/execute.py:
##########
@@ -611,6 +676,13 @@ def _get_notification_content(self) -> 
NotificationContent:  # noqa: C901
                 pdf_data = self._get_pdf()
                 if not pdf_data:
                     error_text = "Unexpected missing pdf"
+            elif self._report_schedule.report_format == 
ReportDataFormat.PDF_NEW:
+                if self._report_schedule.chart:
+                    pdf_data = self._get_chart_data_pdf()
+                    if not pdf_data:
+                        error_text = "Unexpected missing pdf"
+                else:
+                    error_text = "PDF NEW is only supported for chart reports"

Review Comment:
   **Suggestion:** When `PDF_NEW` is configured for a dashboard, the code only 
sets `error_text` and returns normal notification content instead of raising an 
exception. This makes the scheduler mark the run as successful even though 
report generation is invalid, causing silent misreporting of execution state. 
[logic error]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Invalid dashboard PDF_NEW runs recorded as successful.
   - ⚠️ Monitoring misses real report configuration failures.
   - ⚠️ Operators receive error text, not failure state.
   ```
   </details>
   
   ```suggestion
                   else:
                       raise ReportSchedulePdfFailedError(
                           "PDF NEW is only supported for chart reports"
                       )
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Create or update a dashboard report through API `POST/PUT /api/v1/report`
   (`superset/reports/api.py:315-395`) with `report_format=PDF_NEW`; schemas 
accept any enum
   value (`superset/reports/schemas.py:228-231`, `:367-370`) without 
chart/dashboard
   compatibility validation.
   
   2. Scheduler executes it through `reports.execute` 
(`superset/tasks/scheduler.py:57-75`)
   and `send()` (`execute.py:792-799`) calls `_get_notification_content()`
   (`execute.py:653+`).
   
   3. In `PDF_NEW` branch for dashboard (`execute.py:679-685`), code sets 
`error_text`
   instead of raising.
   
   4. Function returns `NotificationContent` with text, `send()` completes, and 
state machine
   marks success (`execute.py:905-907` or `:1050-1051`) despite invalid report 
configuration.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/commands/report/execute.py
   **Line:** 684:685
   **Comment:**
        *Logic Error: When `PDF_NEW` is configured for a dashboard, the code 
only sets `error_text` and returns normal notification content instead of 
raising an exception. This makes the scheduler mark the run as successful even 
though report generation is invalid, causing silent misreporting of execution 
state.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=009665b086b44cd9745a02c1a0772e3165c06a455d467ab5b39ed7ada258040a&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38677&comment_hash=009665b086b44cd9745a02c1a0772e3165c06a455d467ab5b39ed7ada258040a&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: add PDF export for Superset chart exports and introduce new PDF… [superset]

Reply via email to