kevinjqliu opened a new issue, #178:
URL: https://github.com/apache/datafusion-site/issues/178

   ## Problem
   
   Some blog post URLs redirect through an internal `/output/` path that should 
never be exposed to users.
   
   **Working URLs** (slugs without dots):
   - https://datafusion.apache.org/blog/2026/03/31/writing-table-providers → 
301 → `/blog/2026/03/31/writing-table-providers/` ✅
   
   **Broken URLs** (slugs with version numbers):
   - https://datafusion.apache.org/blog/2026/04/18/datafusion-comet-0.15.0 → 
301 → `/blog/output/2026/04/18/datafusion-comet-0.15.0/` ❌
   - https://datafusion.apache.org/blog/2026/04/02/datafusion-53.0.0 → 301 → 
`/blog/output/2026/04/02/datafusion-53.0.0/` ❌
   - https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0 → 301 → 
`/blog/output/2024/07/24/datafusion-40.0.0/` ❌
   
   Every blog post with a version number in its slug (e.g. 
`datafusion-comet-0.15.0`) is affected. The `/output/` URL still serves the 
page, but it is not the canonical URL and causes subtle issues. For example, 
the giscus CSP was not applied correctly on the `/output/` path (see 
https://github.com/apache/datafusion-site/issues/80#issuecomment-4416405278).
   
   ## Root Cause
   
   There is a mismatch between `publish-site.yml` and `.asf.yaml`:
   
   - `.asf.yaml` on `asf-site` declares `subdir: blog`, telling ASF 
infrastructure to serve content from a `blog/` subdirectory.
   - `publish-site.yml` does **not** set `output: 'blog'`, so the pelican 
action defaults to `output: 'output'`, putting built content into `output/` 
instead of `blog/`.
   
   ```yaml
   # .asf.yaml (expects blog/)
   publish:
     whoami: asf-site
     subdir: blog
   
   # publish-site.yml (produces output/)
   - uses: apache/infrastructure-actions/pelican@main
     with:
       destination: 'asf-site'
       gfm: 'false'
       # output: 'blog'   <-- MISSING
   ```
   
   The staging workflow (`stage-site.yml`) already has `output: 'blog'` and 
works correctly.
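
This kind of mismatch can be caught mechanically. The following is a minimal sketch, assuming the contents of both files are available as text. `yaml_scalar` and `output_matches_subdir` are hypothetical helpers, and the regex lookup is only a stand-in for a real YAML parser. The `'output'` default is the one described above.

```python
import re
from typing import Optional

# Default of the pelican action's `output` input, as described in this issue.
PELICAN_DEFAULT_OUTPUT = "output"


def yaml_scalar(text: str, key: str) -> Optional[str]:
    """Return the first uncommented `key: value` scalar in text, or None.

    Hypothetical helper: a regex stand-in for a real YAML parser.
    """
    pattern = rf"^[ \t]*{re.escape(key)}:[ \t]*['\"]?([^'\"#\s]+)"
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None


def output_matches_subdir(asf_yaml: str, workflow: str) -> bool:
    """True when the workflow's output directory equals .asf.yaml's subdir."""
    subdir = yaml_scalar(asf_yaml, "subdir")
    output = yaml_scalar(workflow, "output") or PELICAN_DEFAULT_OUTPUT
    return subdir == output


ASF_YAML = """\
publish:
  whoami: asf-site
  subdir: blog
"""

# Workflow step as it stands today, without an `output` input.
PUBLISH_WORKFLOW = """\
- uses: apache/infrastructure-actions/pelican@main
  with:
    destination: 'asf-site'
    gfm: 'false'
"""

print(output_matches_subdir(ASF_YAML, PUBLISH_WORKFLOW))  # False
print(output_matches_subdir(ASF_YAML, PUBLISH_WORKFLOW + "    output: 'blog'\n"))  # True
```

A commented-out `# output: 'blog'` line is deliberately ignored, so the check reports a mismatch for the current workflow and passes once the input is set.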
   
   To bridge this mismatch, `.htaccess` on the `asf-site` branch has rewrite 
rules that internally map requests from `blog/` to `output/`:
   
   ```apache
   RewriteCond %{REQUEST_URI} !/output/
   RewriteRule ^(.*)$ output/$1 [L]
   ```
   
   These rules also add a trailing-slash redirect for extensionless URLs, but 
skip URLs that "look like files":
   
   ```apache
   RewriteCond %1 !\.[^./]+$
   ```
   
   The regex `\.[^./]+$` matches a dot followed by one or more characters that 
are neither dots nor slashes, anchored at the end of the URL. It is meant to 
detect file extensions, but it also matches the trailing `.0` in 
version-number slugs like `datafusion-comet-0.15.0`, so the trailing-slash 
redirect is skipped. Apache's `mod_dir` then adds the trailing slash itself, 
but exposes the internal `output/` prefix in the redirect `Location` header.
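
The false positive is easy to reproduce outside Apache. Here is a small sketch using Python's `re` module, which treats this simple pattern the same way as Apache's regex engine. It assumes `%1` holds the request path; `looks_like_file` is a hypothetical name, and `css/site.css` is a made-up path used only for contrast.

```python
import re

# Pattern copied from the RewriteCond quoted above.
LOOKS_LIKE_FILE = re.compile(r"\.[^./]+$")


def looks_like_file(path: str) -> bool:
    """True when the rule would treat path as a file and skip the redirect."""
    return LOOKS_LIKE_FILE.search(path) is not None


paths = [
    "2026/03/31/writing-table-providers",  # no dot: redirect applies
    "2026/04/18/datafusion-comet-0.15.0",  # trailing ".0" matches
    "2024/07/24/datafusion-40.0.0",        # trailing ".0" matches
    "css/site.css",                        # hypothetical real file: matches
]
for path in paths:
    print(f"{path}: {looks_like_file(path)}")
```

The two version-number slugs are indistinguishable from `css/site.css` as far as this pattern is concerned, which is why they fall through to `mod_dir`.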
   
   ## Fix
   
   Add `output: 'blog'` to `publish-site.yml` to match `stage-site.yml`:
   
   ```yaml
   - uses: apache/infrastructure-actions/pelican@main
     with:
       destination: 'asf-site'
       gfm: 'false'
       output: 'blog'
   ```
   
   This puts build output into `blog/` on the `asf-site` branch, matching what 
`.asf.yaml` expects. The `.htaccess` rewrite rules become unnecessary.
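
One possible way to confirm the fix after a deploy is to inspect the redirect target for the affected URLs. This is a rough sketch using only the standard library; `leaks_output_prefix` and `redirect_location` are hypothetical helper names, and the check assumes the leak always shows up as a literal `output` path segment.

```python
import http.client
from typing import Optional
from urllib.parse import urlsplit


def leaks_output_prefix(location: str) -> bool:
    """True if a redirect target contains an `output` path segment."""
    return "output" in urlsplit(location).path.split("/")


def redirect_location(url: str) -> Optional[str]:
    """Send one HEAD request without following redirects.

    Returns the Location header, or None if the response has none.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().getheader("Location")
    finally:
        conn.close()
```

Calling `redirect_location` on one of the broken URLs listed above and passing the result to `leaks_output_prefix` should return `False` once the fix is live.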
   
   ## Follow-up (after deploy)
   
   After the first successful deploy with this fix, a separate PR to the 
`asf-site` branch should:
   
   1. Remove the stale `output/` directory (all content will now be in `blog/`).
   2. Simplify `.htaccess` to remove the rewrite rules, keeping only the CSP 
directive:
   
   ```apache
   SetEnv CSP_PROJECT_DOMAINS "https://giscus.app"
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

