moonming opened a new pull request, #2016: URL: https://github.com/apache/apisix-website/pull/2016
## Summary

Reduce the sitemap from ~5,200 URLs to ~2,700 by filtering out redundant versioned documentation pages, development docs, and low-value pages. Update robots.txt to match.

## Problem

The sitemap includes every versioned doc page across 7 projects x 6 versions (3.10-3.15) + next. For example, `/docs/apisix/getting-started/` (latest) and `/docs/apisix/3.14/getting-started/` (old version) both appear. This wastes crawl budget and causes duplicate content confusion.

Additionally, `/search`, `/blog/tags/`, and `/blog/page/` were being included in the sitemap despite being low-value pages.

## Changes

### 1. Sitemap merge script (`scripts/update-sitemap-loc.js`)

Added URL filtering during post-build sitemap merge. Excludes:

- `/docs/<project>/<version>/` - versioned doc pages
- `/docs/<project>/next/` - unreleased dev docs
- `/search`, `/blog/tags/`, `/blog/page/`

Unversioned latest doc paths (e.g. `/docs/apisix/getting-started/`) are kept.

### 2. robots.txt (`website/static/robots.txt`)

Added Disallow rules for all versioned doc paths, next docs, search, blog tags, and blog pagination across both locales. Ensures robots.txt and sitemap send consistent signals.

## Expected result

- EN sitemap: ~2,638 -> ~1,360 URLs (~48% reduction)
- ZH sitemap: ~2,620 -> ~1,340 URLs (~49% reduction)
- Remaining URLs are high-value: latest docs, blog posts, main pages

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
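For illustration, the exclusion rules listed under "Changes" could be expressed roughly as the filter below. This is a minimal sketch, not the code in `scripts/update-sitemap-loc.js`: the function name `shouldKeepInSitemap`, both regular expressions, the `/zh/` locale prefix, and the assumption that version segments look like `3.14` are all hypothetical choices made for the example.

```javascript
// Hypothetical sketch of the sitemap URL filter; names and patterns are assumptions.

// Matches /docs/<project>/<version>/ and /docs/<project>/next/,
// optionally under an assumed /zh/ locale prefix.
const VERSIONED_OR_NEXT_DOC =
  /^\/(?:zh\/)?docs\/[^/]+\/(?:\d+(?:\.\d+)+|next)(?:\/|$)/;

// Matches /search, /blog/tags/ and /blog/page/, with the same optional prefix.
const LOW_VALUE_PAGE =
  /^\/(?:zh\/)?(?:search|blog\/(?:tags|page))(?:\/|$)/;

// Returns true when a sitemap <loc> URL should stay in the merged sitemap.
function shouldKeepInSitemap(loc) {
  const { pathname } = new URL(loc);
  return !VERSIONED_OR_NEXT_DOC.test(pathname) && !LOW_VALUE_PAGE.test(pathname);
}

// Example: filter a list of candidate URLs before writing the merged sitemap.
const candidates = [
  "https://apisix.apache.org/docs/apisix/getting-started/",
  "https://apisix.apache.org/docs/apisix/3.14/getting-started/",
  "https://apisix.apache.org/docs/apisix/next/getting-started/",
  "https://apisix.apache.org/blog/tags/ecosystem/",
];
const kept = candidates.filter(shouldKeepInSitemap);
console.log(kept);
```

Anchoring the version pattern on a full path segment keeps unversioned pages whose slug merely starts with `next` or a digit (for example a hypothetical `/docs/apisix/next-steps/`) in the sitemap.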
