rzo1 commented on code in PR #1714:
URL: https://github.com/apache/stormcrawler/pull/1714#discussion_r2629808937
##########
docs/src/main/asciidoc/overview.adoc:
##########
@@ -0,0 +1,52 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Overview
+
+Apache StormCrawler is an open-source collection of resources for building low-latency, scalable web crawlers on link:http://storm.apache.org/[Apache Storm]. It is provided under the link:http://www.apache.org/licenses/LICENSE-2.0[Apache License] and is written mostly in Java.
+
+The aims of StormCrawler are to help build web crawlers that are:
+
+* Scalable
+* Low latency
+* Easy to extend
+* Polite yet efficient
+
+StormCrawler is both a library and a collection of reusable components designed to help developers build custom web crawlers with ease.
+Getting started is simple — the Maven archetypes allow you to quickly scaffold a new project, which you can then adapt to fit your specific needs.
+
+In addition to its core modules, StormCrawler offers a range of external resources that can be easily integrated into your project.
+These include spouts and bolts for OpenSearch, a ParserBolt that leverages Apache Tika to handle various document formats, and many more.
+
+StormCrawler is well-suited for scenarios where URLs to fetch and parse arrive as continuous streams, but it also performs exceptionally well in large-scale, recursive crawls where low latency is essential.
+The project is actively maintained, widely adopted in production environments, and supported by an engaged community.
+
+You can find links to recent talks and demos later in this document, showcasing real-world applications and use cases.
+
+== Key Features
+
+Here is a short list of provided features:
+
+* Integration with link:https://github.com/crawler-commons/url-frontier[URLFrontier] for distributed URL management
+* Pluggable components (Spouts and Bolts from link:https://storm.apache.org/[Apache Storm]) for flexibility and modularity — adding custom components is straightforward
+* Support for link:https://tika.apache.org/[Apache Tika] for document parsing via `ParserBolt`
+* Integration with link:https://opensearch.org/[OpenSearch] and link:https://solr.apache.org/[Apache Solr] for indexing and status storage
+* Option to store crawled data as WARC (Web ARChive) files
+* Support for headless crawling using link:https://playwright.dev/[Playwright]
+* Support for LLM-based advanced text extraction
+* Proxy support for distributed and controlled crawling
+* Flexible and pluggable filtering mechanisms:
+** URL Filters for pre-fetch filtering
+** Parse Filters for post-fetch content filtering
+* Built-in support for crawl metrics and monitoring
+* Configurable politeness policies (e.g., crawl delay, user agent management)
+* Robust HTTP fetcher based on link:https://hc.apache.org/[Apache HttpComponents] or link:https://square.github.io/okhttp/[OkHttp]
+* MIME type detection and response-based filtering
+* Support for parsing and honoring `robots.txt` and sitemaps
+* Stream-based, real-time architecture using link:https://storm.apache.org/[Apache Storm] — suitable for both recursive and one-shot crawling tasks
+* Can run in both local and distributed environments
+* Apache Maven archetypes for quickly bootstrapping new crawler projects
+* Actively developed and used in production by link:powered-by.adoc[multiple organizations]

Review Comment:
   Thx. I think several links do not resolve at the moment. I will have a look, as it seems to be configuration-related.
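As a pointer for anyone landing on this thread: the archetype bootstrap the overview refers to boils down to a single Maven command. The coordinates and version below are assumptions inferred from the project's group id, not taken from this PR; check the StormCrawler site for the current release:

```console
# Coordinates and version are illustrative, not verified against the latest release
mvn archetype:generate \
  -DarchetypeGroupId=org.apache.stormcrawler \
  -DarchetypeArtifactId=stormcrawler-archetype \
  -DarchetypeVersion=3.4.0
```

Run interactively, this prompts for the usual groupId/artifactId of the new crawler project, which then contains a ready-to-edit topology and configuration.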
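The "pluggable Spouts and Bolts" point is easiest to see in code. Here is a minimal sketch of a topology wiring stock components together; the class and package names mirror the `org.apache.stormcrawler` modules but are assumptions here and should be checked against what the archetype generates:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;
import org.apache.stormcrawler.ConfigurableTopology;
import org.apache.stormcrawler.bolt.FetcherBolt;
import org.apache.stormcrawler.bolt.JSoupParserBolt;
import org.apache.stormcrawler.bolt.URLPartitionerBolt;
import org.apache.stormcrawler.indexing.StdOutIndexer;
import org.apache.stormcrawler.spout.MemorySpout;

// Minimal crawl topology: seed URLs -> partition by host -> fetch -> parse -> index.
public class MinimalCrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) {
        ConfigurableTopology.start(new MinimalCrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // In-memory seed list; swap in e.g. a URLFrontier or OpenSearch spout
        // for distributed URL management.
        builder.setSpout("spout",
                new MemorySpout("https://stormcrawler.apache.org/"));

        // Assigns a partition key (typically the host) to each URL.
        builder.setBolt("partitioner", new URLPartitionerBolt())
                .shuffleGrouping("spout");

        // Grouping on "key" keeps all URLs of a host on the same fetcher task,
        // so per-host politeness (e.g. crawl delay) can be enforced.
        builder.setBolt("fetcher", new FetcherBolt())
                .fieldsGrouping("partitioner", new Fields("key"));

        // HTML parsing and link extraction; the Tika-based ParserBolt could be
        // used instead to handle other document formats.
        builder.setBolt("parser", new JSoupParserBolt())
                .localOrShuffleGrouping("fetcher");

        // Dummy indexer printing to stdout; replace with OpenSearch/Solr bolts.
        builder.setBolt("indexer", new StdOutIndexer())
                .localOrShuffleGrouping("parser");

        return submit("crawl", conf, builder);
    }
}
```

The fields grouping on the partitioner's key is the design point worth calling out: it is what lets the fetcher apply politeness settings per host while the rest of the pipeline scales out freely.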
