sebastian-nagel commented on code in PR #1714: URL: https://github.com/apache/stormcrawler/pull/1714#discussion_r2624470216
########## docs/src/main/asciidoc/powered-by.adoc: ##########
@@ -0,0 +1,39 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Companies & Projects Using StormCrawler
+
+Apache StormCrawler has been adopted by a wide variety of organizations across industries, from startups to large enterprises and research institutions.
+The following is a non-exhaustive list of companies, projects, and institutions that have used Apache StormCrawler in production or research.
+If your organization is also making use of Apache StormCrawler, we’d love to hear from you!
+
+* link:http://www.careerbuilder.com/[CareerBuilder]
+* link:http://www.stolencamerafinder.com/[StolenCameraFinder]
+* link:http://www.weborama.com/[Weborama]
+* link:http://www.ontopic.io/[Ontopic]
+* link:http://www.shopstyle.com/[ShopStyle]
+* link:http://www.wombatsoftware.de/[Wombat Software]
+* link:http://commoncrawl.org/2016/10/news-dataset-available/[CommonCrawl]
+* link:https://webfinery.com/[WebFinery]
+* link:http://www.reportlinker.com/[ReportLinker]
+* link:http://www.tokenmill.lt/[TokenMill]
+* link:http://www.polecat.com/[Polecat]
+* link:http://www.wizenoze.com/en/[WizeNoze]
+* link:http://iproduct.io/[IProduct.io]
+* link:https://www.cgi.com/[CGI]
+* link:https://github.com/miras-tech/MirasText[MirasText]
+* link:https://www.g2webservices.com/[G2 Web Services]
+* link:https://www.gov.nt.ca/[Government of Northwest Territories]
+* link:https://digitalpebble.blogspot.com/2019/02/meet-stormcrawler-users-q-with-pixray.html[Pixray]
+* link:https://www.cameraforensics.com/[CameraForensics]
+* link:https://gagepiracy.com/[Gage Piracy]
+* link:https://www.clarin.eu/[Clarin ERIC]
+* link:https://openwebsearch.eu/owler/[OpenWebSearch]
+* link:https://shc-info.zml.hs-heilbronn.de/[Heilbronn University]
+* link:https://www.contexity.com[Contexity]
+* link:https://https://www.kodis.iao.fraunhofer.de/de/projekte/SPIDERWISE.html[Fraunhofer IAO - KODIS]

Review Comment:
   URL syntax error.

########## docs/src/main/asciidoc/overview.adoc: ##########
@@ -0,0 +1,52 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Overview
+
+Apache StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on link:http://storm.apache.org/[Apache Storm]. It is provided under the link:http://www.apache.org/licenses/LICENSE-2.0[Apache License] and is written mostly in Java.

Review Comment:
   Might upgrade `http:` links to `https:`.

########## docs/src/main/asciidoc/overview.adoc: ##########
@@ -0,0 +1,52 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Overview
+
+Apache StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on link:http://storm.apache.org/[Apache Storm]. It is provided under the link:http://www.apache.org/licenses/LICENSE-2.0[Apache License] and is written mostly in Java.
+
+The aims of StormCrawler are to help build web crawlers that are:
+
+* Scalable
+* Low latency
+* Easy to extend
+* Polite yet efficient
+
+StormCrawler is both a library and a collection of reusable components designed to help developers build custom web crawlers with ease.
+Getting started is simple — the Maven archetypes allow you to quickly scaffold a new project, which you can then adapt to fit your specific needs.
+
+In addition to its core modules, StormCrawler offers a range of external resources that can be easily integrated into your project.
+These include spouts and bolts for OpenSearch, a ParserBolt that leverages Apache Tika to handle various document formats, and many more.
+
+StormCrawler is well-suited for scenarios where URLs to fetch and parse arrive as continuous streams, but it also performs exceptionally well in large-scale, recursive crawls where low latency is essential.
+The project is actively maintained, widely adopted in production environments, and supported by an engaged community.
+
+You can find links to recent talks and demos later in this document, showcasing real-world applications and use cases.
+
+== Key Features
+
+Here is a short list of provided features:
+
+* Integration with link:https://github.com/crawler-commons/url-frontier[URLFrontier] for distributed URL management
+* Pluggable components (Spouts and Bolts from link:https://storm.apache.org/[Apache Storm]) for flexibility and modularity — adding custom components is straightforward
+* Support for link:https://tika.apache.org/[Apache Tika] for document parsing via `ParserBolt`
+* Integration with link:https://opensearch.org/[OpenSearch] and link:https://solr.apache.org/[Apache Solr] for indexing and status storage
+* Option to store crawled data as WARC (Web ARChive) files
+* Support for headless crawling using link:https://playwright.dev/[Playwright]
+* Support for LLM-based advanced text extraction
+* Proxy support for distributed and controlled crawling
+* Flexible and pluggable filtering mechanisms:
+** URL Filters for pre-fetch filtering
+** Parse Filters for post-fetch content filtering
+* Built-in support for crawl metrics and monitoring
+* Configurable politeness policies (e.g., crawl delay, user agent management)
+* Robust HTTP fetcher based on link:https://hc.apache.org/[Apache HttpComponents] or link:https://square.github.io/okhttp/[OkHttp]
+* MIME type detection and response-based filtering
+* Support for parsing and honoring `robots.txt` and sitemaps
+* Stream-based, real-time architecture using link:https://storm.apache.org/[Apache Storm] — suitable for both recursive and one-shot crawling tasks
+* Can run in both local and distributed environments
+* Apache Maven archetypes for quickly bootstrapping new crawler projects
+* Actively developed and used in production by link:powered-by.adoc[multiple organizations]

Review Comment:
   In the live deployment the link does not resolve.
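[Editor's note, not part of the patch: to make the feature list above a little more concrete, here is a minimal configuration sketch showing how a few of these capabilities are typically switched on. The key names and the okhttp protocol class are assumptions based on `crawler-default.yaml` as I recall it; verify them against the defaults file shipped with your StormCrawler version.]

[source,yaml]
----
# Illustrative sketch only; key names assumed from crawler-default.yaml.

# Choose the HTTP fetcher implementation (okhttp-based protocol here).
http.protocol.implementation: "org.apache.stormcrawler.protocol.okhttp.HttpProtocol"
https.protocol.implementation: "org.apache.stormcrawler.protocol.okhttp.HttpProtocol"

# Politeness: number of fetcher threads and delay (in seconds) between
# successive requests to the same host.
fetcher.threads.number: 10
fetcher.server.delay: 1.0
----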
########## docs/src/main/asciidoc/configuration.adoc: ##########
@@ -0,0 +1,316 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Configuration
+
+=== User Agent Configuration
+
+Crawlers should always act responsibly and ethically when accessing websites. A key aspect of this is properly identifying themselves through the `User-Agent` header. When a crawler provides a clear and accurate user agent string, webmasters can understand who is visiting their site and why, and can apply rules in robots.txt accordingly. Respecting these rules, avoiding excessive request rates, and honoring content restrictions not only supports legal compliance but also maintains a healthy relationship with the web community.
+Transparent identification is a fundamental part of ethical web crawling.
+
+The configuration of the link:https://www.w3.org/WAI/UA/work/wiki/Definition_of_User_Agent[user agent] in StormCrawler has two purposes:
+
+. Identification of the crawler for webmasters
+. Selection of rules from robots.txt
+
+==== Crawler Identification
+
+The politeness of a web crawler is not limited to how frequently it fetches pages from a site; it also shows in how it identifies itself to the sites it crawls. This is done by setting the HTTP header `User-Agent`, just like link:https://www.whatismybrowser.com/detect/what-is-my-user-agent/[your web browser does].
+
+The full user agent string is built by concatenating the following configuration elements:
+
+* `http.agent.name`: name of your crawler
+* `http.agent.version`: version of your crawler
+* `http.agent.description`: description of what it does
+* `http.agent.url`: URL webmasters can visit to learn more about it
+* `http.agent.email`: an email address so that they can get in touch with you
+
+StormCrawler used to provide default values for these, but since version 2.11 it no longer does, and you are now required to provide them yourself.
+
+You can specify the user agent verbatim with the config `http.agent`, but you will still need to provide an `http.agent.name` for parsing robots.txt files.
+
+==== Robots Exclusion Protocol
+
+This is also known as the robots.txt protocol; it is formalised in link:https://www.rfc-editor.org/rfc/rfc9309.html[RFC 9309]. Among other things, the robots directives define rules that specify which parts of a website (if any) are allowed to be crawled. The rules are organised by `User-Agent`, with a `*` to match any agent not otherwise specified explicitly, e.g.:
+
+----
+User-Agent: *
+Disallow: *.gif$
+Disallow: /example/
+Allow: /publications/
+----
+
+In the example above the rule allows access to the URLs with the _/publications/_ path prefix, and it restricts access to the URLs with the _/example/_ path prefix and to all URLs with a _.gif_ suffix. The `"*"` character designates any character, including the otherwise-required forward slash.
+
+The value of `http.agent.name` is what StormCrawler looks for in the robots.txt. It MUST contain only uppercase and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-").
+
+Unless you are running a well-known web crawler, it is unlikely that its agent name will be listed explicitly in the robots.txt (if it is, well, congratulations!). While you want the agent name value to reflect who your crawler is, you might want to follow rules set for better-known crawlers. For instance, if you were a responsible AI company crawling the web to build a dataset to train LLMs, you would want to follow the rules set for `Google-Extended` (see link:https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers[list of Google crawlers]) if any were found.
+
+This is what the configuration `http.robots.agents` allows you to do. It is a comma-separated string but can also take a list of values. By setting it alongside `http.agent.name` (which should also be the first value it contains), you can broaden which rules are matched, reflecting the purpose of your crawler as well as its identity.
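[Editor's note, not part of the patch: as a quick illustration of the two purposes described above, a sketch of the relevant settings might look like the following. The agent values are made up, and the `Google-Extended` entry only mirrors the LLM example in the text.]

[source,yaml]
----
# Identification: these values are concatenated into the User-Agent header.
http.agent.name: "mycrawler"
http.agent.version: "1.0"
http.agent.description: "research crawler for example.org"
http.agent.url: "https://www.example.org/crawler"
http.agent.email: "crawler@example.org"

# Robots.txt rule selection: the crawler's own name first, then any
# better-known agent whose rules it also wants to honour.
http.robots.agents: "mycrawler,Google-Extended"
----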
+
+=== Proxy
+
+StormCrawler's proxy system is built on top of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SCProxy.java[SCProxy] class and the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/ProxyManager.java[ProxyManager] interface. Every proxy used in the system is represented as an **SCProxy**. The **ProxyManager** implementations handle the management and delegation of their internal proxies. When link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/protocol/Protocol.html#getProtocolOutput-java.lang.String-org.apache.stormcrawler.Metadata-[HTTPProtocol.getProtocolOutput()] is called, link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/proxy/ProxyManager.html#getProxy[ProxyManager.getProxy()] is invoked to retrieve a proxy for the individual request.
+
+The **ProxyManager** interface can be implemented in a custom class to provide custom logic for proxy management and load balancing. The default **ProxyManager** implementation is **SingleProxyManager**, which ensures backwards compatibility with prior StormCrawler releases. To use **MultiProxyManager** or a custom implementation, pass the fully qualified class name via the config parameter `http.proxy.manager`:
+
+----
+http.proxy.manager: "org.apache.stormcrawler.proxy.MultiProxyManager"
+----
+
+StormCrawler implements two **ProxyManager** classes by default:
+
+* link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/SingleProxyManager.java[SingleProxyManager]
+Manages a single proxy passed via the backwards-compatible proxy fields in the configuration:
+
+ ----
+ http.proxy.host
+ http.proxy.port
+ http.proxy.type
+ http.proxy.user (optional)
+ http.proxy.pass (optional)
+ ----
+
+* link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/proxy/MultiProxyManager.java[MultiProxyManager]
+Manages multiple proxies passed through a TXT file. The file should contain connection strings for all proxies, including the protocol and authentication (if needed). The file supports comment lines (`//` or `#`) and empty lines. The file path is passed via the config field below, and the TXT file must be available to all nodes participating in the topology:
+
+ ----
+ http.proxy.file
+ ----
+
+The **MultiProxyManager** load balances across proxies using one of the following schemes. The load balancing scheme can be passed via the config using `http.proxy.rotation`; the default value is `ROUND_ROBIN`:
+
+* ROUND_ROBIN
+Evenly distributes load across all proxies
+* RANDOM
+Randomly selects proxies using the native Java random number generator. The RNG is seeded with the nanosecond time at instantiation
+* LEAST_USED
+Selects the proxy with the least amount of usage. This is performed lazily for speed and therefore will not account for changes in usage during the selection process. If no custom implementations are used, this should in practice behave the same as **ROUND_ROBIN**
+
+The **SCProxy** class contains all of the information associated with a proxy connection. In addition, it tracks the total usage of the proxy and optionally tracks the location of the proxy IP. Usage information is used for the **LEAST_USED** load balancing scheme. The location information is currently unused, but it is kept so that custom implementations can select proxies by location.
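[Editor's note, not part of the patch: pulling the pieces of this section together, a configuration sketch for the multi-proxy setup could look like the snippet below. The file path and the rotation choice are placeholders.]

[source,yaml]
----
# Sketch of a MultiProxyManager setup (values are placeholders).
http.proxy.manager: "org.apache.stormcrawler.proxy.MultiProxyManager"

# TXT file with one proxy connection string per line; must be readable
# from every node running the topology.
http.proxy.file: "/data/proxies.txt"

# Load balancing scheme: ROUND_ROBIN (default), RANDOM or LEAST_USED.
http.proxy.rotation: "LEAST_USED"
----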
+
+=== Metadata
+
+==== Registering Metadata for Kryo Serialization
+
+If your Apache StormCrawler topology doesn't extend `org.apache.stormcrawler.ConfigurableTopology`, you will need to manually register StormCrawler's `Metadata` class for serialization in Storm. For more information on Kryo serialization in Apache Storm, you can refer to the link:https://storm.apache.org/documentation/Serialization.html[documentation].
+
+To register `Metadata` for serialization, you'll need to import `org.apache.storm.Config` and `org.apache.stormcrawler.Metadata`. Then, in your topology class, you'll register the class with:
+
+[source,java]
+----
+Config.registerSerialization(conf, Metadata.class);
+----
+
+where `conf` is your Storm configuration for the topology.
+
+Alternatively, you can specify it in the configuration file:
+
+[source,yaml]
+----
+topology.kryo.register:
+ - org.apache.stormcrawler.Metadata
+----
+
+==== MetadataTransfer
+
+The class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/util/MetadataTransfer.java[MetadataTransfer] is an important part of the framework and is used in key parts of a pipeline:
+
+* Fetching
+* Parsing
+* Updating bolts
+
+An instance (or extension) of **MetadataTransfer** gets created and configured with the method:
+
+[source,java]
+----
+public static MetadataTransfer getInstance(Map<String, Object> conf)
+----
+
+which takes as parameter the standard Storm configuration.
+
+A **MetadataTransfer** instance has mainly two methods, both returning `Metadata` objects:
+
+* `getMetaForOutlink(String targetURL, String sourceURL, Metadata parentMD)`
+* `filter(Metadata metadata)`
+
+The former is used when creating link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/parse/Outlink.java[Outlinks], i.e., in the parsing bolts but also for handling redirections in the FetcherBolt(s).
+
+The latter is used by extensions of the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt] class to determine which **Metadata** should be persisted.
+
+The behavior of the default **MetadataTransfer** class is driven by configuration only. It has the following options:
+
+* `metadata.transfer`:: list of metadata key values to filter or transfer to the outlinks. See link:https://github.com/DigitalPebble/storm-crawler/blob/main/core/src/main/resources/crawler-default.yaml#L23[crawler-default.yaml]

Review Comment:
   (also applies to the following item)
   
   The links to specific lines in `crawler-default.yaml` are outdated; the line numbers have changed. Maybe replace with a generic "Please see the corresponding comments in link:...[crawler-default.yaml]".
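[Editor's note, not part of the patch: a small illustration of the `metadata.transfer` option discussed above. The metadata key names listed under it are arbitrary examples, not defaults.]

[source,yaml]
----
# Sketch: metadata keys copied from a page to its outlinks
# (the key names below are examples only).
metadata.transfer:
 - seed
 - source.category
----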
-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]