mvolikas commented on code in PR #1714:
URL: https://github.com/apache/stormcrawler/pull/1714#discussion_r2647121833


##########
docs/src/main/asciidoc/quick-start.adoc:
##########
@@ -0,0 +1,211 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Quick Start
+
+These instructions should help you get Apache StormCrawler up and running in 5 
to 15 minutes.
+
+=== Prerequisites
+
+To run StormCrawler, you will need Java SE 17 or later.
+
+Additionally, since we'll be running the required Apache Storm cluster using 
Docker Compose,
+make sure Docker is installed on your operating system.
+
+=== Terminology
+
+Before starting, here is a quick overview of the central Storm concepts and terminology you need to know when working with StormCrawler (a short code sketch after the list shows how they fit together):
+
+- *Topology*: A topology is the overall data processing graph in Storm, 
consisting of spouts and bolts connected together to perform continuous, 
real-time computations.
+
+- *Spout*: A spout is a source component in a Storm topology that emits 
streams of data into the processing pipeline.
+
+- *Bolt*: A bolt processes, transforms, or routes data streams emitted by 
spouts or other bolts within the topology.
+
+- *Flux*: In Apache Storm, Flux is a declarative configuration framework that 
enables you to define and run Storm topologies using YAML files instead of 
writing Java code. This simplifies topology management and deployment.
+
+- *Frontier*: In the context of a web crawler, the Frontier is the component 
responsible for managing and prioritizing the list of URLs to be fetched next.
+
+- *Seed*: In web crawling, a seed is an initial URL or set of URLs from which 
the crawler starts its discovery and fetching process.
+
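+To make these terms concrete, below is a minimal sketch of how a topology is wired together with the Storm Java API. `MySeedSpout` and `MyFetchBolt` are hypothetical placeholder components, not StormCrawler classes:
+
+[source,java]
+----
+import org.apache.storm.generated.StormTopology;
+import org.apache.storm.topology.TopologyBuilder;
+
+TopologyBuilder builder = new TopologyBuilder();
+// The spout emits seed URLs into the topology
+builder.setSpout("seeds", new MySeedSpout());
+// A bolt consumes the spout's stream and fetches each URL
+builder.setBolt("fetch", new MyFetchBolt()).shuffleGrouping("seeds");
+// The resulting graph can then be submitted to a Storm cluster
+StormTopology topology = builder.createTopology();
+----
+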
+=== Bootstrapping a StormCrawler Project
+
+You can quickly generate a new StormCrawler project using the Maven archetype:
+
+[source,shell]
+----
+mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler \
+                       -DarchetypeArtifactId=stormcrawler-archetype \
+                       -DarchetypeVersion=<CURRENT_VERSION>
+----
+
+Be sure to replace `<CURRENT_VERSION>` with the latest released version of 
StormCrawler, which you can find on 
link:https://search.maven.org/artifact/org.apache.stormcrawler/stormcrawler-archetype[search.maven.org].
+
+During the process, you’ll be prompted to provide the following:
+
+* `groupId` (e.g. `com.mycompany.crawler`)
+* `artifactId` (e.g. `stormcrawler`)
+* Version
+* Package name
+* User agent details
+
+IMPORTANT: Specifying a user agent is important for crawler ethics because it 
identifies your crawler to websites, promoting transparency and allowing site 
owners to manage or block requests if needed. Be sure to also provide a website with information about your crawler.
+
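+For illustration only, the user agent details entered here end up in the generated crawler configuration. Assuming the conventional `http.agent.*` configuration keys, setting them programmatically could look like this:
+
+[source,java]
+----
+import org.apache.storm.Config;
+
+Config conf = new Config();
+// Identify the crawler to the websites it visits
+conf.put("http.agent.name", "MyCrawler");
+conf.put("http.agent.url", "https://example.com/crawler-info");
+conf.put("http.agent.email", "crawler@example.com");
+----
+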
+The archetype will generate a fully-structured project including:
+
+* A pre-configured `pom.xml` with the necessary dependencies
+* Default resource files
+* A sample `crawler.flux` configuration
+* A basic configuration file
+
+After generation, navigate into the newly created directory (named after the 
`artifactId` you specified).
+
+TIP: You can learn more about the architecture and how each component works 
together if you look into link:architecture.adoc[the architecture 
documentation].

Review Comment:
   The link to the 'Architecture' page seems broken on the live site. We could use an `xref:` instead, like in a previous link to 'Powered by'.



##########
docs/src/main/asciidoc/internals.adoc:
##########
@@ -0,0 +1,443 @@
+////
+Licensed under the Apache License, Version 2.0 (the "License");
+You may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+https://www.apache.org/licenses/LICENSE-2.0
+////
+== Understanding StormCrawler's Internals
+
+=== Status Stream
+
+The Apache StormCrawler components rely on two Apache Storm streams: the 
_default_ one and another one called _status_.
+
+The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., Elasticsearch, HBase, etc.), which is then used by a spout to send new URLs down the topology.
+
+This is critical for building recursive crawls, where you discover new URLs rather than only processing known ones. The _default_ stream carries the URL being processed and is generally consumed at the end of the pipeline by an indexing bolt (which could also target Elasticsearch, HBase, etc.), regardless of whether the crawler is recursive or not.
+
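+For example, routing both streams in a topology could look like the following sketch, assuming a `TopologyBuilder` named `builder` and hypothetical component names:
+
+[source,java]
+----
+// Default stream: parsed documents flow on towards indexing
+builder.setBolt("index", new MyIndexerBolt()).shuffleGrouping("parse");
+// Status stream: URL status updates flow to the persistence layer
+builder.setBolt("status", new MyStatusUpdaterBolt())
+        .shuffleGrouping("fetch", org.apache.stormcrawler.Constants.StatusStreamName);
+----
+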
+Tuples are emitted on the _status_ stream by the parsing bolts for handling 
outlinks but also to notify that there has been a problem with a URL (e.g., 
unparsable content). It is also used by the fetching bolts to handle 
redirections, exceptions, and unsuccessful fetch status (e.g., HTTP code 400).
+
+A bolt which sends tuples on the _status_ stream declares its output in the 
following way:
+
+[source,java]
+----
+declarer.declareStream(
+    org.apache.stormcrawler.Constants.StatusStreamName,
+    new Fields("url", "metadata", "status"));
+----
+
+You can see this, for instance, in link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java#L149[SimpleFetcherBolt].
+
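+Emitting a tuple on that stream then looks something like the following sketch, assuming an `OutputCollector` named `collector` and the `url` and `metadata` values of the tuple being processed:
+
+[source,java]
+----
+collector.emit(
+        org.apache.stormcrawler.Constants.StatusStreamName,
+        tuple, // anchor on the input tuple
+        new Values(url, metadata, Status.DISCOVERED));
+----
+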
+The 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/Status.java[Status]
 enum has the following values:
+
+DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/spout/package-summary.html[spouts] or "injected" into the storage. The URLs may already be known in the storage.
+REDIRECTION:: set by the fetcher bolts.
+FETCH_ERROR:: set by the fetcher bolts.
+ERROR:: used by either the fetcher, parser, or indexer bolts.
+FETCHED:: set by the StatusStreamBolt bolt (see below).
+
+The difference between FETCH_ERROR and ERROR is that the former is possibly 
transient whereas the latter is terminal. The bolt which is in charge of 
updating the status (see below) can then decide when and whether to schedule a 
new fetch for a URL based on the status value.
+
+The 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer]
 is useful for notifying the storage layer that a URL has been successfully 
processed, i.e., fetched, parsed, and anything else we want to do with the main 
content. It must be placed just before the StatusUpdaterBolt and sends a tuple for the URL on the status stream with a Status value of `FETCHED`.
+
+The class 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt]
 can be extended to handle status updates for a specific backend. It has an 
internal cache of URLs with a `DISCOVERED` status so that they don't get added 
to the backend if they already exist, which is a simple but efficient 
optimisation. It also uses 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/DefaultScheduler.java[DefaultScheduler]
 to compute a next fetch date and calls MetadataTransfer to filter the metadata 
that will be stored in the backend.
+
+In most cases, the extending classes will just need to implement the method 
`store(String URL, Status status, Metadata metadata, Date nextFetch)` and 
handle their own initialisation in `prepare()`. You can find an example of a 
class which extends it in the 
link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt]
 for OpenSearch.
+
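+A minimal sketch of such an extension, following the `store` signature given above (exact signatures may differ between versions); `MyBackendClient` is a hypothetical storage client:
+
+[source,java]
+----
+public class MyStatusUpdaterBolt extends AbstractStatusUpdaterBolt {
+
+    private MyBackendClient client; // hypothetical storage client
+
+    @Override
+    public void prepare(
+            Map<String, Object> stormConf, TopologyContext context, OutputCollector collector) {
+        super.prepare(stormConf, context, collector);
+        // Backend-specific initialisation
+        client = MyBackendClient.connect(stormConf);
+    }
+
+    @Override
+    public void store(String url, Status status, Metadata metadata, Date nextFetch) {
+        // Persist the URL with its status, filtered metadata and next fetch date
+        client.upsert(url, status, metadata, nextFetch);
+    }
+}
+----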
+
+=== Bolts
+
+==== Fetcher Bolts
+
+There are two different bolts for fetching the content of URLs:
+
+* 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java[SimpleFetcherBolt]
+* 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java[FetcherBolt]
+
+Both declare the same output:
+
+[source,java]
+----
+declarer.declare(new Fields("url", "content", "metadata"));
+declarer.declareStream(
+        org.apache.stormcrawler.Constants.StatusStreamName,
+        new Fields("url", "metadata", "status"));
+----
+
+The status stream is used for handling redirections, restrictions by robots directives, and fetch errors, whereas the default stream passes the binary content returned by the server, along with the metadata, to the following components (typically a parsing bolt).
+
+Both use the same xref:protocols[Protocols] implementations and 
xref:urlfilters[URLFilters] to control the redirections.
+
+The **FetcherBolt** has an internal set of queues where the incoming URLs are 
placed based on their hostname/domain/IP (see config `fetcher.queue.mode`) and 
a number of **FetchingThreads** (config `fetcher.threads.number` – 10 by 
default) which pull the URLs to fetch from the **FetchQueues**. When doing so, 
they make sure that a minimum amount of time (set with `fetcher.server.delay` – 
default 1 sec) has passed since the previous URL was fetched from the same 
queue. This mechanism ensures that we can control the rate at which requests 
are sent to the servers. A **FetchQueue** can also be used by more than one 
**FetchingThread** at a time (in which case `fetcher.server.min.delay` is 
used), based on the value of `fetcher.threads.per.queue`.
+
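+These politeness settings are plain configuration entries. For illustration, a sketch of setting them programmatically; the values shown for `fetcher.threads.number` and `fetcher.server.delay` are the defaults mentioned above, the others are illustrative:
+
+[source,java]
+----
+Config conf = new Config();
+conf.put("fetcher.queue.mode", "byHost");  // group URLs into queues by hostname
+conf.put("fetcher.threads.number", 10);    // FetchingThreads (default 10)
+conf.put("fetcher.server.delay", 1.0);     // seconds between requests to a queue (default 1)
+conf.put("fetcher.threads.per.queue", 1);  // threads allowed on the same queue
+----
+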
+Incoming tuples spend very little time in the 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java#L768[execute]
 method of the **FetcherBolt** as they are put in the FetchQueues, which is why 
you'll find that the value of **Execute latency** in the Storm UI is pretty 
low. They get acked later on, after they've been fetched. The metric to watch 
for in the Storm UI is **Process latency**.
+
+The **SimpleFetcherBolt** does not do any of this, hence its name. It just 
fetches incoming tuples in its `execute` method and does not do 
multi-threading. It does enforce politeness by checking when a URL can be fetched and waiting until that is the case. It is up to the user to declare 
multiple instances of the bolt in the Topology class and to manage how the URLs 
get distributed across the instances of **SimpleFetcherBolt**, often with the 
help of the link:https:/

Review Comment:
   Possibly a broken link here?



##########
docs/src/main/asciidoc/internals.adoc:
##########
@@ -0,0 +1,443 @@
+
+=== Indexer Bolts
+The purpose of crawlers is often to index web pages to make them searchable. 
The project contains resources for indexing with popular search solutions such 
as:
+
+* 
link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/com/digitalpebble/stormcrawler/solr/bolt/IndexerBolt.java[Apache
 SOLR]
+* 
link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java[Elasticsearch]
+* 
link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/com/digitalpebble/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS
 CloudSearch]
+
+All of these extend the class 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/AbstractIndexerBolt.java[AbstractIndexerBolt].
+
+The core module also contains a 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/StdOutIndexer.java[simple
 indexer] which dumps the documents into the standard output – useful for 
debugging – as well as a 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer].
+
+The abstract class handles the basic functionality of filtering a document to index, mapping the metadata (i.e., determining which metadata to keep for indexing and under which field name), and using the canonical tag (if any). This allows implementations to focus on communicating with the indexing APIs.
+
+The indexing bolt is often the penultimate component in a pipeline and takes the output of a parsing bolt on the default stream. The output of the indexing bolts is on the _status_ stream:
+
+[source,java]
+----
+public void declareOutputFields(OutputFieldsDeclarer declarer) {
+    declarer.declareStream(
+            org.apache.stormcrawler.Constants.StatusStreamName,
+            new Fields("url", "metadata", "status"));
+}
+----
+
+The 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer]
 is used for cases where no actual indexing is required. It simply generates a 
tuple on the _status_ stream so that any StatusUpdater bolt knows that the URL 
was processed successfully and can update its status and scheduling in the 
corresponding backend.
+
+You can easily build your own custom indexer to integrate with other storage 
systems, such as a vector database for semantic search, a graph database for 
network analysis, or any other specialized data store. By extending 
AbstractIndexerBolt, you only need to implement the logic to communicate with 
your target system, while StormCrawler handles the rest of the pipeline and 
status updates.
+
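+As a sketch of what such an extension might look like, here is a hypothetical indexer that pushes documents to a vector store; `MyVectorStoreClient` is a placeholder, and the `text` field is assumed to come from a parsing bolt:
+
+[source,java]
+----
+public class VectorIndexerBolt extends AbstractIndexerBolt {
+
+    private MyVectorStoreClient client; // hypothetical storage client
+    private OutputCollector collector;
+
+    @Override
+    public void prepare(
+            Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
+        super.prepare(conf, context, collector);
+        this.collector = collector;
+        client = MyVectorStoreClient.connect(conf);
+    }
+
+    @Override
+    public void execute(Tuple tuple) {
+        String url = tuple.getStringByField("url");
+        Metadata metadata = (Metadata) tuple.getValueByField("metadata");
+        String text = tuple.getStringByField("text");
+        // Send the document to the target system
+        client.index(url, text, metadata);
+        // Notify the status stream that the URL was processed successfully
+        collector.emit(
+                org.apache.stormcrawler.Constants.StatusStreamName,
+                tuple,
+                new Values(url, metadata, Status.FETCHED));
+        collector.ack(tuple);
+    }
+
+    @Override
+    public void declareOutputFields(OutputFieldsDeclarer declarer) {
+        declarer.declareStream(
+                org.apache.stormcrawler.Constants.StatusStreamName,
+                new Fields("url", "metadata", "status"));
+    }
+}
+----
+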
+=== Parser Bolts
+==== JSoupParserBolt
+
+The 
link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt]
 can be used to parse HTML documents and extract the outlinks, text, and metadata they contain. If you want to parse non-HTML documents, use the 
link:https://github.com/apache/stormcrawler/tree/main/external/src/main/java/com/digitalpebble/storm/crawler/tika[Tika-based
 ParserBolt] from the external modules.

Review Comment:
   This link seems broken. We should probably change all the digitalpebble links.



##########
docs/src/main/asciidoc/internals.adoc:
##########
@@ -0,0 +1,443 @@
+
+=== Indexer Bolts
+The purpose of crawlers is often to index web pages to make them searchable. 
The project contains resources for indexing with popular search solutions such 
as:
+
+* 
link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/com/digitalpebble/stormcrawler/solr/bolt/IndexerBolt.java[Apache
 SOLR]
+* 
link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java[Elasticsearch]

Review Comment:
   The links here are not valid and I guess we don't want to reference 
ElasticSearch since we no longer support it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
