mvolikas commented on code in PR #1714: URL: https://github.com/apache/stormcrawler/pull/1714#discussion_r2647121833
########## docs/src/main/asciidoc/quick-start.adoc: ########## @@ -0,0 +1,211 @@ +//// +Licensed under the Apache License, Version 2.0 (the "License"); +You may not use this file except in compliance with the License. +You may obtain a copy of the License at: +https://www.apache.org/licenses/LICENSE-2.0 +//// +== Quick Start + +These instructions should help you get Apache StormCrawler up and running in 5 to 15 minutes. + +=== Prerequisites + +To run StormCrawler, you will need Java SE 17 or later. + +Additionally, since we'll be running the required Apache Storm cluster using Docker Compose, +make sure Docker is installed on your operating system. + +=== Terminology + +Before starting, we will give a quick overview of the central Storm concepts and terminology you need to know when working with StormCrawler: + +- *Topology*: A topology is the overall data processing graph in Storm, consisting of spouts and bolts connected together to perform continuous, real-time computations. + +- *Spout*: A spout is a source component in a Storm topology that emits streams of data into the processing pipeline. + +- *Bolt*: A bolt processes, transforms, or routes data streams emitted by spouts or other bolts within the topology. + +- *Flux*: In Apache Storm, Flux is a declarative configuration framework that enables you to define and run Storm topologies using YAML files instead of writing Java code. This simplifies topology management and deployment. + +- *Frontier*: In the context of a web crawler, the Frontier is the component responsible for managing and prioritizing the list of URLs to be fetched next. + +- *Seed*: In web crawling, a seed is an initial URL or set of URLs from which the crawler starts its discovery and fetching process. + +=== Bootstrapping a StormCrawler Project + +You can quickly generate a new StormCrawler project using the Maven archetype: + +[source,shell] +---- +mvn archetype:generate -DarchetypeGroupId=org.apache.stormcrawler \ + -DarchetypeArtifactId=stormcrawler-archetype \ + -DarchetypeVersion=<CURRENT_VERSION> +---- + +Be sure to replace `<CURRENT_VERSION>` with the latest released version of StormCrawler, which you can find on link:https://search.maven.org/artifact/org.apache.stormcrawler/stormcrawler-archetype[search.maven.org]. + +During the process, you’ll be prompted to provide the following: + +* `groupId` (e.g. `com.mycompany.crawler`) +* `artifactId` (e.g. `stormcrawler`) +* Version +* Package name +* User agent details + +IMPORTANT: Specifying a user agent is important for crawler ethics because it identifies your crawler to websites, promoting transparency and allowing site owners to manage or block requests if needed. Be sure to provide a website with information about your crawler as well. + +The archetype will generate a fully structured project including: + +* A pre-configured `pom.xml` with the necessary dependencies +* Default resource files +* A sample `crawler.flux` configuration +* A basic configuration file + +After generation, navigate into the newly created directory (named after the `artifactId` you specified). + +TIP: You can learn more about the architecture and how each component works together if you look into link:architecture.adoc[the architecture documentation]. Review Comment: The link to the 'Architecture' page seems broken in the live site. We could use an `xref:` instead, like in a previous link to 'Powered by'.
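For example, something like the following should produce a working cross-reference in the generated site (assuming both pages are built as part of the same documentation module; worth verifying against the live build):

[source,asciidoc]
----
TIP: You can learn more about the architecture and how each component works together if you look into xref:architecture.adoc[the architecture documentation].
----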
########## docs/src/main/asciidoc/internals.adoc: ########## @@ -0,0 +1,443 @@ +//// +Licensed under the Apache License, Version 2.0 (the "License"); +You may not use this file except in compliance with the License. +You may obtain a copy of the License at: +https://www.apache.org/licenses/LICENSE-2.0 +//// +== Understanding StormCrawler's Internals + +=== Status Stream + +The Apache StormCrawler components rely on two Apache Storm streams: the _default_ one and another one called _status_. + +The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., Elasticsearch, HBase, etc.), which is then used by a Spout to send new URLs down the topology. + +This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be Elasticsearch, HBase, etc.), regardless of whether the crawler is recursive or not. + +Tuples are emitted on the _status_ stream by the parsing bolts for handling outlinks but also to notify that there has been a problem with a URL (e.g., unparsable content). It is also used by the fetching bolts to handle redirections, exceptions, and unsuccessful fetch status (e.g., HTTP code 400). + +A bolt which sends tuples on the _status_ stream declares its output in the following way: + +[source,java] +---- +declarer.declareStream( + org.apache.stormcrawler.Constants.StatusStreamName, + new Fields("url", "metadata", "status")); +---- + +As you can see for instance in link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java#L149[SimpleFetcherBolt]. + +The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/Status.java[Status] enum has the following values: + +* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/spout/package-summary.html[spouts] or "injected" into the storage. The URLs may already be known in the storage. +* REDIRECTION:: set by the fetcher bolts. +* FETCH_ERROR:: set by the fetcher bolts. +* ERROR:: used by either the fetcher, parser, or indexer bolts. +* FETCHED:: set by the StatusStreamBolt bolt (see below). + +The difference between FETCH_ERROR and ERROR is that the former is possibly transient whereas the latter is terminal. The bolt which is in charge of updating the status (see below) can then decide when and whether to schedule a new fetch for a URL based on the status value. + +The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer] is useful for notifying the storage layer that a URL has been successfully processed, i.e., fetched, parsed, and anything else we want to do with the main content. It must be placed just before the StatusUpdaterBolt and sends a tuple for the URL on the status stream with a Status value of `fetched`. + +The class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt] can be extended to handle status updates for a specific backend.
It has an internal cache of URLs with a `discovered` status so that they don't get added to the backend if they already exist, which is a simple but efficient optimisation. It also uses link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/DefaultScheduler.java[DefaultScheduler] to compute a next fetch date and calls MetadataTransfer to filter the metadata that will be stored in the backend. + +In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for OpenSearch. + + +=== Bolts + +==== Fetcher Bolts + +There are actually two different bolts for fetching the content of URLs: + +* link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java[SimpleFetcherBolt] +* link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java[FetcherBolt] + +Both declare the same output: + +[source,java] +---- +declarer.declare(new Fields("url", "content", "metadata")); +declarer.declareStream( + org.apache.stormcrawler.Constants.StatusStreamName, + new Fields("url", "metadata", "status")); +---- + +with the `StatusStream` being used for handling redirections, restrictions by robots directives, or fetch errors, whereas the default stream passes the binary content returned by the server, as well as the metadata, to the following components (typically a parsing bolt). + +Both use the same xref:protocols[Protocols] implementations and xref:urlfilters[URLFilters] to control the redirections. + +The **FetcherBolt** has an internal set of queues where the incoming URLs are placed based on their hostname/domain/IP (see config `fetcher.queue.mode`) and a number of **FetchingThreads** (config `fetcher.threads.number` – 10 by default) which pull the URLs to fetch from the **FetchQueues**. When doing so, they make sure that a minimal amount of time (set with `fetcher.server.delay` – default 1 sec) has passed since the previous URL was fetched from the same queue. This mechanism ensures that we can control the rate at which requests are sent to the servers. A **FetchQueue** can also be used by more than one **FetchingThread** at a time (in which case `fetcher.server.min.delay` is used), based on the value of `fetcher.threads.per.queue`. + +Incoming tuples spend very little time in the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java#L768[execute] method of the **FetcherBolt** as they are put in the FetchQueues, which is why you'll find that the value of **Execute latency** in the Storm UI is pretty low. They get acked later on, after they've been fetched. The metric to watch for in the Storm UI is **Process latency**. + +The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until that is the case.
It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https:/ Review Comment: Possibly a broken link here? ########## docs/src/main/asciidoc/internals.adoc: ########## @@ -0,0 +1,443 @@ +//// +Licensed under the Apache License, Version 2.0 (the "License"); +You may not use this file except in compliance with the License. +You may obtain a copy of the License at: +https://www.apache.org/licenses/LICENSE-2.0 +//// +== Understanding StormCrawler's Internals + +=== Status Stream + +The Apache StormCrawler components rely on two Apache Storm streams: the _default_ one and another one called _status_. + +The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., Elasticsearch, HBase, etc.), which is then used by a Spout to send new URLs down the topology. + +This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be Elasticsearch, HBase, etc.), regardless of whether the crawler is recursive or not. + +Tuples are emitted on the _status_ stream by the parsing bolts for handling outlinks but also to notify that there has been a problem with a URL (e.g., unparsable content). It is also used by the fetching bolts to handle redirections, exceptions, and unsuccessful fetch status (e.g., HTTP code 400). + +A bolt which sends tuples on the _status_ stream declares its output in the following way: + +[source,java] +---- +declarer.declareStream( + org.apache.stormcrawler.Constants.StatusStreamName, + new Fields("url", "metadata", "status")); +---- + +As you can see for instance in link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java#L149[SimpleFetcherBolt]. + +The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/Status.java[Status] enum has the following values: + +* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/spout/package-summary.html[spouts] or "injected" into the storage. The URLs may already be known in the storage. +* REDIRECTION:: set by the fetcher bolts. +* FETCH_ERROR:: set by the fetcher bolts. +* ERROR:: used by either the fetcher, parser, or indexer bolts. +* FETCHED:: set by the StatusStreamBolt bolt (see below). + +The difference between FETCH_ERROR and ERROR is that the former is possibly transient whereas the latter is terminal. The bolt which is in charge of updating the status (see below) can then decide when and whether to schedule a new fetch for a URL based on the status value. + +The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer] is useful for notifying the storage layer that a URL has been successfully processed, i.e., fetched, parsed, and anything else we want to do with the main content.
It must be placed just before the StatusUpdaterBolt and sends a tuple for the URL on the status stream with a Status value of `fetched`. + +The class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt] can be extended to handle status updates for a specific backend. It has an internal cache of URLs with a `discovered` status so that they don't get added to the backend if they already exist, which is a simple but efficient optimisation. It also uses link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/DefaultScheduler.java[DefaultScheduler] to compute a next fetch date and calls MetadataTransfer to filter the metadata that will be stored in the backend. + +In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for OpenSearch. + + +=== Bolts + +==== Fetcher Bolts + +There are actually two different bolts for fetching the content of URLs: + +* link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java[SimpleFetcherBolt] +* link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java[FetcherBolt] + +Both declare the same output: + +[source,java] +---- +declarer.declare(new Fields("url", "content", "metadata")); +declarer.declareStream( + org.apache.stormcrawler.Constants.StatusStreamName, + new Fields("url", "metadata", "status")); +---- + +with the `StatusStream` being used for handling redirections, restrictions by robots directives, or fetch errors, whereas the default stream passes the binary content returned by the server, as well as the metadata, to the following components (typically a parsing bolt). + +Both use the same xref:protocols[Protocols] implementations and xref:urlfilters[URLFilters] to control the redirections. + +The **FetcherBolt** has an internal set of queues where the incoming URLs are placed based on their hostname/domain/IP (see config `fetcher.queue.mode`) and a number of **FetchingThreads** (config `fetcher.threads.number` – 10 by default) which pull the URLs to fetch from the **FetchQueues**. When doing so, they make sure that a minimal amount of time (set with `fetcher.server.delay` – default 1 sec) has passed since the previous URL was fetched from the same queue. This mechanism ensures that we can control the rate at which requests are sent to the servers. A **FetchQueue** can also be used by more than one **FetchingThread** at a time (in which case `fetcher.server.min.delay` is used), based on the value of `fetcher.threads.per.queue`. + +Incoming tuples spend very little time in the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java#L768[execute] method of the **FetcherBolt** as they are put in the FetchQueues, which is why you'll find that the value of **Execute latency** in the Storm UI is pretty low. They get acked later on, after they've been fetched. The metric to watch for in the Storm UI is **Process latency**.
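As an aside, to make the queueing and politeness mechanism described above more concrete, here is a minimal, self-contained sketch of the delay check a fetch queue could perform. This is an illustration only, not the actual FetcherBolt code; the class and member names are made up:

[source,java]
----
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical simplification of a per-host fetch queue with a politeness delay.
public class FetchQueueSketch {

    private final Deque<String> urls = new ArrayDeque<>();
    private final long delayMillis;      // e.g. fetcher.server.delay converted to milliseconds
    private long nextFetchAllowed = 0L;  // earliest time the next fetch may start

    public FetchQueueSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    public synchronized void add(String url) {
        urls.addLast(url);
    }

    /** Returns the next URL to fetch, or null if the politeness delay has not elapsed yet. */
    public synchronized String poll() {
        long now = System.currentTimeMillis();
        if (urls.isEmpty() || now < nextFetchAllowed) {
            return null; // a fetching thread would simply move on to another queue
        }
        nextFetchAllowed = now + delayMillis;
        return urls.pollFirst();
    }
}
----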
+ +The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until that is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https:/ + +=== Indexer Bolts +The purpose of crawlers is often to index web pages to make them searchable. The project contains resources for indexing with popular search solutions such as: + +* link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/com/digitalpebble/stormcrawler/solr/bolt/IndexerBolt.java[Apache SOLR] +* link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java[Elasticsearch] +* link:https://github.com/apache/stormcrawler/blob/main/external/aws/src/main/java/com/digitalpebble/stormcrawler/aws/bolt/CloudSearchIndexerBolt.java[AWS CloudSearch] + +All of these extend the class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/AbstractIndexerBolt.java[AbstractIndexerBolt]. + +The core module also contains a link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/StdOutIndexer.java[simple indexer] which dumps the documents into the standard output – useful for debugging – as well as a link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer]. + +The basic functionality of filtering a document to index, mapping the metadata (i.e., deciding which metadata to keep for indexing and under which field name), and using the canonical tag (if any) is handled by the abstract class. This allows implementations to focus on communication with the indexing APIs. + +Indexing is often the penultimate component in a pipeline and takes the output of a Parsing bolt on the default stream. The output of the indexing bolts is on the _status_ stream: + +[source,java] +---- +public void declareOutputFields(OutputFieldsDeclarer declarer) { + declarer.declareStream( + org.apache.stormcrawler.Constants.StatusStreamName, + new Fields("url", "metadata", "status")); +} +---- + +The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer] is used for cases where no actual indexing is required. It simply generates a tuple on the _status_ stream so that any StatusUpdater bolt knows that the URL was processed successfully and can update its status and scheduling in the corresponding backend. + +You can easily build your own custom indexer to integrate with other storage systems, such as a vector database for semantic search, a graph database for network analysis, or any other specialized data store. By extending AbstractIndexerBolt, you only need to implement the logic to communicate with your target system, while StormCrawler handles the rest of the pipeline and status updates. + +=== Parser Bolts +==== JSoupParserBolt + +The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java[JSoupParserBolt] can be used to parse HTML documents and extract the outlinks, text, and metadata they contain.
If you want to parse non-HTML documents, use the link:https://github.com/apache/stormcrawler/tree/main/external/src/main/java/com/digitalpebble/storm/crawler/tika[Tika-based ParserBolt] from the external modules. Review Comment: This link seems broken. We probably need to change all the digitalpebble links. ########## docs/src/main/asciidoc/internals.adoc: ########## @@ -0,0 +1,443 @@ +//// +Licensed under the Apache License, Version 2.0 (the "License"); +You may not use this file except in compliance with the License. +You may obtain a copy of the License at: +https://www.apache.org/licenses/LICENSE-2.0 +//// +== Understanding StormCrawler's Internals + +=== Status Stream + +The Apache StormCrawler components rely on two Apache Storm streams: the _default_ one and another one called _status_. + +The aim of the _status_ stream is to pass information about URLs to a persistence layer. Typically, a bespoke bolt will take the tuples coming from the _status_ stream and update the information about URLs in some sort of storage (e.g., Elasticsearch, HBase, etc.), which is then used by a Spout to send new URLs down the topology. + +This is critical for building recursive crawls (i.e., you discover new URLs and not just process known ones). The _default_ stream is used for the URL being processed and is generally used at the end of the pipeline by an indexing bolt (which could also be Elasticsearch, HBase, etc.), regardless of whether the crawler is recursive or not. + +Tuples are emitted on the _status_ stream by the parsing bolts for handling outlinks but also to notify that there has been a problem with a URL (e.g., unparsable content). It is also used by the fetching bolts to handle redirections, exceptions, and unsuccessful fetch status (e.g., HTTP code 400). + +A bolt which sends tuples on the _status_ stream declares its output in the following way: + +[source,java] +---- +declarer.declareStream( + org.apache.stormcrawler.Constants.StatusStreamName, + new Fields("url", "metadata", "status")); +---- + +As you can see for instance in link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java#L149[SimpleFetcherBolt]. + +The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/Status.java[Status] enum has the following values: + +* DISCOVERED:: outlinks found by the parsers or "seed" URLs emitted into the topology by one of the link:https://stormcrawler.net/docs/api/com/digitalpebble/stormcrawler/spout/package-summary.html[spouts] or "injected" into the storage. The URLs may already be known in the storage. +* REDIRECTION:: set by the fetcher bolts. +* FETCH_ERROR:: set by the fetcher bolts. +* ERROR:: used by either the fetcher, parser, or indexer bolts. +* FETCHED:: set by the StatusStreamBolt bolt (see below). + +The difference between FETCH_ERROR and ERROR is that the former is possibly transient whereas the latter is terminal. The bolt which is in charge of updating the status (see below) can then decide when and whether to schedule a new fetch for a URL based on the status value. + +The link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/DummyIndexer.java[DummyIndexer] is useful for notifying the storage layer that a URL has been successfully processed, i.e., fetched, parsed, and anything else we want to do with the main content.
It must be placed just before the StatusUpdaterBolt and sends a tuple for the URL on the status stream with a Status value of `fetched`. + +The class link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java[AbstractStatusUpdaterBolt] can be extended to handle status updates for a specific backend. It has an internal cache of URLs with a `discovered` status so that they don't get added to the backend if they already exist, which is a simple but efficient optimisation. It also uses link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/DefaultScheduler.java[DefaultScheduler] to compute a next fetch date and calls MetadataTransfer to filter the metadata that will be stored in the backend. + +In most cases, the extending classes will just need to implement the method `store(String URL, Status status, Metadata metadata, Date nextFetch)` and handle their own initialisation in `prepare()`. You can find an example of a class which extends it in the link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java[StatusUpdaterBolt] for OpenSearch. + + +=== Bolts + +==== Fetcher Bolts + +There are actually two different bolts for fetching the content of URLs: + +* link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java[SimpleFetcherBolt] +* link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java[FetcherBolt] + +Both declare the same output: + +[source,java] +---- +declarer.declare(new Fields("url", "content", "metadata")); +declarer.declareStream( + org.apache.stormcrawler.Constants.StatusStreamName, + new Fields("url", "metadata", "status")); +---- + +with the `StatusStream` being used for handling redirections, restrictions by robots directives, or fetch errors, whereas the default stream passes the binary content returned by the server, as well as the metadata, to the following components (typically a parsing bolt). + +Both use the same xref:protocols[Protocols] implementations and xref:urlfilters[URLFilters] to control the redirections. + +The **FetcherBolt** has an internal set of queues where the incoming URLs are placed based on their hostname/domain/IP (see config `fetcher.queue.mode`) and a number of **FetchingThreads** (config `fetcher.threads.number` – 10 by default) which pull the URLs to fetch from the **FetchQueues**. When doing so, they make sure that a minimal amount of time (set with `fetcher.server.delay` – default 1 sec) has passed since the previous URL was fetched from the same queue. This mechanism ensures that we can control the rate at which requests are sent to the servers. A **FetchQueue** can also be used by more than one **FetchingThread** at a time (in which case `fetcher.server.min.delay` is used), based on the value of `fetcher.threads.per.queue`. + +Incoming tuples spend very little time in the link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java#L768[execute] method of the **FetcherBolt** as they are put in the FetchQueues, which is why you'll find that the value of **Execute latency** in the Storm UI is pretty low. They get acked later on, after they've been fetched. The metric to watch for in the Storm UI is **Process latency**.
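To illustrate the extension point described above, a custom backend implementation could look roughly like the following. This is only a sketch based on the `store(...)` signature quoted in this document (the exact signature should be checked against the current AbstractStatusUpdaterBolt), with an in-memory map standing in for a real backend:

[source,java]
----
import java.util.Date;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.stormcrawler.Metadata;
import org.apache.stormcrawler.persistence.AbstractStatusUpdaterBolt;
import org.apache.stormcrawler.persistence.Status;

// Sketch only: persists URL statuses to an in-memory map instead of a real store.
public class InMemoryStatusUpdaterBolt extends AbstractStatusUpdaterBolt {

    private ConcurrentHashMap<String, String> backend;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        super.prepare(stormConf, context, collector);
        // a real implementation would open a connection to its storage here
        backend = new ConcurrentHashMap<>();
    }

    public void store(String url, Status status, Metadata metadata, Date nextFetch) {
        // persist the URL with its status and the next fetch date computed by the scheduler
        backend.put(url, status.name() + "|" + nextFetch);
    }
}
----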
+ +The **SimpleFetcherBolt** does not do any of this, hence its name. It just fetches incoming tuples in its `execute` method and does not do multi-threading. It does enforce politeness by checking when a URL can be fetched and will wait until that is the case. It is up to the user to declare multiple instances of the bolt in the Topology class and to manage how the URLs get distributed across the instances of **SimpleFetcherBolt**, often with the help of the link:https:/ + +=== Indexer Bolts +The purpose of crawlers is often to index web pages to make them searchable. The project contains resources for indexing with popular search solutions such as: + +* link:https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/com/digitalpebble/stormcrawler/solr/bolt/IndexerBolt.java[Apache SOLR] +* link:https://github.com/apache/stormcrawler/blob/main/external/opensearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/bolt/IndexerBolt.java[Elasticsearch] Review Comment: The links here are not valid, and I guess we don't want to reference Elasticsearch since we no longer support it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
