This is an automated email from the ASF dual-hosted git repository.
rzo1 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/stormcrawler.git
The following commit(s) were added to refs/heads/main by this push:
new df259067 Remove Incubating references since we have graduated (#1538)
df259067 is described below
commit df25906790dc8c960f0ecec932b0795b374cd935
Author: Richard Zowalla <[email protected]>
AuthorDate: Tue May 27 21:30:27 2025 +0200
Remove Incubating references since we have graduated (#1538)
* Remove Incubating references since we have graduated
* Rename GitHub/Git/Dist references in preparation for the move out of the
incubator (infra ticket already filed)
* Fix formatting
---
.github/ISSUE_TEMPLATE/bug-report.yml | 4 +--
.github/ISSUE_TEMPLATE/feature-request.yml | 2 +-
.github/workflows/snapshots.yaml | 2 +-
DISCLAIMER | 10 ------
NOTICE | 2 +-
README.md | 14 ++++----
RELEASING.md | 38 +++++++++++-----------
core/pom.xml | 2 +-
.../org/apache/stormcrawler/bolt/FetcherBolt.java | 6 ++--
.../apache/stormcrawler/bolt/JSoupParserBolt.java | 2 +-
.../stormcrawler/bolt/SimpleFetcherBolt.java | 4 +--
.../filtering/basic/BasicURLNormalizer.java | 2 +-
.../filtering/regex/FastURLFilter.java | 2 +-
.../filtering/sitemap/SitemapFilter.java | 4 +--
.../persistence/AbstractStatusUpdaterBolt.java | 2 +-
.../stormcrawler/protocol/ProtocolResponse.java | 2 +-
.../stormcrawler/util/CharsetIdentification.java | 2 +-
.../filtering/BasicURLNormalizerTest.java | 2 +-
.../stormcrawler/jsoup/JSoupFiltersTest.java | 2 +-
.../stormcrawler/parse/StackOverflowTest.java | 4 +--
.../stormcrawler/parse/filter/XPathFilterTest.java | 2 +-
core/src/test/resources/longtext.html | 24 +++++++-------
core/src/test/resources/stackexception.html | 24 +++++++-------
.../test/resources/stormcrawler.apache.org.html | 24 +++++++-------
external/aws/pom.xml | 2 +-
external/langid/pom.xml | 2 +-
external/opensearch/README.md | 8 ++---
.../main/resources/archetype-resources/README.md | 2 +-
external/opensearch/pom.xml | 2 +-
.../stormcrawler/opensearch/bolt/DeletionBolt.java | 2 +-
.../stormcrawler/opensearch/bolt/IndexerBolt.java | 2 +-
.../opensearch/persistence/StatusUpdaterBolt.java | 2 +-
.../opensearch/bolt/IndexerBoltTest.java | 2 +-
.../opensearch/bolt/StatusBoltTest.java | 2 +-
external/playwright/README.md | 2 +-
external/playwright/pom.xml | 2 +-
external/solr/README.md | 8 ++---
.../main/resources/archetype-resources/README.md | 2 +-
external/solr/pom.xml | 2 +-
external/sql/README.md | 4 +--
external/sql/pom.xml | 2 +-
external/tika/README.md | 2 +-
external/tika/pom.xml | 2 +-
.../apache/stormcrawler/tika/ParserBoltTest.java | 2 +-
external/urlfrontier/pom.xml | 2 +-
.../urlfrontier/ManagedChannelUtil.java | 2 +-
external/warc/pom.xml | 2 +-
.../stormcrawler/warc/WARCRequestRecordFormat.java | 2 +-
pom.xml | 12 +++----
49 files changed, 123 insertions(+), 133 deletions(-)
diff --git a/.github/ISSUE_TEMPLATE/bug-report.yml
b/.github/ISSUE_TEMPLATE/bug-report.yml
index 82d8204a..becef783 100644
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -8,7 +8,7 @@ body:
id: version
attributes:
label: Version
- description: What version of "Apache StormCrawler (Incubating) are you
using?"
+ description: What version of "Apache StormCrawler are you using?"
options:
- main branch
- stormcrawler-3.2.0
@@ -35,7 +35,7 @@ body:
attributes:
label: How to reproduce
placeholder: |
- + Which version of Apache StormCrawler (Incubating) version to use.
+ + Which version of Apache StormCrawler to use.
validations:
required: true
- type: textarea
diff --git a/.github/ISSUE_TEMPLATE/feature-request.yml
b/.github/ISSUE_TEMPLATE/feature-request.yml
index 4aaa7ac0..54e4bd4e 100644
--- a/.github/ISSUE_TEMPLATE/feature-request.yml
+++ b/.github/ISSUE_TEMPLATE/feature-request.yml
@@ -1,6 +1,6 @@
name: Feature Request
title: "[FEATURE] "
-description: Suggest an idea for Apache StormCrawler (Incubating)
+description: Suggest an idea for Apache StormCrawler
labels: [ "feature" ]
body:
diff --git a/.github/workflows/snapshots.yaml b/.github/workflows/snapshots.yaml
index 6c4317b8..495beb66 100644
--- a/.github/workflows/snapshots.yaml
+++ b/.github/workflows/snapshots.yaml
@@ -25,7 +25,7 @@ on:
jobs:
upload_to_nightlies:
- if: github.repository == 'apache/incubator-stormcrawler'
+ if: github.repository == 'apache/stormcrawler'
name: Publish Snapshots
runs-on: ubuntu-latest
steps:
diff --git a/DISCLAIMER b/DISCLAIMER
deleted file mode 100644
index 065859f7..00000000
--- a/DISCLAIMER
+++ /dev/null
@@ -1,10 +0,0 @@
-Apache StormCrawler is an effort undergoing incubation at the Apache Software
-Foundation (ASF), sponsored by the Apache Incubator PMC.
-
-Incubation is required of all newly accepted projects until a further review
-indicates that the infrastructure, communications, and decision making process
-have stabilized in a manner consistent with other successful ASF projects.
-
-While incubation status is not necessarily a reflection of the completeness
-or stability of the code, it does indicate that the project has yet to be
-fully endorsed by the ASF.
diff --git a/NOTICE b/NOTICE
index c0bc3973..63f83326 100644
--- a/NOTICE
+++ b/NOTICE
@@ -1,4 +1,4 @@
-Apache StormCrawler (Incubating)
+Apache StormCrawler
Copyright 2025 The Apache Software Foundation
This product includes software developed by The Apache Software
diff --git a/README.md b/README.md
index 97d55741..8294aa45 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,11 @@
[](https://stormcrawler.apache.org/)
=============
-[](http://www.apache.org/licenses/LICENSE-2.0)
-
-[](https://javadoc.io/doc/org.apache.stormcrawler/stormcrawler-core/)
+[](http://www.apache.org/licenses/LICENSE-2.0)
+
+[](https://javadoc.io/doc/org.apache.stormcrawler/stormcrawler-core/)
-Apache StormCrawler (Incubating) is an open source collection of resources for
building low-latency, scalable web crawlers on [Apache
Storm](http://storm.apache.org/). It is provided under [Apache
License](http://www.apache.org/licenses/LICENSE-2.0) and is written mostly in
Java.
+Apache StormCrawler is an open source collection of resources for building
low-latency, scalable web crawlers on [Apache Storm](http://storm.apache.org/).
It is provided under [Apache
License](http://www.apache.org/licenses/LICENSE-2.0) and is written mostly in
Java.
## Quickstart
@@ -24,13 +24,13 @@ You'll be asked to enter a groupId (e.g.
com.mycompany.crawler), an artefactId (
This will not only create a fully formed project containing a POM with the
dependency above but also the default resource files, a default CrawlTopology
class and a configuration file. Enter the directory you just created (should be
the same as the artefactId you specified earlier) and follow the instructions
on the README file.
-Alternatively if you can't or don't want to use the Maven archetype above, you
can simply copy the files from
[archetype-resources](https://github.com/apache/incubator-stormcrawler/tree/master/archetype/src/main/resources/archetype-resources).
+Alternatively if you can't or don't want to use the Maven archetype above, you
can simply copy the files from
[archetype-resources](https://github.com/apache/stormcrawler/tree/master/archetype/src/main/resources/archetype-resources).
-Have a look at
[crawler.flux](https://github.com/apache/incubator-stormcrawler/blob/master/archetype/src/main/resources/archetype-resources/crawler.flux),
the
[crawler-conf.yaml](https://github.com/apache/incubator-stormcrawler/blob/master/archetype/src/main/resources/archetype-resources/crawler-conf.yaml)
file as well as the files in
[src/main/resources/](https://github.com/apache/incubator-stormcrawler/tree/master/archetype/src/main/resources/archetype-resources/src/main/resources),
th [...]
+Have a look at
[crawler.flux](https://github.com/apache/stormcrawler/blob/master/archetype/src/main/resources/archetype-resources/crawler.flux),
the
[crawler-conf.yaml](https://github.com/apache/stormcrawler/blob/master/archetype/src/main/resources/archetype-resources/crawler-conf.yaml)
file as well as the files in
[src/main/resources/](https://github.com/apache/stormcrawler/tree/master/archetype/src/main/resources/archetype-resources/src/main/resources),
they are all that is needed to r [...]
## Getting help
-The [WIKI](https://github.com/apache/incubator-stormcrawler/wiki) is a good
place to start your investigations but if you are stuck please use the tag
[stormcrawler](http://stackoverflow.com/questions/tagged/stormcrawler) on
StackOverflow or ask a question in the
[discussions](https://github.com/apache/incubator-stormcrawler/discussions)
section.
+The [WIKI](https://github.com/apache/stormcrawler/wiki) is a good place to
start your investigations but if you are stuck please use the tag
[stormcrawler](http://stackoverflow.com/questions/tagged/stormcrawler) on
StackOverflow or ask a question in the
[discussions](https://github.com/apache/stormcrawler/discussions) section.
The project website has a page listing companies providing [commercial
support](https://stormcrawler.apache.org/support/) for Apache StormCrawler.
diff --git a/RELEASING.md b/RELEASING.md
index fcd22d5a..ad9afac5 100644
--- a/RELEASING.md
+++ b/RELEASING.md
@@ -1,10 +1,10 @@
-# Guide to release Apache StormCrawler (Incubating)
+# Guide to release Apache StormCrawler
## Release Preparation
- Select a release manager on the dev mailing list. A release manager should
be a committer and should preferably switch between releases to have a transfer
in knowledge.
-- Create an issue for a new release in
<https://github.com/apache/incubator-stormcrawler/issues>
-- Review all [issues](https://github.com/apache/incubator-stormcrawler/issues)
associated with the release. All issues should be resolved and closed.
+- Create an issue for a new release in
<https://github.com/apache/stormcrawler/issues>
+- Review all [issues](https://github.com/apache/stormcrawler/issues)
associated with the release. All issues should be resolved and closed.
- Any issues assigned to the release that are not complete should be assigned
to the next release. Any critical or blocker issues should be resolved on the
mailing list. Discuss any issues that you are unsure of on the mailing list.
## Steps for the Release Manager
@@ -13,7 +13,7 @@ The following steps need only to be performed once.
- Make sure you have your PGP fingerprint added into <https://id.apache.org/>
- Make sure you have your PGP keys password.
-- Add your PGP key to the
[KEYS](https://dist.apache.org/repos/dist/release/incubator/stormcrawler/KEYS)
file.
+- Add your PGP key to the
[KEYS](https://dist.apache.org/repos/dist/release/stormcrawler/KEYS) file.
Examples of adding your key to this file:
@@ -84,7 +84,7 @@ export GPG_TTY=$(tty)
## Release Steps
-- Checkout the Apache StormCrawler main branch: `git clone
[email protected]:apache/incubator-stormcrawler.git`
+- Checkout the Apache StormCrawler main branch: `git clone
[email protected]:apache/stormcrawler.git`
- Execute a complete test: `mvn test`
- Ensure to have a working Docker environment on your release machine.
Otherwise, coverage computation goes wrong and the build will fail.
- Check the current results of the last GitHub action runs.
@@ -152,14 +152,14 @@ gpg --homedir . --output
apache-stormcrawler-x.y.z-incubating-source-release.ta
- Run a global replace of the old version with the new version.
- Prepare a preview via the staging environment of the website.
- Ensure the website is updated on <https://stormcrawler.staged.apache.org>
-- Note: Instruction on how to do so can be found on
<https://github.com/apache/incubator-stormcrawler-site>
+- Note: Instructions on how to do so can be found on
<https://github.com/apache/stormcrawler-site>
### Create a draft release on Github
-- Create a new Draft Release -- on
<https://github.com/apache/incubator-stormcrawler/releases>, click `Draft a new
release` and select the `stormcrawler-X.Y.Z` tag.
+- Create a new Draft Release -- on
<https://github.com/apache/stormcrawler/releases>, click `Draft a new release`
and select the `stormcrawler-X.Y.Z` tag.
- Click the `Generate Release Notes` (**MAKE SURE TO SELECT THE CORRECT
PREVIOUS RELEASE AS THE BASE**). Copy and paste the Disclaimer and Release
Summary from the previous release and update the Release Summary as appropriate.
- Click the `Set as pre-release` button.
-- Click `Publish release`. The release should not have `*-rc1` in its title,
e.g.:
`https://github.com/apache/incubator-stormcrawler/releases/tag/stormcrawler-3.2.0`
+- Click `Publish release`. The release should not have `*-rc1` in its title,
e.g.: `https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.2.0`
#### Create a VOTE Thread
@@ -171,20 +171,20 @@ The VOTE process is two-fold:
- Be sure to replace all values in `[]` with the appropriate values.
```bash
-Message Subject: [VOTE] Apache StormCrawler (Incubating) [version] Release
Candidate
+Message Subject: [VOTE] Apache StormCrawler [version] Release Candidate
----
Hi folks,
-I have posted a [Nth] release candidate for the Apache StormCrawler
(Incubating) [version] release and it is ready for testing.
+I have posted a [Nth] release candidate for the Apache StormCrawler [version]
release and it is ready for testing.
<Add a summary to highlight notable changes>
Thank you to everyone who contributed to this release, including all of our
users and the people who submitted bug reports,
contributed code or documentation enhancements.
-The release was made using the Apache StormCrawler (Incubating) release
process, documented here:
-https://github.com/apache/incubator-stormcrawler/blob/main/RELEASING.md
+The release was made using the Apache StormCrawler release process, documented
here:
+https://github.com/apache/stormcrawler/blob/main/RELEASING.md
Source:
@@ -192,7 +192,7 @@
https://dist.apache.org/repos/dist/dev/incubator/stormcrawler/stormcrawler-x.y.z
Tag:
-https://github.com/apache/incubator-stormcrawler/releases/tag/stormcrawler-x.y.z
+https://github.com/apache/stormcrawler/releases/tag/stormcrawler-x.y.z
Commit Hash:
@@ -250,7 +250,7 @@ The vote is successful if at least 3 *+1* votes are
received from IPMC members a
Acknowledge the voting results on the mailing list in the VOTE thread by
sending a mail.
```bash
-Message Subject: [RESULT] [VOTE] Apache StormCrawler (Incubating) [version]
+Message Subject: [RESULT] [VOTE] Apache StormCrawler [version]
Hi folks,
@@ -296,7 +296,7 @@ Remove the old releases from SVN under
<https://dist.apache.org/repos/dist/relea
- Merge the release branch to `main` to start the website deployment.
- Check, that the website is deployed successfully.
-- Instruction on how to do so can be found on
<https://github.com/apache/incubator-stormcrawler-site>
+- Instructions on how to do so can be found on
<https://github.com/apache/stormcrawler-site>
### Make the release on Github
@@ -310,18 +310,18 @@ Remove the old releases from SVN under
<https://dist.apache.org/repos/dist/relea
- It needs to be sent from your **@apache.org** email address or the email
will bounce from the announce list.
```bash
-Title: [ANNOUNCE] Apache StormCrawler (Incubating) <version> released
+Title: [ANNOUNCE] Apache StormCrawler <version> released
TO: [email protected], [email protected],
[email protected]
----
Message body:
----
-The Apache StormCrawler (Incubating) team is pleased to announce the release
of version <version> of Apache StormCrawler.
+The Apache StormCrawler team is pleased to announce the release of version
<version> of Apache StormCrawler.
StormCrawler is a collection of resources for building low-latency,
customisable and scalable web crawlers on Apache Storm.
-Apache StormCrawler (Incubating) <version> source distributions is available
for download from our download page:
https://stormcrawler.apache.org/download/index.html
-Apache StormCrawler (Incubating) is distributed by Maven Central as well.
+Apache StormCrawler <version> source distribution is available for download
from our download page: https://stormcrawler.apache.org/download/index.html
+Apache StormCrawler is distributed by Maven Central as well.
Changes in this version:
diff --git a/core/pom.xml b/core/pom.xml
index 3577ad49..5724f4d6 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -32,7 +32,7 @@ under the License.
<packaging>jar</packaging>
<name>stormcrawler-core</name>
-
<url>https://github.com/apache/incubator-stormcrawler/tree/master/core</url>
+ <url>https://github.com/apache/stormcrawler/tree/master/core</url>
<description>StormCrawler core Java API.</description>
<properties>
diff --git a/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java
b/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java
index 3f1477d1..ba3e8ce9 100644
--- a/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java
+++ b/core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java
@@ -509,7 +509,7 @@ public class FetcherBolt extends StatusEmitterBolt {
metadata = new Metadata();
}
- // https://github.com/apache/incubator-stormcrawler/issues/813
+ // https://github.com/apache/stormcrawler/issues/813
metadata.remove("fetch.exception");
boolean asap = false;
@@ -568,7 +568,7 @@ public class FetcherBolt extends StatusEmitterBolt {
}
// has found sitemaps
- //
https://github.com/apache/incubator-stormcrawler/issues/710
+ // https://github.com/apache/stormcrawler/issues/710
// note: we don't care if the sitemap URLs where actually
// kept
boolean foundSitemap = (rules.getSitemaps().size() > 0);
@@ -732,7 +732,7 @@ public class FetcherBolt extends StatusEmitterBolt {
mergedMD.setValue("_redirTo", redirection);
}
- //
https://github.com/apache/incubator-stormcrawler/issues/954
+ // https://github.com/apache/stormcrawler/issues/954
if (allowRedirs() &&
StringUtils.isNotBlank(redirection)) {
emitOutlink(fit.t, url, redirection, mergedMD);
}
diff --git
a/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java
b/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java
index 17214a4d..01d9d797 100644
--- a/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java
+++ b/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java
@@ -347,7 +347,7 @@ public class JSoupParserBolt extends StatusEmitterBolt {
LOG.info("Found redir in {} to {}", url, redirection);
metadata.setValue("_redirTo", redirection);
- //
https://github.com/apache/incubator-stormcrawler/issues/954
+ // https://github.com/apache/stormcrawler/issues/954
if (allowRedirs() && StringUtils.isNotBlank(redirection)) {
emitOutlink(tuple, new URL(url), redirection,
metadata);
}
diff --git
a/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java
b/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java
index 0f783d78..11aefaf2 100644
--- a/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java
+++ b/core/src/main/java/org/apache/stormcrawler/bolt/SimpleFetcherBolt.java
@@ -256,7 +256,7 @@ public class SimpleFetcherBolt extends StatusEmitterBolt {
metadata = new Metadata();
}
- // https://github.com/apache/incubator-stormcrawler/issues/813
+ // https://github.com/apache/stormcrawler/issues/813
metadata.remove("fetch.exception");
URL url;
@@ -326,7 +326,7 @@ public class SimpleFetcherBolt extends StatusEmitterBolt {
}
// has found sitemaps
- // https://github.com/apache/incubator-stormcrawler/issues/710
+ // https://github.com/apache/stormcrawler/issues/710
// note: we don't care if the sitemap URLs where actually
// kept
boolean foundSitemap = (rules.getSitemaps().size() > 0);
diff --git
a/core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java
b/core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java
index 629bc976..d6535546 100644
---
a/core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java
+++
b/core/src/main/java/org/apache/stormcrawler/filtering/basic/BasicURLNormalizer.java
@@ -50,7 +50,7 @@ public class BasicURLNormalizer extends URLFilter {
/** Nutch 1098 - finds URL encoded parts of the URL */
private static final Pattern unescapeRulePattern =
Pattern.compile("%([0-9A-Fa-f]{2})");
- /** https://github.com/apache/incubator-stormcrawler/issues/401 * */
+ /** https://github.com/apache/stormcrawler/issues/401 * */
private static final Pattern illegalEscapePattern =
Pattern.compile("%u([0-9A-Fa-f]{4})");
// charset used for encoding URLs before escaping
diff --git
a/core/src/main/java/org/apache/stormcrawler/filtering/regex/FastURLFilter.java
b/core/src/main/java/org/apache/stormcrawler/filtering/regex/FastURLFilter.java
index 230796ac..b2391b95 100644
---
a/core/src/main/java/org/apache/stormcrawler/filtering/regex/FastURLFilter.java
+++
b/core/src/main/java/org/apache/stormcrawler/filtering/regex/FastURLFilter.java
@@ -112,7 +112,7 @@ public class FastURLFilter extends URLFilter implements
JSONResource {
// if it contains a single object
// jump directly to its content
- // https://github.com/apache/incubator-stormcrawler/issues/1013
+ // https://github.com/apache/stormcrawler/issues/1013
if (rootNode.size() == 1 && rootNode.isObject()) {
rootNode = rootNode.fields().next().getValue();
}
diff --git
a/core/src/main/java/org/apache/stormcrawler/filtering/sitemap/SitemapFilter.java
b/core/src/main/java/org/apache/stormcrawler/filtering/sitemap/SitemapFilter.java
index 498b7378..920caf54 100644
---
a/core/src/main/java/org/apache/stormcrawler/filtering/sitemap/SitemapFilter.java
+++
b/core/src/main/java/org/apache/stormcrawler/filtering/sitemap/SitemapFilter.java
@@ -39,8 +39,8 @@ import org.jetbrains.annotations.Nullable;
* </pre>
*
* <p>Will be replaced by <a href=
- *
"https://github.com/apache/incubator-stormcrawler/issues/711">MetadataFilter to
filter based on
- * multiple key values</a>
+ * "https://github.com/apache/stormcrawler/issues/711">MetadataFilter to
filter based on multiple
+ * key values</a>
*
* @since 1.14
*/
diff --git
a/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java
b/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java
index 44d7a89f..cc96c877 100644
---
a/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java
+++
b/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java
@@ -207,7 +207,7 @@ public abstract class AbstractStatusUpdaterBolt extends
BaseRichBolt {
if (!status.equals(Status.FETCH_ERROR)) {
metadata.remove(Constants.fetchErrorCountParamName);
}
- // https://github.com/apache/incubator-stormcrawler/issues/415
+ // https://github.com/apache/stormcrawler/issues/415
// remove error related key values in case of success
if (status.equals(Status.FETCHED) ||
status.equals(Status.REDIRECTION)) {
metadata.remove(Constants.STATUS_ERROR_CAUSE);
diff --git
a/core/src/main/java/org/apache/stormcrawler/protocol/ProtocolResponse.java
b/core/src/main/java/org/apache/stormcrawler/protocol/ProtocolResponse.java
index b79163d8..e5ea0584 100644
--- a/core/src/main/java/org/apache/stormcrawler/protocol/ProtocolResponse.java
+++ b/core/src/main/java/org/apache/stormcrawler/protocol/ProtocolResponse.java
@@ -58,7 +58,7 @@ public class ProtocolResponse {
/**
* @since 1.17
- * @see <a
href="https://github.com/apache/incubator-stormcrawler/issues/776">Issue 776</a>
+ * @see <a href="https://github.com/apache/stormcrawler/issues/776">Issue
776</a>
*/
public static final String PROTOCOL_MD_PREFIX_PARAM = "protocol.md.prefix";
diff --git
a/core/src/main/java/org/apache/stormcrawler/util/CharsetIdentification.java
b/core/src/main/java/org/apache/stormcrawler/util/CharsetIdentification.java
index 1ef8a712..7cab3323 100644
--- a/core/src/main/java/org/apache/stormcrawler/util/CharsetIdentification.java
+++ b/core/src/main/java/org/apache/stormcrawler/util/CharsetIdentification.java
@@ -186,7 +186,7 @@ public class CharsetIdentification {
int start = html.indexOf("<meta charset=\"");
if (start != -1) {
int end = html.indexOf('"', start + 15);
- // https://github.com/apache/incubator-stormcrawler/issues/870
+ // https://github.com/apache/stormcrawler/issues/870
// try on a slightly larger section of text if it is trimmed
if (end == -1 && ((maxlength + 10) < buffer.length)) {
return getCharsetFromMeta(buffer, maxlength + 10);
diff --git
a/core/src/test/java/org/apache/stormcrawler/filtering/BasicURLNormalizerTest.java
b/core/src/test/java/org/apache/stormcrawler/filtering/BasicURLNormalizerTest.java
index 250ea401..65da7630 100644
---
a/core/src/test/java/org/apache/stormcrawler/filtering/BasicURLNormalizerTest.java
+++
b/core/src/test/java/org/apache/stormcrawler/filtering/BasicURLNormalizerTest.java
@@ -289,7 +289,7 @@ class BasicURLNormalizerTest {
assertEquals(expectedResult, normalizedUrl, "Failed to filter query
string");
}
- // https://github.com/apache/incubator-stormcrawler/issues/401
+ // https://github.com/apache/stormcrawler/issues/401
@Test
void testNonStandardPercentEncoding() throws MalformedURLException {
URLFilter urlFilter = createFilter(false, false);
diff --git
a/core/src/test/java/org/apache/stormcrawler/jsoup/JSoupFiltersTest.java
b/core/src/test/java/org/apache/stormcrawler/jsoup/JSoupFiltersTest.java
index ff1199de..fb1b2e42 100644
--- a/core/src/test/java/org/apache/stormcrawler/jsoup/JSoupFiltersTest.java
+++ b/core/src/test/java/org/apache/stormcrawler/jsoup/JSoupFiltersTest.java
@@ -58,7 +58,7 @@ class JSoupFiltersTest extends ParsingTester {
}
@Test
- // https://github.com/apache/incubator-stormcrawler/issues/219
+ // https://github.com/apache/stormcrawler/issues/219
void testScriptExtraction() throws IOException {
prepareParserBolt("test.jsoupfilters.json");
parse("https://stormcrawler.apache.org",
"stormcrawler.apache.org.html");
diff --git
a/core/src/test/java/org/apache/stormcrawler/parse/StackOverflowTest.java
b/core/src/test/java/org/apache/stormcrawler/parse/StackOverflowTest.java
index eed28321..1f1e9f35 100644
--- a/core/src/test/java/org/apache/stormcrawler/parse/StackOverflowTest.java
+++ b/core/src/test/java/org/apache/stormcrawler/parse/StackOverflowTest.java
@@ -28,7 +28,7 @@ import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
/**
- * @see https://github.com/apache/incubator-stormcrawler/pull/653 *
+ * @see https://github.com/apache/stormcrawler/pull/653 *
*/
class StackOverflowTest extends ParsingTester {
@@ -47,7 +47,7 @@ class StackOverflowTest extends ParsingTester {
}
/**
- * @see https://github.com/apache/incubator-stormcrawler/issues/666
+ * @see https://github.com/apache/stormcrawler/issues/666
*/
@Test
void testNamespaceExtraction() throws IOException {
diff --git
a/core/src/test/java/org/apache/stormcrawler/parse/filter/XPathFilterTest.java
b/core/src/test/java/org/apache/stormcrawler/parse/filter/XPathFilterTest.java
index d31d1d9b..c05cc3dd 100644
---
a/core/src/test/java/org/apache/stormcrawler/parse/filter/XPathFilterTest.java
+++
b/core/src/test/java/org/apache/stormcrawler/parse/filter/XPathFilterTest.java
@@ -48,7 +48,7 @@ class XPathFilterTest extends ParsingTester {
}
@Test
- // https://github.com/apache/incubator-stormcrawler/issues/219
+ // https://github.com/apache/stormcrawler/issues/219
void testScriptExtraction() throws IOException {
prepareParserBolt("test.parsefilters.json");
parse("https://stormcrawler.apache.org",
"stormcrawler.apache.org.html");
diff --git a/core/src/test/resources/longtext.html
b/core/src/test/resources/longtext.html
index d4005d40..758ccc51 100644
--- a/core/src/test/resources/longtext.html
+++ b/core/src/test/resources/longtext.html
@@ -24,13 +24,13 @@ under the License.
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
- <title>Apache StormCrawler (Incubating)</title>
- <meta name="description" content="Apache StormCrawler (Incubating) is
collection of resources for building low-latency, scalable web crawlers on
Apache Storm
+ <title>Apache StormCrawler </title>
+ <meta name="description" content="Apache StormCrawler is collection of
resources for building low-latency, scalable web crawlers on Apache Storm
">
<link rel="stylesheet" href="/css/main.css">
<link rel="canonical" href="https://stormcrawler.apache.org/">
- <link rel="alternate" type="application/rss+xml" title="Apache
StormCrawler (Incubating)" href="https://stormcrawler.apache.org/feed.xml">
+ <link rel="alternate" type="application/rss+xml" title="Apache
StormCrawler " href="https://stormcrawler.apache.org/feed.xml">
<link rel="icon" type="/image/png" href="/img/favicon.png" />
</head>
@@ -40,7 +40,7 @@ under the License.
<header class="site-header">
<div class="site-header__wrap">
<div class="site-header__logo">
- <a href="/"><img src="/img/incubator_logo.png" alt="Apache
StormCrawler (Incubating)"></a>
+ <a href="/"><img src="/img/incubator_logo.png" alt="Apache
StormCrawler "></a>
</div>
</div>
</header>
@@ -48,7 +48,7 @@ under the License.
<ul>
<li><a href="/index.html">Home</a>
<li><a href="/download/index.html">Download</a>
- <li><a href="https://github.com/apache/incubator-stormcrawler">Source
Code</a></li>
+ <li><a href="https://github.com/apache/stormcrawler">Source
Code</a></li>
<li><a href="/getting-started/">Getting Started</a></li>
<li><a
href="https://javadoc.io/doc/org.apache.stormcrawler/stormcrawler-core/3.1.0/index.html">JavaDocs</a>
<li><a href="/faq/">FAQ</a></li>
@@ -63,8 +63,8 @@ under the License.
</div>
</div>
<div class="row row-col">
- <p><strong>Apache StormCrawler (Incubating)</strong> is an open source
SDK for building distributed web crawlers based on <a
href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache
license v2 and consists of a collection of reusable resources and components,
written mostly in Java.</p>
- <p>The aim of Apache StormCrawler (Incubating) is to help build web
crawlers that are :</p>
+ <p><strong>Apache StormCrawler </strong> is an open source SDK for
building distributed web crawlers based on <a
href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache
license v2 and consists of a collection of reusable resources and components,
written mostly in Java.</p>
+ <p>The aim of Apache StormCrawler is to help build web crawlers that
are :</p>
<ul>
<li>scalable</li>
<li>resilient</li>
@@ -72,10 +72,10 @@ under the License.
<li>easy to extend</li>
<li>polite yet efficient</li>
</ul>
- <p><strong>Apache StormCrawler (Incubating)</strong> is a library and
collection of resources that developers can leverage to build their own
crawlers. The good news is that doing so can be pretty straightforward! Have a
look at the <a href="getting-started/">Getting Started</a> section for more
details.</p>
- <p>Apart from the core components, we provide some <a
href="https://github.com/apache/incubator-stormcrawler/tree/main/external">external
resources</a> that you can reuse in your project, like for instance our spout
and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt
which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various
document formats.</p>
- <p><strong>Apache StormCrawler (Incubating)</strong> is perfectly
suited to use cases where the URL to fetch and parse come as streams but is
also an appropriate solution for large scale recursive crawls, particularly
where low latency is required. The project is used in production by <a
href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many
organisations</a> and is actively developed and maintained.</p>
- <p>The <a
href="https://github.com/apache/incubator-stormcrawler/wiki/Presentations">Presentations</a>
page contains links to some recent presentations made about this project.</p>
+ <p><strong>Apache StormCrawler </strong> is a library and collection
of resources that developers can leverage to build their own crawlers. The good
news is that doing so can be pretty straightforward! Have a look at the <a
href="getting-started/">Getting Started</a> section for more details.</p>
+ <p>Apart from the core components, we provide some <a
href="https://github.com/apache/stormcrawler/tree/main/external">external
resources</a> that you can reuse in your project, like for instance our spout
and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt
which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various
document formats.</p>
+ <p><strong>Apache StormCrawler </strong> is perfectly suited to use
cases where the URL to fetch and parse come as streams but is also an
appropriate solution for large scale recursive crawls, particularly where low
latency is required. The project is used in production by <a
href="https://github.com/apache/stormcrawler/wiki/Powered-By">many
organisations</a> and is actively developed and maintained.</p>
+ <p>The <a
href="https://github.com/apache/stormcrawler/wiki/Presentations">Presentations</a>
page contains links to some recent presentations made about this project.</p>
</div>
<div class="row row-col">
@@ -94,7 +94,7 @@ under the License.
<img src="/img/polecat.svg" alt="Polecat" height=70>
</a>
<br>
- <a
href="http://github.com/apache/incubator-stormcrawler/wiki/Powered-By">and many
more...</a>
+ <a
href="http://github.com/apache/stormcrawler/wiki/Powered-By">and many
more...</a>
</div>
<article>
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam
nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam
voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita
kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem
ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et ju [...]
diff --git a/core/src/test/resources/stackexception.html
b/core/src/test/resources/stackexception.html
index c17ea9db..cb7e0395 100644
--- a/core/src/test/resources/stackexception.html
+++ b/core/src/test/resources/stackexception.html
@@ -24,13 +24,13 @@ under the License.
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
- <title>Apache StormCrawler (Incubating)</title>
- <meta name="description" content="Apache StormCrawler (Incubating) is
collection of resources for building low-latency, scalable web crawlers on
Apache Storm
+ <title>Apache StormCrawler </title>
+ <meta name="description" content="Apache StormCrawler is collection of
resources for building low-latency, scalable web crawlers on Apache Storm
">
<link rel="stylesheet" href="/css/main.css">
<link rel="canonical" href="https://stormcrawler.apache.org/">
- <link rel="alternate" type="application/rss+xml" title="Apache
StormCrawler (Incubating)" href="https://stormcrawler.apache.org/feed.xml">
+ <link rel="alternate" type="application/rss+xml" title="Apache
StormCrawler " href="https://stormcrawler.apache.org/feed.xml">
<link rel="icon" type="/image/png" href="/img/favicon.png" />
</head>
@@ -40,7 +40,7 @@ under the License.
<header class="site-header">
<div class="site-header__wrap">
<div class="site-header__logo">
- <a href="/"><img src="/img/incubator_logo.png" alt="Apache
StormCrawler (Incubating)"></a>
+ <a href="/"><img src="/img/incubator_logo.png" alt="Apache
StormCrawler "></a>
</div>
</div>
</header>
@@ -48,7 +48,7 @@ under the License.
<ul>
<li><a href="/index.html">Home</a>
<li><a href="/download/index.html">Download</a>
- <li><a href="https://github.com/apache/incubator-stormcrawler">Source
Code</a></li>
+ <li><a href="https://github.com/apache/stormcrawler">Source
Code</a></li>
<li><a href="/getting-started/">Getting Started</a></li>
<li><a
href="https://javadoc.io/doc/org.apache.stormcrawler/stormcrawler-core/3.1.0/index.html">JavaDocs</a>
<li><a href="/faq/">FAQ</a></li>
@@ -63,8 +63,8 @@ under the License.
</div>
</div>
<div class="row row-col">
- <p><strong>Apache StormCrawler (Incubating)</strong> is an open source
SDK for building distributed web crawlers based on <a
href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache
license v2 and consists of a collection of reusable resources and components,
written mostly in Java.</p>
- <p>The aim of Apache StormCrawler (Incubating) is to help build web
crawlers that are :</p>
+ <p><strong>Apache StormCrawler </strong> is an open source SDK for
building distributed web crawlers based on <a
href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache
license v2 and consists of a collection of reusable resources and components,
written mostly in Java.</p>
+ <p>The aim of Apache StormCrawler is to help build web crawlers that
are :</p>
<ul>
<li>scalable</li>
<li>resilient</li>
@@ -72,10 +72,10 @@ under the License.
<li>easy to extend</li>
<li>polite yet efficient</li>
</ul>
- <p><strong>Apache StormCrawler (Incubating)</strong> is a library and
collection of resources that developers can leverage to build their own
crawlers. The good news is that doing so can be pretty straightforward! Have a
look at the <a href="getting-started/">Getting Started</a> section for more
details.</p>
- <p>Apart from the core components, we provide some <a
href="https://github.com/apache/incubator-stormcrawler/tree/main/external">external
resources</a> that you can reuse in your project, like for instance our spout
and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt
which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various
document formats.</p>
- <p><strong>Apache StormCrawler (Incubating)</strong> is perfectly
suited to use cases where the URL to fetch and parse come as streams but is
also an appropriate solution for large scale recursive crawls, particularly
where low latency is required. The project is used in production by <a
href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many
organisations</a> and is actively developed and maintained.</p>
- <p>The <a
href="https://github.com/apache/incubator-stormcrawler/wiki/Presentations">Presentations</a>
page contains links to some recent presentations made about this project.</p>
+ <p><strong>Apache StormCrawler </strong> is a library and collection
of resources that developers can leverage to build their own crawlers. The good
news is that doing so can be pretty straightforward! Have a look at the <a
href="getting-started/">Getting Started</a> section for more details.</p>
+ <p>Apart from the core components, we provide some <a
href="https://github.com/apache/stormcrawler/tree/main/external">external
resources</a> that you can reuse in your project, like for instance our spout
and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt
which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various
document formats.</p>
+ <p><strong>Apache StormCrawler </strong> is perfectly suited to use
cases where the URL to fetch and parse come as streams but is also an
appropriate solution for large scale recursive crawls, particularly where low
latency is required. The project is used in production by <a
href="https://github.com/apache/stormcrawler/wiki/Powered-By">many
organisations</a> and is actively developed and maintained.</p>
+ <p>The <a
href="https://github.com/apache/stormcrawler/wiki/Presentations">Presentations</a>
page contains links to some recent presentations made about this project.</p>
</div>
<div class="row row-col">
@@ -94,7 +94,7 @@ under the License.
<img src="/img/polecat.svg" alt="Polecat" height=70>
</a>
<br>
- <a
href="http://github.com/apache/incubator-stormcrawler/wiki/Powered-By">and many
more...</a>
+ <a
href="http://github.com/apache/stormcrawler/wiki/Powered-By">and many
more...</a>
</div>
</div>
diff --git a/core/src/test/resources/stormcrawler.apache.org.html
b/core/src/test/resources/stormcrawler.apache.org.html
index 455fa938..7d9f2744 100644
--- a/core/src/test/resources/stormcrawler.apache.org.html
+++ b/core/src/test/resources/stormcrawler.apache.org.html
@@ -24,12 +24,12 @@ under the License.
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
- <title>Apache StormCrawler (Incubating)</title>
- <meta name="description" content="Apache StormCrawler (Incubating) is
collection of resources for building low-latency, scalable web crawlers on
Apache Storm">
+ <title>Apache StormCrawler </title>
+ <meta name="description" content="Apache StormCrawler is collection of
resources for building low-latency, scalable web crawlers on Apache Storm">
<meta name="keywords" content="crawl, information extraction, information
retrieval, NLP, IR, IE, nutch, solr">
<link rel="stylesheet" href="/css/main.css">
<link rel="canonical" href="https://stormcrawler.apache.org/">
- <link rel="alternate" type="application/rss+xml" title="Apache
StormCrawler (Incubating)" href="https://stormcrawler.apache.org/feed.xml">
+ <link rel="alternate" type="application/rss+xml" title="Apache
StormCrawler " href="https://stormcrawler.apache.org/feed.xml">
<link rel="icon" type="/image/png" href="/img/favicon.png" />
</head>
@@ -53,7 +53,7 @@ under the License.
<header class="site-header">
<div class="site-header__wrap">
<div class="site-header__logo">
- <a href="/"><img src="/img/incubator_logo.png" alt="Apache
StormCrawler (Incubating)"></a>
+ <a href="/"><img src="/img/incubator_logo.png" alt="Apache
StormCrawler "></a>
</div>
</div>
</header>
@@ -61,7 +61,7 @@ under the License.
<ul>
<li><a href="/index.html">Home</a>
<li><a href="/download/index.html">Download</a>
- <li><a href="https://github.com/apache/incubator-stormcrawler">Source
Code</a></li>
+ <li><a href="https://github.com/apache/stormcrawler">Source
Code</a></li>
<li><a href="/getting-started/">Getting Started</a></li>
<li><a
href="https://javadoc.io/doc/org.apache.stormcrawler/stormcrawler-core/3.1.0/index.html">JavaDocs</a>
<li><a href="/faq/">FAQ</a></li>
@@ -76,8 +76,8 @@ under the License.
</div>
</div>
<div class="row row-col">
- <p><strong><span class="concept">Apache StormCrawler
(Incubating)</span></strong> is an open source SDK for building distributed web
crawlers based on <a href="http://storm.apache.org">Apache Storm®</a>. The
project is under Apache license v2 and consists of a collection of reusable
resources and components, written mostly in Java.</p>
- <p>The aim of Apache StormCrawler (Incubating) is to help build web
crawlers that are :</p>
+ <p><strong><span class="concept">Apache StormCrawler </span></strong>
is an open source SDK for building distributed web crawlers based on <a
href="http://storm.apache.org">Apache Storm®</a>. The project is under Apache
license v2 and consists of a collection of reusable resources and components,
written mostly in Java.</p>
+ <p>The aim of Apache StormCrawler is to help build web crawlers that
are :</p>
<ul>
<li>scalable</li>
<li>resilient</li>
@@ -85,10 +85,10 @@ under the License.
<li>easy to extend</li>
<li>polite yet efficient</li>
</ul>
- <p><strong>Apache StormCrawler (Incubating)</strong> is a library and
collection of resources that developers can leverage to build their own
crawlers. The good news is that doing so can be pretty straightforward! Have a
look at the <a href="getting-started/">Getting Started</a> section for more
details.</p>
- <p>Apart from the core components, we provide some <a
href="https://github.com/apache/incubator-stormcrawler/tree/main/external">external
resources</a> that you can reuse in your project, like for instance our spout
and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt
which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various
document formats.</p>
- <p><strong>Apache StormCrawler (Incubating)</strong> is perfectly
suited to use cases where the URL to fetch and parse come as streams but is
also an appropriate solution for large scale recursive crawls, particularly
where low latency is required. The project is used in production by <a
href="https://github.com/apache/incubator-stormcrawler/wiki/Powered-By">many
organisations</a> and is actively developed and maintained.</p>
- <p>The <a
href="https://github.com/apache/incubator-stormcrawler/wiki/Presentations">Presentations</a>
page contains links to some recent presentations made about this project.</p>
+ <p><strong>Apache StormCrawler </strong> is a library and collection
of resources that developers can leverage to build their own crawlers. The good
news is that doing so can be pretty straightforward! Have a look at the <a
href="getting-started/">Getting Started</a> section for more details.</p>
+ <p>Apart from the core components, we provide some <a
href="https://github.com/apache/stormcrawler/tree/main/external">external
resources</a> that you can reuse in your project, like for instance our spout
and bolts for <a href="https://opensearch.org/">OpenSearch®</a> or a ParserBolt
which uses <a href="http://tika.apache.org">Apache Tika®</a> to parse various
document formats.</p>
+ <p><strong>Apache StormCrawler </strong> is perfectly suited to use
cases where the URL to fetch and parse come as streams but is also an
appropriate solution for large scale recursive crawls, particularly where low
latency is required. The project is used in production by <a
href="https://github.com/apache/stormcrawler/wiki/Powered-By">many
organisations</a> and is actively developed and maintained.</p>
+ <p>The <a
href="https://github.com/apache/stormcrawler/wiki/Presentations">Presentations</a>
page contains links to some recent presentations made about this project.</p>
</div>
<div class="row row-col">
@@ -107,7 +107,7 @@ under the License.
<img src="/img/polecat.svg" alt="Polecat" height=70>
</a>
<br>
- <a
href="http://github.com/apache/incubator-stormcrawler/wiki/Powered-By">and many
more...</a>
+ <a
href="http://github.com/apache/stormcrawler/wiki/Powered-By">and many
more...</a>
</div>
</div>
diff --git a/external/aws/pom.xml b/external/aws/pom.xml
index a902e49a..d6201542 100644
--- a/external/aws/pom.xml
+++ b/external/aws/pom.xml
@@ -33,7 +33,7 @@ under the License.
<packaging>jar</packaging>
<name>stormcrawler-aws</name>
-
<url>https://github.com/apache/incubator-stormcrawler/tree/master/external/aws</url>
+
<url>https://github.com/apache/stormcrawler/tree/master/external/aws</url>
<description>AWS resources for StormCrawler</description>
<properties>
diff --git a/external/langid/pom.xml b/external/langid/pom.xml
index bfb29296..118d7cd7 100644
--- a/external/langid/pom.xml
+++ b/external/langid/pom.xml
@@ -33,7 +33,7 @@ under the License.
<packaging>jar</packaging>
<name>stormcrawler-langid</name>
-
<url>https://github.com/apache/incubator-stormcrawler/tree/master/external/langid</url>
+
<url>https://github.com/apache/stormcrawler/tree/master/external/langid</url>
<description>Language Identification for StormCrawler</description>
<dependencies>
diff --git a/external/opensearch/README.md b/external/opensearch/README.md
index 2cc5e2f3..63faf71c 100644
--- a/external/opensearch/README.md
+++ b/external/opensearch/README.md
@@ -2,10 +2,10 @@ stormcrawler-opensearch
===========================
A collection of resources for [OpenSearch](https://opensearch.org/):
-*
[IndexerBolt](https://github.com/apache/incubator-stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java)
for indexing documents crawled with StormCrawler
-*
[Spouts](https://github.com/apache/incubator-stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/AggregationSpout.java)
and
[StatusUpdaterBolt](https://github.com/apache/incubator-stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java)
for persisting URL information in recursive crawls
-*
[MetricsConsumer](https://github.com/apache/incubator-stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/metrics/MetricsConsumer.java)
-*
[StatusMetricsBolt](https://github.com/apache/incubator-stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/metrics/StatusMetricsBolt.java)
for sending the breakdown of URLs per status as metrics and display its
evolution over time.
+*
[IndexerBolt](https://github.com/apache/stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java)
for indexing documents crawled with StormCrawler
+*
[Spouts](https://github.com/apache/stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/AggregationSpout.java)
and
[StatusUpdaterBolt](https://github.com/apache/stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java)
for persisting URL information in recursive crawls
+*
[MetricsConsumer](https://github.com/apache/stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/metrics/MetricsConsumer.java)
+*
[StatusMetricsBolt](https://github.com/apache/stormcrawler/blob/master/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/metrics/StatusMetricsBolt.java)
for sending the breakdown of URLs per status as metrics and display its
evolution over time.
as well as resources for building basic real-time monitoring dashboards for
the crawls, see below.
diff --git
a/external/opensearch/archetype/src/main/resources/archetype-resources/README.md
b/external/opensearch/archetype/src/main/resources/archetype-resources/README.md
index 526788f4..98825846 100644
---
a/external/opensearch/archetype/src/main/resources/archetype-resources/README.md
+++
b/external/opensearch/archetype/src/main/resources/archetype-resources/README.md
@@ -60,7 +60,7 @@ The file _storm.ndjson_ is used to display some of Storm's
internal metrics and
-Happy crawling! If you have any questions, please ask on [StackOverflow with
the tag stormcrawler](http://stackoverflow.com/questions/tagged/stormcrawler)
or the
[discussions](https://github.com/apache/incubator-stormcrawler/discussions)
section on GitHub.
+Happy crawling! If you have any questions, please ask on [StackOverflow with
the tag stormcrawler](http://stackoverflow.com/questions/tagged/stormcrawler)
or the [discussions](https://github.com/apache/stormcrawler/discussions)
section on GitHub.
diff --git a/external/opensearch/pom.xml b/external/opensearch/pom.xml
index d3dfcb9f..a5aa711d 100644
--- a/external/opensearch/pom.xml
+++ b/external/opensearch/pom.xml
@@ -45,7 +45,7 @@ under the License.
<name>stormcrawler-opensearch</name>
<url>
-
https://github.com/apache/incubator-stormcrawler/tree/master/external/opensearch</url>
+
https://github.com/apache/stormcrawler/tree/master/external/opensearch</url>
<description>Opensearch resources for StormCrawler</description>
<build>
diff --git
a/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/DeletionBolt.java
b/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/DeletionBolt.java
index d90c4c69..2fee97fe 100644
---
a/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/DeletionBolt.java
+++
b/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/DeletionBolt.java
@@ -196,7 +196,7 @@ public class DeletionBolt extends BaseRichBolt
return new
BulkItemResponseToFailedFlag(bir, failed);
})
.collect(
- //
https://github.com/apache/incubator-stormcrawler/issues/832
+ //
https://github.com/apache/stormcrawler/issues/832
Collectors.groupingBy(
idWithFailedFlagTuple ->
idWithFailedFlagTuple.id,
Collectors.toUnmodifiableList()));
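The stream-collection pattern whose comment link this hunk updates (and the identical hunks in IndexerBolt and StatusUpdaterBolt) groups bulk responses by document id. A minimal self-contained sketch of that grouping, using a hypothetical record in place of StormCrawler's `BulkItemResponseToFailedFlag`:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByIdSketch {
    // Stand-in for BulkItemResponseToFailedFlag: a document id plus a failed flag.
    public record IdWithFailedFlag(String id, boolean failed) {}

    // Group items by id into unmodifiable lists, mirroring the
    // Collectors.groupingBy(..., Collectors.toUnmodifiableList()) call in the hunk.
    public static Map<String, List<IdWithFailedFlag>> groupById(List<IdWithFailedFlag> items) {
        return items.stream()
                .collect(Collectors.groupingBy(
                        item -> item.id(),
                        Collectors.toUnmodifiableList()));
    }

    public static void main(String[] args) {
        List<IdWithFailedFlag> items = List.of(
                new IdWithFailedFlag("a", false),
                new IdWithFailedFlag("a", true),
                new IdWithFailedFlag("b", false));
        // Two responses share id "a", so they end up in the same bucket.
        System.out.println(groupById(items).get("a").size());
    }
}
```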
diff --git
a/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java
b/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java
index 183bf15e..f03efc67 100644
---
a/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java
+++
b/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/bolt/IndexerBolt.java
@@ -306,7 +306,7 @@ public class IndexerBolt extends AbstractIndexerBolt
return new
BulkItemResponseToFailedFlag(bir, failed);
})
.collect(
- //
https://github.com/apache/incubator-stormcrawler/issues/832
+ //
https://github.com/apache/stormcrawler/issues/832
Collectors.groupingBy(
idWithFailedFlagTuple ->
idWithFailedFlagTuple.id,
Collectors.toUnmodifiableList()));
diff --git
a/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java
b/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java
index a7708db3..d4ce11ca 100644
---
a/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java
+++
b/external/opensearch/src/main/java/org/apache/stormcrawler/opensearch/persistence/StatusUpdaterBolt.java
@@ -339,7 +339,7 @@ public class StatusUpdaterBolt extends
AbstractStatusUpdaterBolt
return new
BulkItemResponseToFailedFlag(bir, failed);
})
.collect(
- //
https://github.com/apache/incubator-stormcrawler/issues/832
+ //
https://github.com/apache/stormcrawler/issues/832
Collectors.groupingBy(
idWithFailedFlagTuple ->
idWithFailedFlagTuple.id,
Collectors.toUnmodifiableList()));
diff --git
a/external/opensearch/src/test/java/org/apache/stormcrawler/opensearch/bolt/IndexerBoltTest.java
b/external/opensearch/src/test/java/org/apache/stormcrawler/opensearch/bolt/IndexerBoltTest.java
index a53047da..cf6203e1 100644
---
a/external/opensearch/src/test/java/org/apache/stormcrawler/opensearch/bolt/IndexerBoltTest.java
+++
b/external/opensearch/src/test/java/org/apache/stormcrawler/opensearch/bolt/IndexerBoltTest.java
@@ -114,7 +114,7 @@ class IndexerBoltTest extends AbstractOpenSearchTest {
@Test
@Timeout(value = 2, unit = TimeUnit.MINUTES)
- // https://github.com/apache/incubator-stormcrawler/issues/832
+ // https://github.com/apache/stormcrawler/issues/832
void simultaneousCanonicals()
throws ExecutionException, InterruptedException, TimeoutException {
Metadata m1 = new Metadata();
diff --git
a/external/opensearch/src/test/java/org/apache/stormcrawler/opensearch/bolt/StatusBoltTest.java
b/external/opensearch/src/test/java/org/apache/stormcrawler/opensearch/bolt/StatusBoltTest.java
index 6e738b0c..b95b5838 100644
---
a/external/opensearch/src/test/java/org/apache/stormcrawler/opensearch/bolt/StatusBoltTest.java
+++
b/external/opensearch/src/test/java/org/apache/stormcrawler/opensearch/bolt/StatusBoltTest.java
@@ -129,7 +129,7 @@ class StatusBoltTest extends AbstractOpenSearchTest {
@Test
@Timeout(value = 2, unit = TimeUnit.MINUTES)
- // see https://github.com/apache/incubator-stormcrawler/issues/885
+ // see https://github.com/apache/stormcrawler/issues/885
void checkListKeyFromOpensearch()
throws IOException, ExecutionException, InterruptedException,
TimeoutException {
String url = "https://www.url.net/something";
diff --git a/external/playwright/README.md b/external/playwright/README.md
index b0d6fdb4..4cbb1d13 100644
--- a/external/playwright/README.md
+++ b/external/playwright/README.md
@@ -1,5 +1,5 @@
# Playwright
-Protocol implementation for Apache StormCrawler (Incubating) based on
Playwright
+Protocol implementation for Apache StormCrawler based on Playwright
## Standalone Chrome
diff --git a/external/playwright/pom.xml b/external/playwright/pom.xml
index 52900115..d8e74315 100644
--- a/external/playwright/pom.xml
+++ b/external/playwright/pom.xml
@@ -33,7 +33,7 @@ under the License.
<packaging>jar</packaging>
<name>stormcrawler-playwright</name>
-
<url>https://github.com/apache/incubator-stormcrawler/tree/master/external/playwright</url>
+
<url>https://github.com/apache/stormcrawler/tree/master/external/playwright</url>
<description>Playwright-based protocol for StormCrawler</description>
<properties>
diff --git a/external/solr/README.md b/external/solr/README.md
index 85a78c25..edb1722e 100644
--- a/external/solr/README.md
+++ b/external/solr/README.md
@@ -20,11 +20,11 @@ Official references:
## Available resources
-*
[IndexerBolt](https://github.com/apache/incubator-stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/bolt/IndexerBolt.java):
Implementation of
[AbstractIndexerBolt](https://github.com/apache/incubator-stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/AbstractIndexerBolt.java)
that allows to index the parsed data and metadata into a specified Solr
collection.
+*
[IndexerBolt](https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/bolt/IndexerBolt.java):
Implementation of
[AbstractIndexerBolt](https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/indexing/AbstractIndexerBolt.java)
that allows to index the parsed data and metadata into a specified Solr
collection.
-*
[MetricsConsumer](https://github.com/apache/incubator-stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/metrics/MetricsConsumer.java):
Class that allows to store Storm metrics in Solr.
+*
[MetricsConsumer](https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/metrics/MetricsConsumer.java):
Class that allows to store Storm metrics in Solr.
-*
[SolrSpout](https://github.com/apache/incubator-stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/persistence/SolrSpout.java):
Spout that allows to get URLs from a specified Solr collection.
+*
[SolrSpout](https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/persistence/SolrSpout.java):
Spout that allows to get URLs from a specified Solr collection.
-*
[StatusUpdaterBolt](https://github.com/apache/incubator-stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/persistence/StatusUpdaterBolt.java):
Implementation of
[AbstractStatusUpdaterBolt](https://github.com/apache/incubator-stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java)
that allows to store the status of each URL along with the serialized metadata
in Solr.
+*
[StatusUpdaterBolt](https://github.com/apache/stormcrawler/blob/main/external/solr/src/main/java/org/apache/stormcrawler/solr/persistence/StatusUpdaterBolt.java):
Implementation of
[AbstractStatusUpdaterBolt](https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/persistence/AbstractStatusUpdaterBolt.java)
that allows to store the status of each URL along with the serialized metadata
in Solr.
diff --git
a/external/solr/archetype/src/main/resources/archetype-resources/README.md
b/external/solr/archetype/src/main/resources/archetype-resources/README.md
index 291487d6..a367fff1 100644
--- a/external/solr/archetype/src/main/resources/archetype-resources/README.md
+++ b/external/solr/archetype/src/main/resources/archetype-resources/README.md
@@ -51,7 +51,7 @@ solr.status.bucket.field: host
solr.status.bucket.maxsize: 100
```
-This feature can be combined with the [partition
features](https://github.com/apache/incubator-stormcrawler/wiki/Configuration#fetching-and-partitioning)
provided by StormCrawler to balance the crawling process and not just the URL
coverage.
+This feature can be combined with the [partition
features](https://github.com/apache/stormcrawler/wiki/Configuration#fetching-and-partitioning)
provided by StormCrawler to balance the crawling process and not just the URL
coverage.
> It is recommended to use Solr in cloud mode. The following configuration
> options are available for distributing the `status` collection across
> multiple shards.
> * `solr.status.routing.fieldname`: Field to be used for routing documents to
> different shards. The values depend on the `partition.url.mode` (`byHost`,
> `byDomain`, `byIP`)
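Taken together, the bucketing and routing options this hunk discusses form one small configuration block. A sketch, with illustrative values only (the option names are as given in the README above; `host` assumes `partition.url.mode: byHost`):

```
# Illustrative values; adjust to your partition.url.mode and collection layout.
solr.status.bucket.field: host
solr.status.bucket.maxsize: 100
# Cloud mode: route status documents using the same key as the partitioning.
solr.status.routing.fieldname: host
```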
diff --git a/external/solr/pom.xml b/external/solr/pom.xml
index 6a12def7..32bf57d6 100644
--- a/external/solr/pom.xml
+++ b/external/solr/pom.xml
@@ -34,7 +34,7 @@ under the License.
<name>stormcrawler-solr</name>
<url>
-
https://github.com/apache/incubator-stormcrawler/tree/master/external/solr</url>
+
https://github.com/apache/stormcrawler/tree/master/external/solr</url>
<description>Solr resources for StormCrawler</description>
<properties>
diff --git a/external/sql/README.md b/external/sql/README.md
index 880e477a..f5400731 100644
--- a/external/sql/README.md
+++ b/external/sql/README.md
@@ -2,9 +2,9 @@
Contains a spout implementation as well as a status updater bolt and a
MetricsConsumer.
-The
[tableCreation.script](https://github.com/apache/incubator-stormcrawler/blob/main/external/sql/tableCreation.script)
is based on MySQL and is used for the creation of the tables.
+The
[tableCreation.script](https://github.com/apache/stormcrawler/blob/main/external/sql/tableCreation.script)
is based on MySQL and is used for the creation of the tables.
-Check that you have specified a configuration file such as
[sql-conf.yaml](https://github.com/apache/incubator-stormcrawler/blob/master/external/sql/sql-conf.yaml)
and have a Java driver in the dependencies of your POM
+Check that you have specified a configuration file such as
[sql-conf.yaml](https://github.com/apache/stormcrawler/blob/master/external/sql/sql-conf.yaml)
and have a Java driver in the dependencies of your POM
```
<dependency>
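The diff context cuts the README's dependency snippet short after `<dependency>`. For illustration, a JDBC driver dependency in the POM typically looks like the sketch below; the MySQL coordinates are one common choice, matching the MySQL-based tableCreation.script mentioned above, and the version is an assumption to pin for your own setup:

```
<!-- Illustrative example: MySQL JDBC driver; swap in your database's driver. -->
<dependency>
    <groupId>com.mysql</groupId>
    <artifactId>mysql-connector-j</artifactId>
    <version>8.4.0</version>
</dependency>
```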
diff --git a/external/sql/pom.xml b/external/sql/pom.xml
index 3716b993..d53af174 100644
--- a/external/sql/pom.xml
+++ b/external/sql/pom.xml
@@ -33,7 +33,7 @@ under the License.
<packaging>jar</packaging>
<name>stormcrawler-sql</name>
-
<url>https://github.com/apache/incubator-stormcrawler/tree/master/external/sql</url>
+
<url>https://github.com/apache/stormcrawler/tree/master/external/sql</url>
<description>SQL-based resources for StormCrawler</description>
<dependencies>
diff --git a/external/tika/README.md b/external/tika/README.md
index b3703722..99d3ca6f 100644
--- a/external/tika/README.md
+++ b/external/tika/README.md
@@ -4,7 +4,7 @@ Contains a bolt implementation which uses [Apache
Tika](http://tika.apache.org/)
To use it alongside the JSoup parser i.e. let JSoup handle HTML content and
Tika do everything else, you need to configure the JSoupParser with
`jsoup.treat.non.html.as.error: false` so that documents that are not HTML
don't get failed but passed on.
-The next step is to use a
[RedirectionBolt](https://github.com/apache/incubator-stormcrawler/blob/master/external/tika/src/main/java/org/apache/stormcrawler/tika/RedirectionBolt.java)
to send documents which have not been parsed with Jsoup to Tika on a bespoke
stream called `tika`, finally the IndexingBolt needs to be connected to the
outputs of both `shunt` and `tika` on the default stream. `tika` must also be
connected to the StatusUpdaterBolt on the _status_ stream.
+The next step is to use a
[RedirectionBolt](https://github.com/apache/stormcrawler/blob/master/external/tika/src/main/java/org/apache/stormcrawler/tika/RedirectionBolt.java)
to send documents which have not been parsed with Jsoup to Tika on a bespoke
stream called `tika`, finally the IndexingBolt needs to be connected to the
outputs of both `shunt` and `tika` on the default stream. `tika` must also be
connected to the StatusUpdaterBolt on the _status_ stream.
```
builder.setBolt("jsoup", new JSoupParserBolt()).localOrShuffleGrouping(
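The routing idea behind the wiring described above — HTML stays with JSoup on the default stream, everything else is redirected to Tika on the bespoke `tika` stream — can be sketched in isolation. `targetStream` is a hypothetical helper for illustration, not part of the StormCrawler API:

```java
public class StreamRoutingSketch {
    // Hypothetical helper: pick the stream a document should be emitted on,
    // following the scheme in the README (JSoup handles HTML, Tika the rest).
    public static String targetStream(String mimeType) {
        if (mimeType != null && mimeType.toLowerCase().contains("html")) {
            return "default";
        }
        return "tika";
    }

    public static void main(String[] args) {
        System.out.println(targetStream("text/html"));
        System.out.println(targetStream("application/pdf"));
    }
}
```

In the real topology this decision is made by the RedirectionBolt linked above, and the `tika` stream must additionally be connected to the StatusUpdaterBolt as the README notes.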
diff --git a/external/tika/pom.xml b/external/tika/pom.xml
index 1b67824f..0e397bf5 100644
--- a/external/tika/pom.xml
+++ b/external/tika/pom.xml
@@ -33,7 +33,7 @@ under the License.
<packaging>jar</packaging>
<name>stormcrawler-tika</name>
-
<url>https://github.com/apache/incubator-stormcrawler/tree/master/external/tika</url>
+
<url>https://github.com/apache/stormcrawler/tree/master/external/tika</url>
<description>Tika-based parser bolt for StormCrawler</description>
<properties>
diff --git a/external/tika/src/test/java/org/apache/stormcrawler/tika/ParserBoltTest.java b/external/tika/src/test/java/org/apache/stormcrawler/tika/ParserBoltTest.java
index c728d1a4..3833491c 100644
--- a/external/tika/src/test/java/org/apache/stormcrawler/tika/ParserBoltTest.java
+++ b/external/tika/src/test/java/org/apache/stormcrawler/tika/ParserBoltTest.java
@@ -74,7 +74,7 @@ class ParserBoltTest extends ParsingTester {
/**
* Checks that the mimetype whitelists are handled correctly
*
- * @see https://github.com/apache/incubator-stormcrawler/issues/712
+ * @see https://github.com/apache/stormcrawler/issues/712
*/
void testMimeTypeWhileList() throws IOException {
Map conf = new HashMap();
diff --git a/external/urlfrontier/pom.xml b/external/urlfrontier/pom.xml
index 80c92575..3a9654e9 100644
--- a/external/urlfrontier/pom.xml
+++ b/external/urlfrontier/pom.xml
@@ -33,7 +33,7 @@ under the License.
<packaging>jar</packaging>
<name>stormcrawler-urlfrontier</name>
- <url>https://github.com/apache/incubator-stormcrawler/tree/master/external/urlfrontier</url>
+ <url>https://github.com/apache/stormcrawler/tree/master/external/urlfrontier</url>
<description>URL Frontier resources for StormCrawler</description>
<properties>
diff --git a/external/urlfrontier/src/main/java/org/apache/stormcrawler/urlfrontier/ManagedChannelUtil.java b/external/urlfrontier/src/main/java/org/apache/stormcrawler/urlfrontier/ManagedChannelUtil.java
index 360b04a8..8ca34747 100644
--- a/external/urlfrontier/src/main/java/org/apache/stormcrawler/urlfrontier/ManagedChannelUtil.java
+++ b/external/urlfrontier/src/main/java/org/apache/stormcrawler/urlfrontier/ManagedChannelUtil.java
@@ -27,7 +27,7 @@ import org.slf4j.LoggerFactory;
 /*
  * At some point we have to write a mechanism to share the same ManagedChannel in the same runtime
- * see: https://github.com/apache/incubator-stormcrawler/pull/982#issuecomment-1175272094
+ * see: https://github.com/apache/stormcrawler/pull/982#issuecomment-1175272094
 */
final class ManagedChannelUtil {
private ManagedChannelUtil() {}
diff --git a/external/warc/pom.xml b/external/warc/pom.xml
index cc3b0d6e..9e8c1545 100644
--- a/external/warc/pom.xml
+++ b/external/warc/pom.xml
@@ -33,7 +33,7 @@ under the License.
<packaging>jar</packaging>
<name>stormcrawler-warc</name>
- <url>https://github.com/apache/incubator-stormcrawler/tree/master/external/warc</url>
+ <url>https://github.com/apache/stormcrawler/tree/master/external/warc</url>
<description>WARC resources for StormCrawler</description>
<properties>
diff --git a/external/warc/src/main/java/org/apache/stormcrawler/warc/WARCRequestRecordFormat.java b/external/warc/src/main/java/org/apache/stormcrawler/warc/WARCRequestRecordFormat.java
index d8c8ec66..18ff0427 100644
--- a/external/warc/src/main/java/org/apache/stormcrawler/warc/WARCRequestRecordFormat.java
+++ b/external/warc/src/main/java/org/apache/stormcrawler/warc/WARCRequestRecordFormat.java
@@ -74,7 +74,7 @@ public class WARCRequestRecordFormat extends WARCRecordFormat {
/*
* The request record ID is stored in the metadata so that a WARC
* response record can later refer to it. Deactivated because of
- * https://github.com/apache/incubator-stormcrawler/issues/721
+ * https://github.com/apache/stormcrawler/issues/721
*/
// metadata.setValue("_request.warc_record_id_", mainID);
diff --git a/pom.xml b/pom.xml
index 5d60f88a..4de38f83 100644
--- a/pom.xml
+++ b/pom.xml
@@ -36,7 +36,7 @@ under the License.
<name>stormcrawler</name>
<description>A collection of resources for building low-latency, scalable web crawlers on Apache Storm.</description>
- <url>https://github.com/apache/incubator-stormcrawler</url>
+ <url>https://github.com/apache/stormcrawler</url>
<licenses>
<license>
@@ -46,16 +46,16 @@ under the License.
</licenses>
<scm>
- <connection>scm:git:https://github.com/apache/incubator-stormcrawler.git</connection>
+ <connection>scm:git:https://github.com/apache/stormcrawler.git</connection>
<developerConnection>
- scm:git:[email protected]:apache/incubator-stormcrawler.git</developerConnection>
- <url>https://github.com/apache/incubator-stormcrawler</url>
+ scm:git:[email protected]:apache/stormcrawler.git</developerConnection>
+ <url>https://github.com/apache/stormcrawler</url>
<tag>HEAD</tag>
</scm>
<issueManagement>
<system>GitHub Issues</system>
- <url>https://github.com/apache/incubator-stormcrawler/issues</url>
+ <url>https://github.com/apache/stormcrawler/issues</url>
</issueManagement>
<properties>
@@ -267,7 +267,7 @@ under the License.
<id>process-resource-bundles</id>
<configuration>
<properties>
- <projectName>Apache StormCrawler (incubating)</projectName>
+ <projectName>Apache StormCrawler</projectName>
</properties>
<resourceBundles>
<resourceBundle>org.apache.apache.resources:apache-jar-resource-bundle:${apache-jar-resource-bundle.version}</resourceBundle>