Thanks! I did some more searching and found that the issue in my case
was that commons-compress-1.27.1
(/opt/apache-storm-2.8.2/lib/commons-compress-1.27.1.jar) was ending up on
the classpath :-(
When I changed the Storm lib to 1.28.0, the issue was fixed. I have no
idea, though, why I am the only one experiencing this issue.
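
For anyone who wants to check the same thing, here is a minimal sketch that prints where a class was actually loaded from. The class name WhichJar and the helper locate are hypothetical names chosen for illustration, and the default class name assumes commons-compress is on the classpath. It only tells you something useful when it runs with the same classpath as the topology.

```java
import java.security.CodeSource;

public class WhichJar {
    /** Returns the location a class was loaded from, or a short note if it cannot be determined. */
    static String locate(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // JDK classes loaded by the bootstrap loader have no code source.
            if (source == null || source.getLocation() == null) {
                return "(bootstrap or unknown location)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Default to a commons-compress class; pass another class name as the first argument.
        String name = args.length > 0
                ? args[0]
                : "org.apache.commons.compress.archivers.ArchiveStreamFactory";
        System.out.println(name + " -> " + locate(name));
    }
}
```

If the printed location points into the Storm lib directory rather than into the uber jar, the Storm-provided copy is the one being used.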
Markos
On 9/11/25 18:11, Richard Zowalla wrote:
I will try to reproduce it in the evening with the snippets / sample project
and steps you have provided :-)
On 11 September 2025 17:09:40 CEST, Markos Volikas
<[email protected]> wrote:
I have attached it. It only contains 1.28.0, but my Maven repository has many
versions that were fetched when building SC from source and, to be honest, I
don't understand why this happens.
I'm also not completely sure what happens when submitting the jar, since Storm
itself depends on another version of commons-compress.
/opt/apache-storm-2.8.2/bin/storm local target/test-1.0-SNAPSHOT.jar
org.apache.storm.flux.Flux crawler.flux --local-ttl 3600
I hope this is not a silly mistake that is wasting your time :-)
On 9/11/25 17:56, Richard Zowalla wrote:
What does your mvn dependency:tree show? :-)
The only thing that needs to be cleaned is the locally installed SC.
On 11 September 2025 16:48:53 CEST, Markos Volikas
<[email protected]> wrote:
Yes..
I'm building from source using:
https://dist.apache.org/repos/dist/dev/stormcrawler/stormcrawler-3.5.0-RC2/
(tar.gz)
I completely removed
/home/markos/.m2/repository/org/apache/commons/commons-compress and then ran
mvn clean install, and it seems that multiple versions are still being pulled in.
Before this I had also removed my .m2/ completely to make sure all dependencies
were downloaded again, and they were. I have attached the build log.
markos@nombat:~/.m2/repository/org/apache/commons/commons-compress$ ll
total 28
drwxrwxr-x 7 markos markos 4096 Sep 11 17:42 ./
drwxrwxr-x 12 markos markos 4096 Sep 11 17:42 ../
drwxrwxr-x 2 markos markos 4096 Sep 11 17:42 1.20/
drwxrwxr-x 2 markos markos 4096 Sep 11 17:42 1.26.1/
drwxrwxr-x 2 markos markos 4096 Sep 11 17:42 1.26.2/
drwxrwxr-x 2 markos markos 4096 Sep 11 17:42 1.27.1/
drwxrwxr-x 2 markos markos 4096 Sep 11 17:42 1.28.0/
Markos
On 9/11/25 16:55, Richard Zowalla wrote:
Did you clean your local Maven repo before building the uber jar?
Can you check your commons-compress version?
Regards
Richard
On 11 September 2025 15:38:38 CEST, Markos Volikas
<[email protected]> wrote:
Hi all,
I'm afraid I'm still getting:
16:25:13.829 [Thread-46-parse-executor[6, 6]] INFO o.a.s.b.JSoupParserBolt - Parsing : starting https://apache.org/
16:25:13.848 [Thread-46-parse-executor[6, 6]] ERROR o.a.s.b.JSoupParserBolt - Exception while guessing mimetype on https://apache.org/: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
I'm running in local mode with Storm 2.8.2 running on Ubuntu 24.04 (openjdk
17.0.16 2025-07-15). The database is Solr running in Docker although this
should be irrelevant. Maybe I'm doing something wrong? I have attached the
config I'm using in case you have any ideas. Sorry for the delay, but I just
found time to look into this again :-(
Markos
On 9/8/25 20:46, Richard Zowalla wrote:
Hi folks,
I have posted a 2nd release candidate for the Apache StormCrawler 3.5.0 release
and it is ready for testing. The regression with Tika / Compress was fixed.
Apache StormCrawler 3.5.0 decouples Selenium from the core module, improving
modularity and reducing unnecessary dependencies.
The release also introduces an advanced metadata filtering system that supports complex
logical operations like key=>val OR (key2=>val2 AND key3=>val3).
Additionally, multiple dependencies were upgraded, core tests improved, and
deprecated code cleaned up, enhancing overall stability and maintainability.
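
To make the logic of an expression like key=>val OR (key2=>val2 AND key3=>val3) concrete, here is a rough sketch using plain java.util.function.Predicate composition over a metadata map. This is not StormCrawler's implementation or its configuration syntax. The class MetadataFilterSketch, the helper has, and the map-of-lists representation are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class MetadataFilterSketch {
    /** Hypothetical helper: true if the metadata holds the given value under the given key. */
    static Predicate<Map<String, List<String>>> has(String key, String value) {
        return md -> md.getOrDefault(key, List.of()).contains(value);
    }

    // Mirrors: key=>val OR (key2=>val2 AND key3=>val3)
    static final Predicate<Map<String, List<String>>> FILTER =
            has("key", "val").or(has("key2", "val2").and(has("key3", "val3")));

    public static void main(String[] args) {
        Map<String, List<String>> metadata = Map.of(
                "key2", List.of("val2"),
                "key3", List.of("val3"));
        System.out.println("matches: " + FILTER.test(metadata));
    }
}
```

The parenthesised part is built first with and(...), so it is evaluated as a unit before being combined with the first condition through or(...).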
Thank you to everyone who contributed to this release, including all of our
users and the people who submitted bug reports,
contributed code or documentation enhancements.
The release was made using the Apache StormCrawler release process, documented
here:
https://github.com/apache/stormcrawler/blob/main/RELEASING.md
Source:
https://dist.apache.org/repos/dist/dev/stormcrawler/stormcrawler-3.5.0-RC2
Tag:
https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0
Commit Hash:
1947ad4c56ff5c5c90e093900a163e0ac3144bb6
Maven Repo:
https://repository.apache.org/content/repositories/orgapachestormcrawler-1011
<repositories>
  <repository>
    <id>stormcrawler-3.5.0-rc2</id>
    <name>Testing StormCrawler 3.5.0 release candidate 2</name>
    <url>https://repository.apache.org/content/repositories/orgapachestormcrawler-1011</url>
  </repository>
</repositories>
Release notes:
https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0
Reminder: The up-to-date KEYS file for signature verification can be
found here: https://downloads.apache.org/stormcrawler/KEYS
Please vote on releasing these packages as Apache StormCrawler 3.5.0.
The vote is open for at least the next 72 hours.
Only votes from the StormCrawler PMC are binding, but everyone is welcome to
check the release candidate and vote.
The vote passes if at least three binding +1 votes are cast.
Please VOTE
[+1] go ship it
[+0] meh, don't care
[-1] stop, there is a ${showstopper}
Thanks!
Richard