Thanks! I did some more searching and found that the issue in my case was that commons-compress-1.27.1 (/opt/apache-storm-2.8.2/lib/commons-compress-1.27.1.jar) was ending up on the classpath :-(

When I changed the commons-compress jar in the Storm lib directory to 1.28.0, the issue was fixed. I have no idea, though, why I am the only one experiencing this issue.
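
For anyone else who runs into this, here is a minimal sketch of how one might check which commons-compress jars sit in the Storm lib directory. The function name find_compress_versions is hypothetical, and the sketch assumes the jars follow the usual commons-compress-<version>.jar naming seen in this thread.

```python
import re
from pathlib import Path

# Matches file names such as commons-compress-1.27.1.jar (assumed naming scheme).
JAR_PATTERN = re.compile(r"^commons-compress-(\d+(?:\.\d+)*)\.jar$")


def find_compress_versions(lib_dir):
    """Return the commons-compress versions found as jars directly in lib_dir."""
    versions = []
    for jar in Path(lib_dir).glob("commons-compress-*.jar"):
        match = JAR_PATTERN.match(jar.name)
        if match:
            versions.append(match.group(1))
    # Sort numerically so that 1.9 comes before 1.10.
    return sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


if __name__ == "__main__":
    # Path taken from this thread; adjust it to your own Storm installation.
    print(find_compress_versions("/opt/apache-storm-2.8.2/lib"))
```

If the version listed there differs from the one your topology jar was built against, the Storm copy may be the one that ends up being loaded.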

Markos

On 9/11/25 18:11, Richard Zowalla wrote:
I will try to reproduce it in the evening with the snippets / sample project 
and steps you have provided :-)

On 11 September 2025 17:09:40 CEST, Markos Volikas 
<[email protected]> wrote:
I have attached it. It only contains 1.28.0, but my local Maven repository has 
many versions that were fetched when building SC from source, and to be honest 
I don't understand why this happens.

I'm also not completely sure what happens when submitting the jar, since Storm 
itself depends on another version of commons-compress.

/opt/apache-storm-2.8.2/bin/storm local target/test-1.0-SNAPSHOT.jar 
org.apache.storm.flux.Flux crawler.flux --local-ttl 3600
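
One way to see what the submitted jar itself carries is to look inside it. The following is a minimal sketch, not a definitive check: inspect_jar is a hypothetical helper, and it assumes the uber jar still contains the META-INF/maven pom.properties entry of commons-compress, which a shading configuration may filter out.

```python
import zipfile

# Package prefix of the commons-compress classes inside a jar.
COMPRESS_PREFIX = "org/apache/commons/compress/"
# Maven metadata entry that usually records the bundled version (assumption:
# the build did not strip it).
POM_PROPS = "META-INF/maven/org.apache.commons/commons-compress/pom.properties"


def inspect_jar(jar_path):
    """Return (number of commons-compress classes, bundled version or None)."""
    with zipfile.ZipFile(jar_path) as jar:
        names = jar.namelist()
        class_count = sum(
            1 for n in names
            if n.startswith(COMPRESS_PREFIX) and n.endswith(".class")
        )
        version = None
        if POM_PROPS in names:
            for line in jar.read(POM_PROPS).decode("utf-8").splitlines():
                if line.startswith("version="):
                    version = line.split("=", 1)[1].strip()
    return class_count, version


if __name__ == "__main__":
    # Jar name taken from the command above.
    print(inspect_jar("target/test-1.0-SNAPSHOT.jar"))
```

A non-zero class count means the jar bundles its own copy of commons-compress, which can then be compared with the version in the Storm lib directory.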

I hope this is not a silly mistake and that I'm not wasting your time :-)

On 9/11/25 17:56, Richard Zowalla wrote:
What does your mvn dependency:tree output show? :-)

The only thing that needs to be cleaned is the locally installed SC.



On 11 September 2025 16:48:53 CEST, Markos Volikas 
<[email protected]> wrote:
Yes..

I'm building from source using: 
https://dist.apache.org/repos/dist/dev/stormcrawler/stormcrawler-3.5.0-RC2/ 
(tar.gz)

I completely removed 
/home/markos/.m2/repository/org/apache/commons/commons-compress and then ran 
mvn clean install and it seems that multiple versions are getting in.

Before this I had also removed my .m2/ completely to make sure all dependencies 
were downloaded again, and they were. I have attached the build log.

markos@nombat:~/.m2/repository/org/apache/commons/commons-compress$ ll
total 28
drwxrwxr-x  7 markos markos 4096 Sep 11 17:42 ./
drwxrwxr-x 12 markos markos 4096 Sep 11 17:42 ../
drwxrwxr-x  2 markos markos 4096 Sep 11 17:42 1.20/
drwxrwxr-x  2 markos markos 4096 Sep 11 17:42 1.26.1/
drwxrwxr-x  2 markos markos 4096 Sep 11 17:42 1.26.2/
drwxrwxr-x  2 markos markos 4096 Sep 11 17:42 1.27.1/
drwxrwxr-x  2 markos markos 4096 Sep 11 17:42 1.28.0/

Markos

On 9/11/25 16:55, Richard Zowalla wrote:
Cleaned your local Maven repo before building the uber jar?

Can you check your compress version?

Regards
Richard

On 11 September 2025 15:38:38 CEST, Markos Volikas 
<[email protected]> wrote:
Hi all,

I'm afraid I'm still getting:

16:25:13.829 [Thread-46-parse-executor[6, 6]] INFO  o.a.s.b.JSoupParserBolt - 
Parsing : starting https://apache.org/
16:25:13.848 [Thread-46-parse-executor[6, 6]] ERROR o.a.s.b.JSoupParserBolt - 
Exception while guessing mimetype on https://apache.org/: 
org.apache.commons.compress.archivers.ArchiveException: No Archiver found for 
the stream signature

I'm running in local mode with Storm 2.8.2 running on Ubuntu 24.04 (openjdk 
17.0.16 2025-07-15). The database is Solr running in Docker although this 
should be irrelevant. Maybe I'm doing something wrong? I have attached the 
config I'm using in case you have any ideas. Sorry for the delay, but I just 
found time to look into this again :-(

Markos

On 9/8/25 20:46, Richard Zowalla wrote:
Hi folks,

I have posted a 2nd release candidate for the Apache StormCrawler 3.5.0 release 
and it is ready for testing. The regression with Tika / Compress was fixed.

Apache StormCrawler 3.5.0 decouples Selenium from the core module, improving 
modularity and reducing unnecessary dependencies.
The release also introduces an advanced metadata filtering system that supports complex 
logical operations like key=>val OR (key2=>val2 AND key3=>val3).
Additionally, multiple dependencies were upgraded, core tests improved, and 
deprecated code cleaned up, enhancing overall stability and maintainability.

Thank you to everyone who contributed to this release, including all of our 
users and the people who submitted bug reports
or contributed code or documentation enhancements.

The release was made using the Apache StormCrawler release process, documented 
here:
https://github.com/apache/stormcrawler/blob/main/RELEASING.md

Source:

https://dist.apache.org/repos/dist/dev/stormcrawler/stormcrawler-3.5.0-RC2

Tag:

https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0

Commit Hash:

1947ad4c56ff5c5c90e093900a163e0ac3144bb6

Maven Repo:

https://repository.apache.org/content/repositories/orgapachestormcrawler-1011

<repositories>
  <repository>
    <id>stormcrawler-3.5.0-rc2</id>
    <name>Testing StormCrawler 3.5.0 release candidate 2</name>
    <url>https://repository.apache.org/content/repositories/orgapachestormcrawler-1011</url>
  </repository>
</repositories>

Release notes:

https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0

Reminder: The up-to-date KEYS file for signature verification can be
found here: https://downloads.apache.org/stormcrawler/KEYS

Please vote on releasing these packages as Apache StormCrawler 3.5.0.
The vote is open for at least the next 72 hours.

Only votes from the StormCrawler PMC are binding, but everyone is welcome to 
check the release candidate and vote.
The vote passes if at least three binding +1 votes are cast.

Please VOTE

[+1] go ship it
[+0] meh, don't care
[-1] stop, there is a ${showstopper}

Thanks!
Richard
