Hi all, I'm afraid I'm still getting:
16:25:13.829 [Thread-46-parse-executor[6, 6]] INFO o.a.s.b.JSoupParserBolt - Parsing : starting https://apache.org/
16:25:13.848 [Thread-46-parse-executor[6, 6]] ERROR o.a.s.b.JSoupParserBolt - Exception while guessing mimetype on https://apache.org/: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
I'm running in local mode with Storm 2.8.2 on Ubuntu 24.04 (openjdk 17.0.16 2025-07-15). The database is Solr running in Docker, although this should be irrelevant. Maybe I'm doing something wrong? I have attached the config I'm using in case you have any ideas. Sorry for the delay, but I only just found time to look into this again :-(
Markos

On 9/8/25 20:46, Richard Zowalla wrote:
Hi folks,

I have posted a 2nd release candidate for the Apache StormCrawler 3.5.0 release and it is ready for testing. The regression with Tika / Compress was fixed.

Apache StormCrawler 3.5.0 decouples Selenium from the core module, improving modularity and reducing unnecessary dependencies. The release also introduces an advanced metadata filtering system that supports complex logical operations like key=>val OR (key2=>val2 AND key3=>val3). Additionally, multiple dependencies were upgraded, core tests improved, and deprecated code cleaned up, enhancing overall stability and maintainability.

Thank you to everyone who contributed to this release, including all of our users and the people who submitted bug reports, contributed code or documentation enhancements.

The release was made using the Apache StormCrawler release process, documented here:
https://github.com/apache/stormcrawler/blob/main/RELEASING.md

Source:
https://dist.apache.org/repos/dist/dev/stormcrawler/stormcrawler-3.5.0-RC2

Tag:
https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0

Commit Hash:
1947ad4c56ff5c5c90e093900a163e0ac3144bb6

Maven Repo:
https://repository.apache.org/content/repositories/orgapachestormcrawler-1011

<repositories>
  <repository>
    <id>stormcrawler-3.5.0-rc2</id>
    <name>Testing StormCrawler 3.5.0 release candidate 2</name>
    <url>https://repository.apache.org/content/repositories/orgapachestormcrawler-1011</url>
  </repository>
</repositories>

Release notes:
https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0

Reminder: The up-to-date KEYS file for signature verification can be found here:
https://downloads.apache.org/stormcrawler/KEYS

Please vote on releasing these packages as Apache StormCrawler 3.5.0.

The vote is open for at least the next 72 hours.

Only votes from the StormCrawler PMC are binding, but everyone is welcome to check the release candidate and vote.

The vote passes if at least three binding +1 votes are cast.

Please VOTE

[+1] go ship it
[+0] meh, don't care
[-1] stop, there is a ${showstopper}

Thanks!
Richard
<<attachment: markos-config-2025-09-11.zip>>