I will try to reproduce it in the evening with the snippets / sample project and steps you have provided :-)
On 11 September 2025 17:09:40 CEST, Markos Volikas <[email protected]> wrote:
>I have attached it. It only contains 1.28.0, but my maven repository has many
>versions that were fetched when building SC from source and I don't understand
>why this happens to be honest.
>
>I'm also not completely sure what happens when submitting the jar since storm
>itself depends on another version of compress..
>
>/opt/apache-storm-2.8.2/bin/storm local target/test-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux crawler.flux --local-ttl 3600
>
>I hope this is not a silly mistake and I'm wasting your time :-)
>
>On 9/11/25 17:56, Richard Zowalla wrote:
>> What does your mvn dependency:tree tell? :-)
>>
>> The only thing that needs to be cleaned is the locally installed SC.
>>
>> On 11 September 2025 16:48:53 CEST, Markos Volikas
>> <[email protected]> wrote:
>>> Yes..
>>>
>>> I'm building from source using:
>>> https://dist.apache.org/repos/dist/dev/stormcrawler/stormcrawler-3.5.0-RC2/
>>> (tar.gz)
>>>
>>> I completely removed
>>> /home/markos/.m2/repository/org/apache/commons/commons-compress and then
>>> ran mvn clean install and it seems that multiple versions are getting in.
>>>
>>> Before this I had also removed my .m2/ completely to make sure all
>>> dependencies are downloaded and they did. I have attached the build log.
>>>
>>> markos@nombat:~/.m2/repository/org/apache/commons/commons-compress$ ll
>>> total 28
>>> drwxrwxr-x  7 markos markos 4096 Sep 11 17:42 ./
>>> drwxrwxr-x 12 markos markos 4096 Sep 11 17:42 ../
>>> drwxrwxr-x  2 markos markos 4096 Sep 11 17:42 1.20/
>>> drwxrwxr-x  2 markos markos 4096 Sep 11 17:42 1.26.1/
>>> drwxrwxr-x  2 markos markos 4096 Sep 11 17:42 1.26.2/
>>> drwxrwxr-x  2 markos markos 4096 Sep 11 17:42 1.27.1/
>>> drwxrwxr-x  2 markos markos 4096 Sep 11 17:42 1.28.0/
>>>
>>> Markos
>>>
>>> On 9/11/25 16:55, Richard Zowalla wrote:
>>>> Cleaned your local Maven repo before building the uber jar?
>>>>
>>>> Can you check your compress version?
>>>>
>>>> Regards
>>>> Richard
>>>>
>>>> On 11 September 2025 15:38:38 CEST, Markos Volikas
>>>> <[email protected]> wrote:
>>>>> Hi all,
>>>>>
>>>>> I'm afraid I'm still getting:
>>>>>
>>>>> 16:25:13.829 [Thread-46-parse-executor[6, 6]] INFO
>>>>> o.a.s.b.JSoupParserBolt - Parsing : starting https://apache.org/
>>>>> 16:25:13.848 [Thread-46-parse-executor[6, 6]] ERROR
>>>>> o.a.s.b.JSoupParserBolt - Exception while guessing mimetype on
>>>>> https://apache.org/:
>>>>> org.apache.commons.compress.archivers.ArchiveException: No Archiver found
>>>>> for the stream signature
>>>>>
>>>>> I'm running in local mode with Storm 2.8.2 running on Ubuntu 24.04
>>>>> (openjdk 17.0.16 2025-07-15). The database is Solr running in Docker
>>>>> although this should be irrelevant. Maybe I'm doing something wrong? I
>>>>> have attached the config I'm using in case you have any ideas. Sorry for
>>>>> the delay, but I just found time to look into this again :-(
>>>>>
>>>>> Markos
>>>>>
>>>>> On 9/8/25 20:46, Richard Zowalla wrote:
>>>>>> Hi folks,
>>>>>>
>>>>>> I have posted a 2nd release candidate for the Apache StormCrawler 3.5.0
>>>>>> release and it is ready for testing. The regression with Tika / Compress
>>>>>> was fixed.
>>>>>>
>>>>>> Apache StormCrawler 3.5.0 decouples Selenium from the core module,
>>>>>> improving modularity and reducing unnecessary dependencies.
>>>>>> The release also introduces an advanced metadata filtering system that
>>>>>> supports complex logical operations like key=>val OR (key2=>val2 AND
>>>>>> key3=>val3).
>>>>>> Additionally, multiple dependencies were upgraded, core tests improved,
>>>>>> and deprecated code cleaned up, enhancing overall stability and
>>>>>> maintainability.
>>>>>>
>>>>>> Thank you to everyone who contributed to this release, including all of
>>>>>> our users and the people who submitted bug reports,
>>>>>> contributed code or documentation enhancements.
>>>>>>
>>>>>> The release was made using the Apache StormCrawler release process,
>>>>>> documented here:
>>>>>> https://github.com/apache/stormcrawler/blob/main/RELEASING.md
>>>>>>
>>>>>> Source:
>>>>>>
>>>>>> https://dist.apache.org/repos/dist/dev/stormcrawler/stormcrawler-3.5.0-RC2
>>>>>>
>>>>>> Tag:
>>>>>>
>>>>>> https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0
>>>>>>
>>>>>> Commit Hash:
>>>>>>
>>>>>> 1947ad4c56ff5c5c90e093900a163e0ac3144bb6
>>>>>>
>>>>>> Maven Repo:
>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachestormcrawler-1011
>>>>>>
>>>>>> <repositories>
>>>>>>   <repository>
>>>>>>     <id>stormcrawler-3.5.0-rc2</id>
>>>>>>     <name>Testing StormCrawler 3.5.0 release candidate 2</name>
>>>>>>     <url>https://repository.apache.org/content/repositories/orgapachestormcrawler-1011</url>
>>>>>>   </repository>
>>>>>> </repositories>
>>>>>>
>>>>>> Release notes:
>>>>>>
>>>>>> https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0
>>>>>>
>>>>>> Reminder: The up-2-date KEYS file for signature verification can be
>>>>>> found here: https://downloads.apache.org/stormcrawler/KEYS
>>>>>>
>>>>>> Please vote on releasing these packages as Apache StormCrawler 3.5.0
>>>>>> The vote is open for at least the next 72 hours.
>>>>>>
>>>>>> Only votes from the StormCrawler PMC are binding, but everyone is
>>>>>> welcome to check the release candidate and vote.
>>>>>> The vote passes if at least three binding +1 votes are cast.
>>>>>>
>>>>>> Please VOTE
>>>>>>
>>>>>> [+1] go ship it
>>>>>> [+0] meh, don't care
>>>>>> [-1] stop, there is a ${showstopper}
>>>>>>
>>>>>> Thanks!
>>>>>> Richard
