I'm sorry for this mess. Tika 3.2.3 is now under vote [0]. That release allows for backward compatibility with commons-compress < 1.28.0.
Dependency management is important, but there are better ways to identify issues than this. :(

[0] https://lists.apache.org/thread/px1stbwnbgx301y4sg6yxycrmcqt27gf

On Thu, Sep 11, 2025 at 11:59 AM Markos Volikas wrote:
>
> Thanks! I did some more searching and found that the issue in my case
> was that commons-compress-1.27.1
> (/opt/apache-storm-2.8.2/lib/commons-compress-1.27.1.jar) was ending up on
> the classpath :-(
>
> When I changed the Storm lib to 1.28.0 the issue was fixed. I have no
> idea though why I am the only one experiencing this issue.
>
> Markos
>
> On 9/11/25 18:11, Richard Zowalla wrote:
> > I will try to reproduce it in the evening with the snippets / sample
> > project and steps you have provided :-)
> >
> > On September 11, 2025 at 17:09:40 CEST, Markos Volikas wrote:
> >> I have attached it. It only contains 1.28.0, but my Maven repository has
> >> many versions that were fetched when building SC from source and I don't
> >> understand why this happens, to be honest.
> >>
> >> I'm also not completely sure what happens when submitting the jar, since
> >> Storm itself depends on another version of compress..
> >>
> >> /opt/apache-storm-2.8.2/bin/storm local target/test-1.0-SNAPSHOT.jar
> >> org.apache.storm.flux.Flux crawler.flux --local-ttl 3600
> >>
> >> I hope this is not a silly mistake and that I'm not wasting your time :-)
> >>
> >> On 9/11/25 17:56, Richard Zowalla wrote:
> >>> What does your mvn dependency:tree show? :-)
> >>>
> >>> The only thing that needs to be cleaned is the locally installed SC.
> >>>
> >>>
> >>>
> >>> On September 11, 2025 at 16:48:53 CEST, Markos Volikas wrote:
> >>>> Yes..
> >>>>
> >>>> I'm building from source using:
> >>>> https://dist.apache.org/repos/dist/dev/stormcrawler/stormcrawler-3.5.0-RC2/
> >>>> (tar.gz)
> >>>>
> >>>> I completely removed
> >>>> /home/markos/.m2/repository/org/apache/commons/commons-compress and then
> >>>> ran mvn clean install and it seems that multiple versions are getting in.
> >>>>
> >>>> Before this I had also removed my .m2/ completely to make sure all
> >>>> dependencies were downloaded, and they were. I have attached the build log.
> >>>>
> >>>> markos@nombat:~/.m2/repository/org/apache/commons/commons-compress$ ll
> >>>> total 28
> >>>> drwxrwxr-x 7 markos markos 4096 Sep 11 17:42 ./
> >>>> drwxrwxr-x 12 markos markos 4096 Sep 11 17:42 ../
> >>>> drwxrwxr-x 2 markos markos 4096 Sep 11 17:42 1.20/
> >>>> drwxrwxr-x 2 markos markos 4096 Sep 11 17:42 1.26.1/
> >>>> drwxrwxr-x 2 markos markos 4096 Sep 11 17:42 1.26.2/
> >>>> drwxrwxr-x 2 markos markos 4096 Sep 11 17:42 1.27.1/
> >>>> drwxrwxr-x 2 markos markos 4096 Sep 11 17:42 1.28.0/
> >>>>
> >>>> Markos
> >>>>
> >>>> On 9/11/25 16:55, Richard Zowalla wrote:
> >>>>> Did you clean your local Maven repo before building the uber jar?
> >>>>>
> >>>>> Can you check your compress version?
> >>>>>
> >>>>> Regards
> >>>>> Richard
> >>>>>
> >>>>> On September 11, 2025 at 15:38:38 CEST, Markos Volikas wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I'm afraid I'm still getting:
> >>>>>>
> >>>>>> 16:25:13.829 [Thread-46-parse-executor[6, 6]] INFO
> >>>>>> o.a.s.b.JSoupParserBolt - Parsing : starting https://apache.org/
> >>>>>> 16:25:13.848 [Thread-46-parse-executor[6, 6]] ERROR
> >>>>>> o.a.s.b.JSoupParserBolt - Exception while guessing mimetype on
> >>>>>> https://apache.org/:
> >>>>>> org.apache.commons.compress.archivers.ArchiveException: No Archiver
> >>>>>> found for the stream signature
> >>>>>>
> >>>>>> I'm running in local mode with Storm 2.8.2 running on Ubuntu 24.04
> >>>>>> (openjdk 17.0.16 2025-07-15).
> >>>>>> The database is Solr running in Docker
> >>>>>> although this should be irrelevant. Maybe I'm doing something wrong? I
> >>>>>> have attached the config I'm using in case you have any ideas. Sorry
> >>>>>> for the delay, but I just found time to look into this again :-(
> >>>>>>
> >>>>>> Markos
> >>>>>>
> >>>>>> On 9/8/25 20:46, Richard Zowalla wrote:
> >>>>>>> Hi folks,
> >>>>>>>
> >>>>>>> I have posted a 2nd release candidate for the Apache StormCrawler
> >>>>>>> 3.5.0 release and it is ready for testing. The regression with Tika /
> >>>>>>> Compress was fixed.
> >>>>>>>
> >>>>>>> Apache StormCrawler 3.5.0 decouples Selenium from the core module,
> >>>>>>> improving modularity and reducing unnecessary dependencies.
> >>>>>>> The release also introduces an advanced metadata filtering system
> >>>>>>> that supports complex logical operations like key=>val OR (key2=>val2
> >>>>>>> AND key3=>val3).
> >>>>>>> Additionally, multiple dependencies were upgraded, core tests
> >>>>>>> improved, and deprecated code cleaned up, enhancing overall stability
> >>>>>>> and maintainability.
> >>>>>>>
> >>>>>>> Thank you to everyone who contributed to this release, including all
> >>>>>>> of our users and the people who submitted bug reports or
> >>>>>>> contributed code or documentation enhancements.
> >>>>>>>
> >>>>>>> The release was made using the Apache StormCrawler release process,
> >>>>>>> documented here:
> >>>>>>> https://github.com/apache/stormcrawler/blob/main/RELEASING.md
> >>>>>>>
> >>>>>>> Source:
> >>>>>>>
> >>>>>>> https://dist.apache.org/repos/dist/dev/stormcrawler/stormcrawler-3.5.0-RC2
> >>>>>>>
> >>>>>>> Tag:
> >>>>>>>
> >>>>>>> https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0
> >>>>>>>
> >>>>>>> Commit Hash:
> >>>>>>>
> >>>>>>> 1947ad4c56ff5c5c90e093900a163e0ac3144bb6
> >>>>>>>
> >>>>>>> Maven Repo:
> >>>>>>>
> >>>>>>> https://repository.apache.org/content/repositories/orgapachestormcrawler-1011
> >>>>>>>
> >>>>>>> <repositories>
> >>>>>>>   <repository>
> >>>>>>>     <id>stormcrawler-3.5.0-rc2</id>
> >>>>>>>     <name>Testing StormCrawler 3.5.0 release candidate 2</name>
> >>>>>>>     <url>
> >>>>>>>       https://repository.apache.org/content/repositories/orgapachestormcrawler-1011
> >>>>>>>     </url>
> >>>>>>>   </repository>
> >>>>>>> </repositories>
> >>>>>>>
> >>>>>>> Release notes:
> >>>>>>>
> >>>>>>> https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0
> >>>>>>>
> >>>>>>> Reminder: The up-to-date KEYS file for signature verification can be
> >>>>>>> found here: https://downloads.apache.org/stormcrawler/KEYS
> >>>>>>>
> >>>>>>> Please vote on releasing these packages as Apache StormCrawler 3.5.0.
> >>>>>>> The vote is open for at least the next 72 hours.
> >>>>>>>
> >>>>>>> Only votes from the StormCrawler PMC are binding, but everyone is
> >>>>>>> welcome to check the release candidate and vote.
> >>>>>>> The vote passes if at least three binding +1 votes are cast.
> >>>>>>>
> >>>>>>> Please VOTE
> >>>>>>>
> >>>>>>> [+1] go ship it
> >>>>>>> [+0] meh, don't care
> >>>>>>> [-1] stop, there is a ${showstopper}
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>> Richard
