Hi,
Thanks for testing. I think this actually warrants a release re-roll,
because SC would be broken if metadata detection is enabled.
Can you open an issue for it? We should postpone the release until we have
fixed that.
I remember that exception from another project. In the end, we should not use
Tika's Detector but a TikaInputStream instead, like this:
try (TikaInputStream tis = TikaInputStream.get(data)) {
    final Metadata metadata = new Metadata();
    metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, file.getFileName());
    final MediaType mediaType = MimeTypes.getDefaultMimeTypes().detect(tis, metadata);
}
Regards
Richard
On 7 September 2025 at 15:50:06 CEST, Markos Volikas <[email protected]> wrote:
>Hi everyone,
>
>Hash and building from source are ok.
>
>However, when running a crawl with the single seed "https://apache.org/", I'm
>getting the following error from the JsoupParserBolt:
>
>"Exception while guessing mimetype on https://apache.org/:
>org.apache.commons.compress.archivers.ArchiveException: No Archiver found for
>the stream signature"
>
>This was not the case for stormcrawler-3.4.0. It seems to be caused by Tika's
>detector when we do MediaType mt = detector.detect(stream, metadata);
>
>Markos
>
>On 9/6/25 11:52, Richard Zowalla wrote:
>> Hi folks,
>>
>> I have posted a first release candidate for the Apache StormCrawler 3.5.0
>> release and it is ready for testing.
>>
>> Apache StormCrawler 3.5.0 decouples Selenium from the core module, improving
>> modularity and reducing unnecessary dependencies.
>> The release also introduces an advanced metadata filtering system that
>> supports complex logical operations like key=>val OR (key2=>val2 AND
>> key3=>val3).
>> Additionally, multiple dependencies were upgraded, core tests improved, and
>> deprecated code cleaned up, enhancing overall stability and maintainability.
>>
>> Thank you to everyone who contributed to this release, including all of our
>> users and the people who submitted bug reports,
>> contributed code or documentation enhancements.
>>
>> The release was made using the Apache StormCrawler release process,
>> documented here:
>> https://github.com/apache/stormcrawler/blob/main/RELEASING.md
>>
>> Source:
>>
>> https://dist.apache.org/repos/dist/dev/stormcrawler/stormcrawler-3.5.0-RC1
>>
>> Tag:
>>
>> https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0
>>
>> Commit Hash:
>>
>> 8d517ad6c6da32fc307106f8b0b9de4b6df48585
>>
>> Maven Repo:
>>
>> https://repository.apache.org/content/repositories/orgapachestormcrawler-1009
>>
>> <repositories>
>>   <repository>
>>     <id>stormcrawler-3.5.0-rc1</id>
>>     <name>Testing StormCrawler 3.5.0 release candidate</name>
>>     <url>
>>       https://repository.apache.org/content/repositories/orgapachestormcrawler-1009
>>     </url>
>>   </repository>
>> </repositories>
>>
>> Release notes:
>>
>> https://github.com/apache/stormcrawler/releases/tag/stormcrawler-3.5.0
>>
>> Reminder: The up-2-date KEYS file for signature verification can be
>> found here: https://downloads.apache.org/stormcrawler/KEYS
>>
>> Please vote on releasing these packages as Apache StormCrawler 3.5.0.
>> The vote is open for at least the next 72 hours.
>>
>> Only votes from the StormCrawler PMC are binding, but everyone is welcome to
>> check the release candidate and vote.
>> The vote passes if at least three binding +1 votes are cast.
>>
>> Please VOTE
>>
>> [+1] go ship it
>> [+0] meh, don't care
>> [-1] stop, there is a ${showstopper}
>>
>> Thanks!
>> Richard