Re: [I] Tika's detector breaks mime type guessing [stormcrawler]

via GitHub Sun, 07 Sep 2025 07:41:48 -0700


tballison commented on issue #1650:
URL: https://github.com/apache/stormcrawler/issues/1650#issuecomment-3263821646


   The linked issue explains that Tika expects the most recent version of
   commons-compress, or at least whatever version we depended on in 3.2.2.
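
A quick way to see which commons-compress the classpath actually resolves is to ask the JVM directly. The sketch below is only a diagnostic idea, not part of Tika or StormCrawler: the class and method names (`CompressClasspathCheck`, `hierarchyOf`, `locationOf`) are made up for illustration, and the only name taken from this thread is the exception class in the stack trace.

```java
import java.io.IOException;
import java.security.CodeSource;

public class CompressClasspathCheck {

    /** Describes how the named class relates to IOException on the current classpath. */
    static String hierarchyOf(String className) {
        try {
            Class<?> c = Class.forName(className);
            return IOException.class.isAssignableFrom(c)
                    ? "extends IOException"
                    : "does not extend IOException";
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }

    /** Returns the jar or directory the class was loaded from, if the JVM exposes it. */
    static String locationOf(String className) {
        try {
            CodeSource src = Class.forName(className).getProtectionDomain().getCodeSource();
            return (src == null || src.getLocation() == null)
                    ? "unknown"
                    : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // Class name as it appears in the reported stack trace.
        String name = "org.apache.commons.compress.archivers.ArchiveException";
        System.out.println(name + ": " + hierarchyOf(name));
        System.out.println("loaded from: " + locationOf(name));
    }
}
```

Run inside the same classpath as the crawler; if it reports "does not extend IOException", an older commons-compress is winning dependency resolution.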
   
   Commons-compress used to throw an ArchiveException, and Tika used to catch
   that specific exception. In the latest compress, ArchiveException began
   extending IOException. Tika simplified its catch to IOException. If there’s
   an older version of compress on the class path, ArchiveException is not an
   IOException there, so Tika’s catch no longer matches and you’ll get this
   problem.
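
The mechanism can be reproduced without either library. This is a simulation under stated assumptions, not Tika's actual code: `OldArchiveException` and `NewArchiveException` are stand-ins for the two hierarchies, and `detect` and `ArchiveProbe` are hypothetical names for a detector that only catches IOException.

```java
import java.io.IOException;

public class CatchNarrowingSketch {

    /** Stand-in for the older hierarchy: a checked exception that is NOT an IOException. */
    static class OldArchiveException extends Exception {
        OldArchiveException(String msg) { super(msg); }
    }

    /** Stand-in for the newer hierarchy: the same failure, now an IOException. */
    static class NewArchiveException extends IOException {
        NewArchiveException(String msg) { super(msg); }
    }

    /** Hypothetical archive sniffing step that may fail on non-archive input. */
    interface ArchiveProbe {
        void probe() throws Exception;
    }

    static String detect(ArchiveProbe probe) {
        try {
            try {
                probe.probe();
                return "archive";
            } catch (IOException e) {
                // The simplified catch: fine as long as the failure is an IOException.
                return "not an archive";
            }
        } catch (Exception e) {
            // Anything that is not an IOException escapes the inner catch.
            return "detection failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String msg = "No Archiver found for the stream signature";
        System.out.println(detect(() -> { throw new NewArchiveException(msg); }));
        System.out.println(detect(() -> { throw new OldArchiveException(msg); }));
    }
}
```

With the newer hierarchy the failure is absorbed and detection can move on; with the older one the same failure surfaces to the caller, which matches the message in the report.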
   
   If you back off to MimeTypes…detect(), you’ll miss all container detectors
   and only rely on magic bytes. For some use cases, this is sufficient, but
   it might cause surprises and certainly will lead to lower precision
   detection.
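
To illustrate the precision loss, here is a toy magic-byte matcher. It is not how Tika implements detection; `MagicOnlySketch` and its two signatures are chosen purely for illustration. The point is that ZIP-based containers such as .docx or .jar files share the leading bytes of a plain ZIP file, so a check on the leading bytes alone cannot tell them apart.

```java
import java.nio.charset.StandardCharsets;

public class MagicOnlySketch {

    // Local file header signature that begins ZIP files and ZIP-based containers.
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Guesses a type from the leading bytes only, without opening any container. */
    static String detectByMagic(byte[] head) {
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A .docx and a .jar would both start with these four bytes.
        byte[] zipBased = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(detectByMagic(zipBased));
        System.out.println(detectByMagic("%PDF-1.7".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Telling a word-processing document from a generic ZIP requires looking inside the container, which is the work the container detectors do.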
   
   On Sun, Sep 7, 2025 at 10:21 AM Markos Volikas ***@***.***>
   wrote:
   
   > *mvolikas* created an issue (apache/stormcrawler#1650)
   > <https://github.com/apache/stormcrawler/issues/1650>
   > Version
   >
   > main branch
   > Describe what's wrong
   >
   > Tika's detector (used by JSoupParserBolt
   > <https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java#L500>)
   > seems to treat every file as potentially being an archive file and then
   > fails because it actually isn't.
   > Related to https://issues.apache.org/jira/browse/TIKA-4469.
   > Comment by @rzo1 <https://github.com/rzo1> in dev list:
   >
   > In the end, we should not use Tika's Detector but a TikaInputStream
   > instead like that:
   > try (TikaInputStream tis = TikaInputStream.get(data)) {
   >     final Metadata metadata = new Metadata();
   >     metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, file.getFileName());
   >     final MediaType mediaType =
   >         MimeTypes.getDefaultMimeTypes().detect(tis, metadata);
   > }
   >
   > Error message and/or stacktrace
   >
   > Exception while guessing mimetype on https://apache.org/:
   > org.apache.commons.compress.archivers.ArchiveException: No Archiver found
   > for the stream signature
   >
   > How to reproduce
   >
   > Run a crawl with the single seed URL "https://apache.org/".
   > Additional context
   >
   > *No response*
   >
   


