The following change to MimeUtil.java seems to solve my problem:
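// magicType = tika.detect(data);
try {
    // Detect via this.mimeTypes, which appears to be loaded from the
    // tika-mimetypes.xml named in the Nutch configuration.
    InputStream in = new ByteArrayInputStream(data);
    Metadata meta = new Metadata();
    magicType = this.mimeTypes.detect(in, meta).toString();
    LOG.debug("Magic Type for " + url + " is " + magicType);
} catch (Exception e) {
    // Can't complete magic detection; log the cause instead of dropping it
    LOG.debug("Magic detection failed for " + url, e);
}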
However, my confidence that I haven’t broken something else is modest at best.
If this looks like a bug I am happy to create the JIRA entry and submit this as
a patch, but before I do so can you tell me if this looks sensible?
-----Original Message-----
From: Iain Lopata [mailto:[email protected]]
Sent: Tuesday, April 14, 2015 8:43 PM
To: [email protected]
Subject: RE: Mimetype detection for JSON
It seems to me that setting tika-mimetypes.xml in the Nutch configuration
causes MimeUtil.java to use the specified file for initial lookup and for URL
resolution. However, when it comes to magic detection, the tika-mimetypes.xml
file in the Tika jar file seems to be used instead.
If I update the Tika jar with my match rule it works perfectly. If I only place
the updated tika-mimetypes.xml file in my Nutch configuration directory, the
magic detection does not use my match rule.
Can anyone familiar with the Tika implementation tell me if there is a way to
update Nutch's MimeUtil.java to instantiate Tika to use the configuration file
from Nutch? Or would it be better just to update the configuration file in the
Tika jar?
-----Original Message-----
From: Iain Lopata [mailto:[email protected]]
Sent: Tuesday, April 14, 2015 5:32 PM
To: [email protected]
Subject: RE: Mimetype detection for JSON
Thanks Sebastian.
mime.type.magic is true.
I don’t have control over the web server, so I cannot test with
application/javascript.
Time for some deeper debugging it seems. Will update the list with findings.
-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Tuesday, April 14, 2015 4:09 PM
To: [email protected]
Subject: Re: Mimetype detection for JSON
Hi Iain,
> I have copied tika-mimetypes.xml from the tika jar file and installed
> a copy in my configuration directory. I have updated nutch-site.xml
> to point to this file and the log entries indicate that this is being found.
... and the property mime.type.magic is true (default)?
> <mime-type type="application/json">
> <sub-class-of type="application/javascript"/>
Just as a trial: what happens if you make the web server return
"application/javascript" as content type?
> I am still getting the content type detected as text/html and the json
> parser is not being invoked. Any suggestions as to what to look at next?
The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the
following resources to Tika:
- byte stream for magic detection
- URL for additional file name patterns
- content type sent by server
URL and server content type are required as additional hints, e.g., for zip
containers such as .xlsx, etc.
I fear that you have to run a debugger to find out what is going wrong.
I would also first run Tika alone with the modified tika-mimetypes.xml, just to
make sure that the mime magic works as expected.
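To make clear what to expect from that check, here is a minimal sketch of what
an offset-0 string match like the one in the quoted rule tests. It is plain
Java with no Tika dependency, the class and method names (MagicMatchSketch,
matchesAt) are made up for illustration, and it is not how Tika implements
magic matching internally.

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustration only: mimics the check expressed by a rule such as
 * <match value="{" type="string" offset="0"/>. Hypothetical helper, not Tika code.
 */
final class MagicMatchSketch {

    /** Returns true if the bytes of 'value' occur in 'data' exactly at 'offset'. */
    static boolean matchesAt(byte[] data, int offset, String value) {
        byte[] pattern = value.getBytes(StandardCharsets.US_ASCII);
        if (offset < 0 || data.length - offset < pattern.length) {
            return false; // not enough bytes to compare
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "{\"name\": \"value\"}",          // JSON object, brace is the first byte
            "\n{\"name\": \"value\"}",        // JSON object preceded by a newline
            "<html><body>...</body></html>",  // HTML page
            "{ color: red; }"                 // starts with a brace but is not JSON
        };
        for (String s : samples) {
            byte[] body = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(matchesAt(body, 0, "{") + " <- " + s.trim());
        }
    }
}
```

With a rule this broad, any body whose first byte is "{" would match, while a
JSON body preceded by whitespace or a byte-order mark would not, so both cases
are worth including when testing Tika alone.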
Cheers,
Sebastian
On 04/13/2015 04:26 PM, Iain Lopata wrote:
> I have a page that I am fetching that contains JSON and I have a
> plugin for parsing JSON.
>
> The server sets a mimetype of "text/html" and consequently my json
> parser does not get invoked.
>
> If I run parsechecker from the command line and specify -forceAs
> "application/json" the json parser is invoked and works successfully.
>
> So, I believe that if I can get tika to give me "application/json" as
> the detected content type for this page, it should work during a crawl.
>
> I have copied tika-mimetypes.xml from the tika jar file and installed
> a copy in my configuration directory. I have updated nutch-site.xml
> to point to this file and the log entries indicate that this is being found.
>
> In my copy of tika-mimetypes.xml I have added the match rule shown
> below
>
> <mime-type type="application/json">
>   <sub-class-of type="application/javascript"/>
>   <magic priority="100">
>     <match value="{" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.json"/>
> </mime-type>
>
> I know that my match is much too broad, but I am using this just while
> trying to resolve this problem.
>
> I have also set lang.extraction.policy to identify in nutch-site.xml
> (again primarily for testing purposes).
>
> I am still getting the content type detected as text/html and the json
> parser is not being invoked. Any suggestions as to what to look at next?
>
> Thanks!
>
> Iain