Thanks Sebastian. mime.type.magic is true.
I don’t have control over the web server, so I cannot test with application/javascript. Time for some deeper debugging, it seems. Will update the list with findings.

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Tuesday, April 14, 2015 4:09 PM
To: [email protected]
Subject: Re: Mimetype detection for JSON

Hi Iain,

> I have copied tika-mimetypes.xml from the tika jar file and installed
> a copy in my configuration directory. I have updated nutch-site.xml
> to point to this file and the log entries indicate that this is being found.

... and the property mime.type.magic is true (default)?

> <mime-type type="application/json">
> <sub-class-of type="application/javascript"/>

Just as a trial: What happens if you make the web server return
"application/javascript" as content type?

> I am still getting the content type detected as text/html and the json
> parser is not being invoked. Any suggestions as to what to look at next?

The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes
the following resources to Tika:
- byte stream for magic detection
- URL for additional file name patterns
- content type sent by server

URL and server content type are required as additional hints,
e.g., for zip containers such as .xlsx, etc.

I fear that you have to run a debugger to find out what is going wrong.
I would also first run Tika alone with the modified tika-mimetypes.xml,
just to make sure that the mime magic works as expected.

Cheers,
Sebastian

On 04/13/2015 04:26 PM, Iain Lopata wrote:
> I have a page that I am fetching that contains JSON and I have a
> plugin for parsing JSON.
>
> The server sets a mimetype of "text/html" and consequently my json
> parser does not get invoked.
>
> If I run parsechecker from the command line and specify -forceAs
> "application/json" the json parser is invoked and works successfully.
>
> So, I believe that if I can get tika to give me "application/json" as
> the detected content type for this page, it should work during a crawl.
>
> I have copied tika-mimetypes.xml from the tika jar file and installed
> a copy in my configuration directory. I have updated nutch-site.xml
> to point to this file and the log entries indicate that this is being found.
>
> In my copy of tika-mimetypes.xml I have added the match rule shown
> below
>
> <mime-type type="application/json">
>   <sub-class-of type="application/javascript"/>
>   <magic priority="100">
>     <match value="{" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.json"/>
> </mime-type>
>
> I know that my match is much too broad, but I am using this just while
> trying to resolve this problem.
>
> I have also set lang.extraction.policy to identify in nutch-site.xml
> (again primarily for testing purposes).
>
> I am still getting the content type detected as text/html and the json
> parser is not being invoked. Any suggestions as to what to look at next?
>
> Thanks!
>
> Iain
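
One thing that may be worth checking while debugging: a string match with offset="0" only fires if the fetched bytes start with "{" at the very first byte. The snippet below is a rough plain-Python sketch of that literal reading of the rule, not Tika's implementation. The function name matches_magic and the sample bodies are made up for illustration. It only shows how a leading newline or a UTF-8 byte order mark in front of the JSON would defeat such a match, which may or may not be what is happening here.

```python
def matches_magic(content: bytes, value: bytes = b"{", offset: int = 0) -> bool:
    """Return True if `value` occurs in `content` exactly at `offset`.

    Hypothetical helper: mimics a literal reading of
    <match value="{" type="string" offset="0"/>, nothing more.
    """
    return content[offset:offset + len(value)] == value


# Made-up response bodies, for illustration only.
samples = {
    "plain JSON": b'{"a": 1}',
    "leading newline": b'\n{"a": 1}',
    "UTF-8 BOM first": b'\xef\xbb\xbf{"a": 1}',
    "HTML page": b"<html><body>{}</body></html>",
}

for label, body in samples.items():
    print(f"{label}: {matches_magic(body)}")
```

Dumping the first few bytes of the fetched content and comparing them against the rule in this way is a cheap check to run before attaching a debugger.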

