Thanks Sebastian.

mime.type.magic is true.

I don’t have control over the web server, so I cannot test with
application/javascript.

Time for some deeper debugging, it seems.  I will update the list with my findings.

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]] 
Sent: Tuesday, April 14, 2015 4:09 PM
To: [email protected]
Subject: Re: Mimetype detection for JSON

Hi Iain,

> I have copied tika-mimetypes.xml from the tika jar file and installed 
> a copy in my configuration directory.  I have updated nutch-site.xml 
> to point to this file and the log entries indicate that this is being found.

... and the property mime.type.magic is true (default)?


> <mime-type type="application/json">
>           <sub-class-of type="application/javascript"/>

Just as a trial: What happens if you make the web server return
"application/javascript" as the content type?


> I am still getting the content type detected as text/html and the json 
> parser is not being invoked.  Any suggestions as to what to look at next?

The mime magic is done by Tika. Nutch (o.a.n.util.MimeUtil) passes the 
following resources to Tika:
- byte stream for magic detection
- URL for additional file name patterns
- content type sent by server
The URL and the server content type are needed as additional hints, e.g., for
zip containers such as .xlsx.
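
To make the three inputs listed above concrete, here is a minimal sketch in plain Python. It is not Tika's or Nutch's actual code: the rule table, the function name detect, and the fixed order (magic bytes first, then file name pattern, then server type) are assumptions chosen only to illustrate how the hints could interact.

```python
import fnmatch
from urllib.parse import urlparse

# Hypothetical rule table for illustration only:
# (mime type, magic bytes expected at offset 0, file name glob).
RULES = [
    ("application/json", b"{", "*.json"),
    ("text/html", b"<html", "*.html"),
]


def detect(data, url, server_type):
    """Pick a content type from the byte stream, the URL and the server hint.

    Assumed order (a simplification, not Tika's real resolution logic):
    magic bytes, then file name pattern, then the server-supplied type.
    """
    # 1. Magic: compare the leading bytes of the content.
    for mime, magic, _ in RULES:
        if data.startswith(magic):
            return mime

    # 2. File name pattern, taken from the last segment of the URL path.
    name = urlparse(url).path.rsplit("/", 1)[-1]
    for mime, _, pattern in RULES:
        if name and fnmatch.fnmatch(name, pattern):
            return mime

    # 3. Fall back to whatever the server claimed.
    return server_type or "application/octet-stream"


if __name__ == "__main__":
    # Server says text/html, but the content starts with "{".
    print(detect(b'{"a": 1}', "http://example.com/data", "text/html"))
```

In this simplified model a page served as text/html but starting with "{" resolves to application/json, which is the behaviour Iain is hoping for. If the real pipeline does not behave that way, the difference is the thing to look for in the debugger.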

I fear that you will have to run a debugger to find out what is going wrong.
I would also first run Tika on its own with the modified tika-mimetypes.xml,
just to make sure that the mime magic works as expected.
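
Before involving Tika at all, the rule itself can be sanity-checked against the first bytes of the fetched page. The sketch below is plain Python and does not use Tika: the function name matching_type is hypothetical, and it reads the rule in a simplified way (only type="string" matches with a single integer offset), so it only shows whether the bytes line up with the rule as written.

```python
import xml.etree.ElementTree as ET

# The rule quoted in this thread, copied into a string for the check.
RULE_XML = """
<mime-type type="application/json">
  <sub-class-of type="application/javascript"/>
  <magic priority="100">
    <match value="{" type="string" offset="0"/>
  </magic>
  <glob pattern="*.json"/>
</mime-type>
"""


def matching_type(rule_xml, data):
    """Return the rule's mime type if a string match applies to data, else None.

    Simplified reading of the rule: offset ranges, masks and nested
    matches are not handled here.
    """
    root = ET.fromstring(rule_xml)
    for match in root.iter("match"):
        if match.get("type") != "string":
            continue  # only plain string matches are covered by this sketch
        offset_text = match.get("offset", "0")
        if not offset_text.isdigit():
            continue  # skip offset ranges such as "0:8"
        offset = int(offset_text)
        value = match.get("value", "").encode("utf-8")
        if value and data[offset:offset + len(value)] == value:
            return root.get("type")
    return None


if __name__ == "__main__":
    print(matching_type(RULE_XML, b'{"k": 1}'))
    print(matching_type(RULE_XML, b'\n{"k": 1}'))
```

Under this reading, content that starts with a newline, a space, or a byte order mark before the "{" does not match a rule with offset="0". Inspecting the first bytes of the fetched page is therefore a cheap check to make before stepping through the code.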

Cheers,
Sebastian

On 04/13/2015 04:26 PM, Iain Lopata wrote:
> I have a page that I am fetching that contains JSON and I have a 
> plugin for parsing JSON.
> 
> The server sets a mimetype of "text/html" and consequently my json 
> parser does not get invoked.
> 
> If I run parsechecker from the command line and specify -forceAs 
> "application/json" the json parser is invoked and works successfully.
> 
> So, I believe that if I can get tika to give me "application/json" as 
> the detected content type for this page, it should work during a crawl.
> 
> I have copied tika-mimetypes.xml from the tika jar file and installed 
> a copy in my configuration directory.  I have updated nutch-site.xml 
> to point to this file and the log entries indicate that this is being found.
> 
> In my copy of tika-mimetypes.xml I have added the match rule shown 
> below
> 
> <mime-type type="application/json">
>   <sub-class-of type="application/javascript"/>
>   <magic priority="100">
>     <match value="{" type="string" offset="0"/>
>   </magic>
>   <glob pattern="*.json"/>
> </mime-type>
> 
> I know that my match is much too broad, but I am using this just while 
> trying to resolve this problem.
> 
> I have also set lang.extraction.policy to identify in nutch-site.xml 
> (again primarily for testing purposes).
> 
> I am still getting the content type detected as text/html and the json 
> parser is not being invoked.  Any suggestions as to what to look at next?
> 
> Thanks!
> 
> Iain

