I am fetching a page that contains JSON, and I have a plugin for parsing
JSON.

The server sets a MIME type of "text/html", and consequently my JSON parser
is not invoked.

If I run parsechecker from the command line and specify -forceAs
"application/json", the JSON parser is invoked and works successfully.

So I believe that if I can get Tika to report "application/json" as the
detected content type for this page, parsing should work during a crawl.
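
As a sketch of the mapping this relies on: parse-plugins.xml associates a
detected MIME type with a parser plugin, so an entry along these lines is
what the detected type would need to reach. The plugin id "parse-json" is a
hypothetical placeholder for my own plugin, not a stock Nutch plugin.

```xml
<!-- parse-plugins.xml (sketch): route application/json to the JSON parser. -->
<!-- "parse-json" is a hypothetical plugin id used only for illustration. -->
<mimeType name="application/json">
  <plugin id="parse-json"/>
</mimeType>
```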

I have copied tika-mimetypes.xml from the Tika jar file and installed a copy
in my configuration directory. I have updated nutch-site.xml to point to
this file, and the log entries indicate that it is being found.
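
A minimal sketch of such an override, not a paste of the actual file. It
assumes the property names mime.types.file and mime.type.magic as they
appear in nutch-default.xml, and it assumes the edited copy keeps the name
tika-mimetypes.xml.

```xml
<!-- nutch-site.xml (sketch): override the MIME definitions used by Tika. -->
<property>
  <name>mime.types.file</name>
  <!-- Assumed file name of the edited copy in the configuration directory. -->
  <value>tika-mimetypes.xml</value>
</property>
<property>
  <name>mime.type.magic</name>
  <!-- Assumption: magic detection must be enabled for magic rules to apply. -->
  <value>true</value>
</property>
```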

In my copy of tika-mimetypes.xml I have added the match rule shown below:

<mime-type type="application/json">
  <sub-class-of type="application/javascript"/>
  <magic priority="100">
    <match value="{" type="string" offset="0"/>
  </magic>
  <glob pattern="*.json"/>
</mime-type>

I know that my match is much too broad, but I am using it only while
trying to resolve this problem.

I have also set lang.extraction.policy to "identify" in nutch-site.xml (again
primarily for testing purposes).

The content type is still detected as text/html, and the JSON parser is
not being invoked. Any suggestions as to what to look at next?

Thanks!

Iain
