For those of you wishing to use Nutch 1.9 and its Tika plugin to parse JSON:
you need to change TikaParser.java, as the current Tika implementation
won't return a TXTParser for the 'application/json' MIME type.
So either change the mimeType in TikaParser to 'text/plain' (the type Tika's
TXTParser handles) before calling:
    Parser parser = tikaConfig.getParser(MediaType.parse(mimeType));
or, for JSON only, override the parser right after that call:
    if (mimeType.equalsIgnoreCase("application/json")) {
        parser = new TXTParser();
    }
Rebuild the plugin and you're done.
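As a rough illustration of the first option, here is a minimal standalone sketch of the MIME type remapping. The class name JsonMimeFallback and the helper effectiveMimeType are hypothetical, not part of Nutch or Tika; the assumption is that its result would be passed to tikaConfig.getParser(MediaType.parse(...)) in place of the raw mimeType.

```java
public class JsonMimeFallback {

    // Hypothetical helper (not a Nutch or Tika API): returns the MIME type
    // to hand to the Tika parser lookup. JSON is remapped to text/plain so
    // that a plain-text parser is selected; everything else is unchanged.
    static String effectiveMimeType(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        // Ignore parameters such as "; charset=utf-8" when comparing.
        int semi = mimeType.indexOf(';');
        String base = (semi >= 0 ? mimeType.substring(0, semi) : mimeType).trim();
        if (base.equalsIgnoreCase("application/json")) {
            return "text/plain";
        }
        return mimeType;
    }

    public static void main(String[] args) {
        // Print the mapping for a JSON type and for an unrelated type.
        System.out.println(effectiveMimeType("application/json"));
        System.out.println(effectiveMimeType("text/html"));
    }
}
```

The sketch only covers the string mapping; whether JSON should be treated as plain text for indexing is a choice that depends on what you want to extract from it.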
Iqbal Shaikh
________________________________________
Sent: 03 September 2014 16:36
To: [email protected]
Subject: Parsing Json
Hi All,
I am using Nutch 1.9 (latest) and trying to parse a JSON input source, but I am getting:
Error parsing: http://www.jsonip.com: failed(2,0): Can't retrieve Tika parser
for mime-type application/json
Now my plugin property is:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|indexer-elastic|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to include.
</description>
</property>
And my regex filter:
# accept anything else
+.
I thought simply specifying Tika as one of the parsers would do the job, as Tika
recognises the JSON MIME type.
Thanks in advance.
Iqbal Shaikh