It seems the patch does not support HTML5 tags; the assert statements in the
given patch are failing because of that.
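
For anyone following the thread, here is a rough sketch of the kind of SAX content handler the quoted suggestion describes. It is not the TIKA-980 patch: the class name ItempropCollector and its methods are hypothetical, it uses only the JDK's SAX classes, and it expects well-formed XHTML input. It keys on the itemprop attribute rather than on tag names, and it flattens nested items instead of modelling them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not from TIKA-980): collects itemprop values from SAX events. */
class ItempropCollector extends DefaultHandler {
    private final Map<String, List<String>> props = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentProp; // itemprop whose text is being collected, or null
    private int depth;          // child-element nesting inside that itemprop element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (currentProp != null) { // already inside an itemprop element: just track nesting
            depth++;
            return;
        }
        String prop = atts.getValue("itemprop");
        if (prop == null) {
            return;
        }
        String content = atts.getValue("content");
        if (content != null) {     // e.g. <meta itemprop="price" content="0"/>
            add(prop, content);
            return;
        }
        currentProp = prop;        // otherwise the value is the element's text
        depth = 0;
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentProp != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentProp == null) {
            return;
        }
        if (depth > 0) {
            depth--;
            return;
        }
        add(currentProp, text.toString().trim());
        currentProp = null;
    }

    private void add(String key, String value) {
        props.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    Map<String, List<String>> getProperties() {
        return props;
    }

    /** Convenience wrapper for trying the handler on a well-formed XHTML string. */
    static Map<String, List<String>> extract(String xhtml) {
        try {
            ItempropCollector collector = new ItempropCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.getProperties();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

In parse-tika, a handler like this would presumably be passed to Tika next to the existing handler via TeeContentHandler, and the resulting map copied into a Writable (Nutch's Metadata class is one candidate) before it is stored. That wiring is an assumption on my part and is not shown here.
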
Thanks

> On Mar 18, 2016, at 3:16 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> 
> Hello! Nutch doesn't have a mechanism to extract microdata from HTML. But 
> there is a patch for Apache Tika that comes as a content handler, TIKA-980. 
> You can embed it into another content handler or use Tika's TeeContentHandler 
> in Nutch's parse-tika plugin. The downside is that you have to transform the 
> output data structure to a Writable in the plugin, otherwise you cannot store 
> it as metadata and run on Hadoop.
> 
> https://issues.apache.org/jira/browse/TIKA-980
> 
> Markus
> 
> 
> 
> -----Original message-----
>> From:Manish Verma <m_ve...@apple.com>
>> Sent: Thursday 17th March 2016 19:18
>> To: user@nutch.apache.org
>> Subject: Extract Microdata
>> 
>> Hi,
>> 
>> I need to crawl URLs, extract microdata, and save it to Solr. Does Nutch 
>> support extraction of schema.org microdata?
>> 
>> Thanks
>> 
>> 
>> 
