NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

David Ferrero Thu, 08 Feb 2018 09:20:16 -0800

Pull request #205 was recently merged into the master branch of Nutch 1.x in
fulfillment of NUTCH-1129, "microdata for Nutch 1.x".


I am new to Nutch and Solr and have just started crawling and indexing a few
select websites. Using the built-in HTML parsing/indexing, I am getting
searchable fields like url, content, host, sometimes a title, and a few other
indexing-related fields like digest, boost, segment, and tstamp. That said, I
realized very quickly that I need better results. While exploring the source of
one of the websites, I noticed references to schema.org and got excited by what
I saw. That’s how I stumbled upon NUTCH-1129.

I’ve built apache-nutch-1.15-SNAPSHOT, which includes the Any23 parser/indexer.

Q: Now what? How do I enable the Any23 microdata parsing/indexing capabilities
introduced by NUTCH-1129?
Q: Do I replace parse-(html|tika)|index-(basic|anchor) in plugin.includes
with something like parse-(html|tika|any23)|index-(basic|anchor|any23)?
Q: How do I expose the discovered microdata structure/items to a downstream
system such as Solr? For example, what are the microdata items, and do I need
to map them to Solr fields in solrindex-mapping.xml?

I’d also be interested to learn how to point at a specific URL and see how
Nutch sees the microdata (best case), then learn how to carry this through
Nutch and finally into Solr.
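
As an illustration of what "microdata items" means here, the following is a
minimal Python sketch that lists the schema.org item types and itemprop values
found in a page. It is independent of Nutch and Any23 and does not show how
either of them extracts or names fields. The sample markup, the class name
MicrodataSketch, and the function extract are all hypothetical, and the sketch
ignores nested items and itemref.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the markup is illustrative only and is not taken
# from any crawled site.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <meta itemprop="sku" content="W-100">
  <span itemprop="brand">Acme</span>
</div>
"""


class MicrodataSketch(HTMLParser):
    """Collects item types and flat (itemprop, value) pairs.

    Simplification: nested itemscope elements and itemref are not handled.
    """

    def __init__(self):
        super().__init__()
        self.itemtypes = []
        self.props = []
        self._pending = None  # itemprop name still waiting for its text

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "itemtype" in attr_map:
            self.itemtypes.append(attr_map["itemtype"])
        if "itemprop" in attr_map:
            if "content" in attr_map:
                # The value is carried in the content attribute (e.g. <meta>).
                self.props.append((attr_map["itemprop"], attr_map["content"]))
            else:
                # The value is the element's text, delivered to handle_data.
                self._pending = attr_map["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            self.props.append((self._pending, text))
            self._pending = None


def extract(html):
    """Return (item types, property pairs) found in the given HTML string."""
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.itemtypes, parser.props


if __name__ == "__main__":
    types, props = extract(SAMPLE_HTML)
    print(types)
    for name, value in props:
        print(name, "=", value)
```

Each (itemprop, value) pair is the kind of structure that would need a
corresponding field on the Solr side, whatever names the Any23 plugin actually
assigns.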

Thanks for any guidance.
-David
