Re: Crawling with nutch and mapping fields to solr
Hi,

This question is better suited to the Nutch mailing list, but let me give you a couple of pointers. If it's only metadata you need, you can use the patch mentioned below. If you want more flexibility with your data, you can look at writing your own parser plugin; here is a good place to start: http://wiki.apache.org/nutch/WritingPluginExample-0.9

xpath + htmlcleaner + beanshell would be a good set of tools for your custom parser.

regards,
Ram

On Thu, Nov 11, 2010 at 9:21 PM, Jean-Luc wrote:
> I'm going down the route of patching nutch so I can use this ParseMetaTags
> plugin:
> https://issues.apache.org/jira/browse/NUTCH-809
>
> Also wondering whether I will be able to use the XMLParser to allow me to
> parse well formed XHTML, using xpath would be bonus:
> https://issues.apache.org/jira/browse/NUTCH-185
>
> Any thoughts appreciated...
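To make the "metadata to custom fields" idea concrete, here is a minimal sketch of the extraction-and-mapping step on its own. It is written in Python with only the standard library and is not Nutch plugin code (Nutch plugins are written in Java). The Solr field names (`meta_description` and so on), the `META_TO_SOLR_FIELD` table, and the helper names are hypothetical and would need to match whatever is declared in your own schema.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML <meta name="..."> values to Solr field names.
# These field names are assumptions; they must exist in your own Solr schema.
META_TO_SOLR_FIELD = {
    "description": "meta_description",
    "keywords": "meta_keywords",
    "author": "meta_author",
}


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            # Meta names are compared case-insensitively.
            self.meta[name.lower()] = content


def extract_solr_fields(html):
    """Return a dict of Solr field name -> value for the mapped meta tags."""
    collector = MetaTagCollector()
    collector.feed(html)
    return {
        solr_field: collector.meta[meta_name]
        for meta_name, solr_field in META_TO_SOLR_FIELD.items()
        if meta_name in collector.meta
    }
```

Meta tags that are not listed in the mapping (for example `robots`) are simply ignored, so the resulting document only carries fields you have chosen to index.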
Re: Crawling with nutch and mapping fields to solr
I'm going down the route of patching Nutch so I can use this ParseMetaTags plugin:
https://issues.apache.org/jira/browse/NUTCH-809

I'm also wondering whether I will be able to use the XMLParser to parse well-formed XHTML; being able to use XPath would be a bonus:
https://issues.apache.org/jira/browse/NUTCH-185

Any thoughts appreciated...

--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-with-nutch-and-mapping-fields-to-solr-tp1879060p1883295.html
Sent from the Solr - User mailing list archive at Nabble.com.
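As an illustration of what path-based extraction from well-formed XHTML can look like, here is a small standalone sketch in Python. It does not use the XMLParser patch from NUTCH-185 or any Nutch API; it relies on the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The function name `extract_fields` and the shape of the returned dict are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML elements live in this namespace, so path expressions
# need a prefix bound to it ("x" is an arbitrary choice for this sketch).
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_fields(xhtml):
    """Pull the title and named meta tags out of a well-formed XHTML string."""
    # fromstring raises ParseError if the input is not well-formed XML.
    root = ET.fromstring(xhtml)

    # Path is relative to the root <html> element.
    title = root.findtext("x:head/x:title", default="", namespaces=XHTML_NS)

    # [@name] keeps only <meta> elements that carry a name attribute.
    meta = {
        el.get("name"): el.get("content", "")
        for el in root.findall(".//x:meta[@name]", XHTML_NS)
    }
    return {"title": title.strip(), "meta": meta}
```

This only works when the page really is well-formed; ordinary tag-soup HTML would first need cleaning, which is the kind of job a tool such as htmlcleaner is meant for.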
Crawling with nutch and mapping fields to solr
Hi,

I'm fairly new to Solr, but I have it configured, along with Nutch, as per this tutorial: http://ubuntuforums.org/showthread.php?p=9596257

Nutch is crawling and injecting documents into Solr as expected. However, I want to break the data down further so that what ends up in Solr is a bit more granular. Can anyone explain in simple terms how I might go about parsing the data I get from Nutch and mapping it to custom fields? Ideally, I'd like to be able to pull out metadata from the source HTML and map it to specific fields in Solr.

I hope I'm in the right place to ask this question. Any help would be much appreciated.

Jean-Luc