Julien,

For what it's worth (and for anyone out there who may be interested in the code), I created a custom parse-feed plugin based on the feed plugin (i.e., I didn't directly "hack" the feed plugin), because I needed to get extra information from the feed item XML (specifically Geo data, which I got by including the Rome module that extracts it).
So the parse-feed parser:
o Captures the relevant XML elements, via Rome, and
o Places those element values into a Metadata object, and
o Places that Metadata object into the Outlink for each item.

The parse-feed indexer:
o Attempts to locate the Metadata in the CrawlDatum, and, if found,
o Populates the NutchDocument with fields that correspond to the Metadata entries.

Thanks again.

Rich

From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Friday, August 09, 2013 4:14 AM
To: dev@nutch.apache.org
Subject: Re: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

Hi Rich,

Glad you got it to work. You get the metadata in the crawldatum indeed, as if they were passed via the injection. From there you can use the urlmeta + index-metadata plugins.

It would be worth checking whether Tika passes on the metadata, in which case you could have a HTMLParseFilter pull the stuff with XPath and then add the metadata to the outlinks. It would be a bit neater, as you wouldn't need to hack the feed plugin at all.

Thanks for sharing your experience

Julien

On 8 August 2013 22:33, Richard Bergmann <rbergm...@colsa.com> wrote:

Julien,

No need to reply -- I "guessed" properly. The metadata that I am stuffing into the outlinks is, indeed, coming back to me in the CrawlDatum, so I am now successfully building my index with the crawled/linked page content and the RSS feed item info (from metadata).

Of course this required your patch (NUTCH-1622).

Thank you!

Rich Bergmann

-----Original Message-----
From: Richard Bergmann [mailto:rbergm...@colsa.com]
Sent: Thursday, August 08, 2013 12:58 PM
To: dev@nutch.apache.org
Subject: RE: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

Julien,

I am trying to save myself a bit of time here by asking you this question (and making all subscribers listen!) before digging into the code:

Based on this patch (which I have applied), where will the metadata show up when it gets to my IndexingFilter extension?
CrawlDatum.getMetaData()? Somewhere else? Do I have to modify an HTML parser to ensure the metadata gets to my IndexingFilter?

With the current "feed" Parser and IndexingFilter, the metadata I am interested in is stuffed into the parse metadata: Parse.getData().getParseMeta().

Thank you!

Rich Bergmann

-----Original Message-----
From: Julien Nioche (JIRA) [mailto:j...@apache.org]
Sent: Thursday, August 08, 2013 11:07 AM
To: dev@nutch.apache.org
Subject: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

[ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1622:
---------------------------------
    Attachment: NUTCH-1622.patch

> Create Outlinks with metadata
> -----------------------------
>
>                 Key: NUTCH-1622
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1622
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.7, 2.2.1
>            Reporter: Julien Nioche
>         Attachments: NUTCH-1622.patch
>
>
> Having the possibility to specify metadata when creating an outlink is
> extremely useful as it allows information to be passed from a source page
> to the pages it links to. We use that routinely within our custom parsers
> in combination with the url-meta plugin.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
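
As a rough illustration of the flow described at the top of this thread (the parser attaches per-item metadata to each outlink, the metadata travels with the linked URL, and the indexer turns each entry into a document field), here is a minimal sketch. It deliberately uses plain Java collections as stand-ins and does not call the Nutch API: Link, parseFeedItem, toCrawlMetadata and toIndexFields are hypothetical names, and the coordinate values are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OutlinkMetadataSketch {

    // Hypothetical stand-in for an outlink: a target URL plus the
    // metadata the parser attached to it. Not the Nutch Outlink class.
    static final class Link {
        final String toUrl;
        final Map<String, String> metadata;

        Link(String toUrl, Map<String, String> metadata) {
            this.toUrl = toUrl;
            this.metadata = new LinkedHashMap<>(metadata); // defensive copy
        }
    }

    // Parser side: put the values extracted from one feed item
    // into the link created for that item.
    static Link parseFeedItem(String itemUrl, Map<String, String> itemFields) {
        return new Link(itemUrl, itemFields);
    }

    // Crawl side: the metadata on the link is carried over to the
    // per-URL record of the linked page (modelled here as a plain map).
    static Map<String, String> toCrawlMetadata(Link link) {
        return new LinkedHashMap<>(link.metadata);
    }

    // Indexer side: if metadata is present, add one field per entry
    // next to the page content; otherwise index the content alone.
    static Map<String, String> toIndexFields(String content,
                                             Map<String, String> crawlMetadata) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("content", content);
        if (crawlMetadata != null) {
            doc.putAll(crawlMetadata);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> item = new LinkedHashMap<>();
        item.put("geo.lat", "34.73");   // made-up example values
        item.put("geo.long", "-86.59");

        Link link = parseFeedItem("http://example.com/item1", item);
        Map<String, String> doc =
                toIndexFields("page text", toCrawlMetadata(link));
        System.out.println(doc);
    }
}
```

In a real plugin the three steps would live in a Parser, the crawl db update, and an IndexingFilter respectively; the null check in toIndexFields mirrors the "if found" step, so pages reached without feed metadata are still indexed.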