Hi Rich,

Glad you got it to work. The metadata does indeed end up in the CrawlDatum, just as if it had been passed in via the injection. From there you can use the urlmeta + index-metadata plugins.

It would be worth checking whether Tika passes the metadata on. If it does, you could use an HtmlParseFilter to pull the values out with XPath and then add them to the outlinks as metadata. That would be a bit neater, as you wouldn't need to hack the feed plugin at all.

Thanks for sharing your experience.

Julien

On 8 August 2013 22:33, Richard Bergmann <rbergm...@colsa.com> wrote:

> Julien,
>
> No need to reply -- I "guessed" properly. The metadata that I am stuffing
> into the outlinks is, indeed, coming back to me in the CrawlDatum, so I am
> now successfully building my index with the crawled/linked page content and
> the RSS feed item info (from metadata).
>
> Of course this required your patch (NUTCH-1622). Thank you!
>
> Rich Bergmann
>
> -----Original Message-----
> From: Richard Bergmann [mailto:rbergm...@colsa.com]
> Sent: Thursday, August 08, 2013 12:58 PM
> To: dev@nutch.apache.org
> Subject: RE: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata
>
> Julien,
>
> I am trying to save myself a bit of time here by asking you this question
> (and making all subscribers listen!) before digging into the code:
>
> Based on this patch (which I have applied), where will the metadata show
> up when it gets to my IndexingFilter extension? CrawlDatum.getMetaData()?
> Somewhere else? Do I have to modify an Html parser to ensure the metadata
> gets to my IndexingFilter?
>
> With the current "feed" Parser and IndexingFilter the metadata I am
> interested in is stuffed into the parse metadata:
> Parse.getData().getParseMeta().
>
> Thank you!
>
> Rich Bergmann
>
> -----Original Message-----
> From: Julien Nioche (JIRA) [mailto:j...@apache.org]
> Sent: Thursday, August 08, 2013 11:07 AM
> To: dev@nutch.apache.org
> Subject: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata
>
> [ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Julien Nioche updated NUTCH-1622:
> ---------------------------------
>
>     Attachment: NUTCH-1622.patch
>
> > Create Outlinks with metadata
> > -----------------------------
> >
> >                 Key: NUTCH-1622
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-1622
> >             Project: Nutch
> >          Issue Type: New Feature
> >          Components: parser
> >    Affects Versions: 1.7, 2.2.1
> >            Reporter: Julien Nioche
> >         Attachments: NUTCH-1622.patch
> >
> >
> > Having the possibility to specify metadata when creating an outlink is
> > extremely useful as it allows to pass information from a source page to
> > the pages it links to. We use that routinely within our custom parsers
> > in combination with the url-meta plugin.

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble