Hi Rich,

Glad you got it to work. Indeed, the metadata ends up in the CrawlDatum,
just as if it had been passed in at injection time. From there you can use
the urlmeta and index-metadata plugins.
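
To make the indexing side concrete, here is a rough standalone sketch in
plain Java. It is only a model and does not use the Nutch API: the
"datumMeta" map stands in for the CrawlDatum metadata, "doc" for the indexed
document, and the key names (feed.title, feed.author) are made up for the
example. The idea is simply to copy a configured list of metadata keys into
the document and skip whatever is missing.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model only: none of these are Nutch classes.
final class IndexMetadataSketch {

    // Returns a copy of doc with the requested keys added from datumMeta.
    // Keys that the datum does not carry are skipped.
    static Map<String, String> indexFields(Map<String, String> datumMeta,
                                           List<String> keysToIndex,
                                           Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        for (String key : keysToIndex) {
            String value = datumMeta.get(key);
            if (value != null) {
                out.put(key, value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical metadata that arrived with the outlink.
        Map<String, String> datumMeta = new LinkedHashMap<>();
        datumMeta.put("feed.title", "Example item");
        datumMeta.put("internal.flag", "x"); // not in the list, so not indexed

        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("content", "page text");

        List<String> keys = Arrays.asList("feed.title", "feed.author");
        System.out.println(indexFields(datumMeta, keys, doc));
    }
}
```

In a real setup the list of keys would come from the plugin configuration
rather than being hard-coded.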

It would be worth checking whether Tika passes on the metadata. If it does,
you could have an HtmlParseFilter pull the values out with XPath and then
add them as metadata to the outlinks. That would be a bit neater, as you
wouldn't need to hack the feed plugin at all.
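
As a rough illustration of the XPath part, here is a standalone sketch
using only the JDK. It is not a Nutch parse filter: LinkWithMeta is a
made-up stand-in for an outlink that carries metadata, the feed.title key is
hypothetical, and the input is assumed to be a small well-formed RSS-like
snippet rather than whatever DOM Tika actually produces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class XPathOutlinkSketch {

    // Hypothetical stand-in for an outlink with metadata attached.
    static final class LinkWithMeta {
        final String url;
        final Map<String, String> metadata = new LinkedHashMap<>();

        LinkWithMeta(String url) {
            this.url = url;
        }

        @Override
        public String toString() {
            return url + " " + metadata;
        }
    }

    // For each <item>, read <link> as the target URL and <title> as metadata.
    // Items without a link are skipped.
    static List<LinkWithMeta> extract(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate("//item", dom, XPathConstants.NODESET);

            List<LinkWithMeta> links = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                String url = xpath.evaluate("link", item).trim();
                if (url.isEmpty()) {
                    continue;
                }
                LinkWithMeta link = new LinkWithMeta(url);
                String title = xpath.evaluate("title", item).trim();
                if (!title.isEmpty()) {
                    link.metadata.put("feed.title", title);
                }
                links.add(link);
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss><channel>"
            + "<item><title>First</title><link>http://example.com/1</link></item>"
            + "<item><title>No link here</title></item>"
            + "</channel></rss>";
        System.out.println(extract(xml));
    }
}
```

With the NUTCH-1622 patch applied, the real filter would put those values on
the outlinks it creates instead of on a helper class like this one.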

Thanks for sharing your experience.

Julien




On 8 August 2013 22:33, Richard Bergmann <rbergm...@colsa.com> wrote:

> Julien,
>
> No need to reply -- I "guessed" properly.  The metadata that I am stuffing
> into the outlinks is, indeed, coming back to me in the CrawlDatum, so I am
> now successfully building my index with the crawled/linked page content and
> the RSS feed item info (from metadata).
>
> Of course this required your patch (NUTCH-1622).  Thank you!
>
> Rich Bergmann
>
> -----Original Message-----
> From: Richard Bergmann [mailto:rbergm...@colsa.com]
> Sent: Thursday, August 08, 2013 12:58 PM
> To: dev@nutch.apache.org
> Subject: RE: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata
>
> Julien,
>
> I am trying to save myself a bit of time here by asking you this question
> (and making all subscribers listen!) before digging into the code:
>
> Based on this patch (which I have applied), where will the metadata show
> up when it gets to my IndexingFilter extension?  CrawlDatum.getMetaData()?
> Somewhere else?  Do I have to modify an HTML parser to ensure the metadata
> gets to my IndexingFilter?
>
> With the current "feed" Parser and IndexingFilter the metadata I am
> interested in is stuffed into the parse metadata:
> Parse.getData().getParseMeta().
>
> Thank you!
>
> Rich Bergmann
>
> -----Original Message-----
> From: Julien Nioche (JIRA) [mailto:j...@apache.org]
> Sent: Thursday, August 08, 2013 11:07 AM
> To: dev@nutch.apache.org
> Subject: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata
>
>
> [https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Julien Nioche updated NUTCH-1622:
> ---------------------------------
>
>     Attachment: NUTCH-1622.patch
>
> > Create Outlinks with metadata
> > -----------------------------
> >
> >                 Key: NUTCH-1622
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-1622
> >             Project: Nutch
> >          Issue Type: New Feature
> >          Components: parser
> >    Affects Versions: 1.7, 2.2.1
> >            Reporter: Julien Nioche
> >         Attachments: NUTCH-1622.patch
> >
> >
> > Having the possibility to specify metadata when creating an outlink is
> > extremely useful, as it allows information to be passed from a source
> > page to the pages it links to. We use that routinely within our custom
> > parsers in combination with the url-meta plugin.
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
