Julien,

For what it's worth (and for anyone out there who may be interested in the 
code), I created a custom parse-feed plugin based on the feed plugin (i.e., I 
didn't directly "hack" the feed plugin), because I needed to get extra 
information from the feed item XML (specifically Geo data, which I got by 
including the Rome module that extracts it).

So the parse-feed parser:

  o  Captures the relevant XML elements via Rome, and

  o  Places those element values into a Metadata object, and,

  o  Places that Metadata object into the Outlink for each item.


The parse-feed indexer:

  o  Attempts to locate the Metadata in the CrawlDatum, and, if found,

  o  Populates the NutchDocument with fields that correspond to the Metadata
     entries.
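
For anyone who wants to picture the flow above, here is a minimal sketch in plain Java. It deliberately does NOT use the Nutch classes: LinkWithMeta, toLink, toDocumentFields and the "geo.lat" / "geo.long" keys are hypothetical stand-ins, and plain maps take the place of the Metadata object, the CrawlDatum metadata and the NutchDocument fields. It only illustrates the idea of attaching values on the parser side and copying them to fields, if present, on the indexer side.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative stand-ins only. These are NOT the Nutch Outlink, CrawlDatum or
 * NutchDocument classes; plain maps mimic the flow described in the message.
 */
public class FeedMetadataFlowSketch {

    /** Hypothetical stand-in for an outlink that carries metadata. */
    public static final class LinkWithMeta {
        public final String url;
        public final Map<String, String> metadata;

        public LinkWithMeta(String url, Map<String, String> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    /** Hypothetical metadata keys; the real plugin's key names are not given. */
    private static final String[] KEYS = {"geo.lat", "geo.long"};

    /** Parser side: wrap the values captured from one feed item in a link. */
    public static LinkWithMeta toLink(String itemUrl, String lat, String lon) {
        Map<String, String> md = new LinkedHashMap<>();
        md.put("geo.lat", lat);
        md.put("geo.long", lon);
        return new LinkWithMeta(itemUrl, md);
    }

    /** Indexer side: copy the known keys to document fields when they exist. */
    public static Map<String, String> toDocumentFields(Map<String, String> datumMeta) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (datumMeta == null) {
            return fields; // nothing was attached to this URL
        }
        for (String key : KEYS) {
            String value = datumMeta.get(key);
            if (value != null) {
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        LinkWithMeta link = toLink("http://example.com/item1", "34.73", "-86.58");
        // In the real crawl the outlink metadata comes back in the CrawlDatum;
        // here the same map is simply handed to the indexer-side method.
        System.out.println(toDocumentFields(link.metadata));
    }
}
```

The null check mirrors the "if found" step: a page that was not reached through a feed item simply produces no extra fields.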


Thanks again.

Rich

From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com] 
Sent: Friday, August 09, 2013 4:14 AM
To: dev@nutch.apache.org
Subject: Re: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

Hi Rich,

Glad you got it to work. You get the metadata in the crawldatum indeed, as if 
they were passed via the injection. From there you can use the urlmeta + 
index-metadata plugins.

It would be worth checking whether Tika passes on the metadata, in which case 
you could have an HTMLParseFilter pull the stuff with XPath and then add the 
metadata to the outlinks. It would be a bit neater, as you wouldn't need to hack 
the feed plugin at all.
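
As a rough illustration of the XPath part only, here is a sketch that uses the JDK's own javax.xml.xpath API on a feed string. It is not an HTMLParseFilter and does not touch Nutch or Tika; the class name FeedXPathSketch, the method geoByLink, and the assumption that the Geo data sits in a "point" element next to each item's "link" are all hypothetical. Whether Tika actually hands the feed content through in a usable form is still the open question.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: pull per-item values out of feed XML with XPath. */
public class FeedXPathSketch {

    /** Returns item link -> "point" text for every item that has both. */
    public static Map<String, String> geoByLink(String feedXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(feedXml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // local-name() avoids having to register a NamespaceContext for prefixes.
        NodeList items = (NodeList) xpath.evaluate(
                "//*[local-name()='item']", doc, XPathConstants.NODESET);

        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < items.getLength(); i++) {
            Node item = items.item(i);
            String link = xpath.evaluate("*[local-name()='link']", item).trim();
            String point = xpath.evaluate("*[local-name()='point']", item).trim();
            if (!link.isEmpty() && !point.isEmpty()) {
                result.put(link, point); // candidate metadata for that outlink
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String feed =
            "<rss xmlns:georss=\"http://www.georss.org/georss\"><channel>"
          + "<item><link>http://example.com/a</link>"
          + "<georss:point>34.73 -86.58</georss:point></item>"
          + "<item><link>http://example.com/b</link></item>"
          + "</channel></rss>";
        System.out.println(geoByLink(feed));
    }
}
```

Each entry of the returned map is what a parse filter would then attach to the matching outlink as metadata.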

Thanks for sharing your experience

Julien



On 8 August 2013 22:33, Richard Bergmann <rbergm...@colsa.com> wrote:
Julien,

No need to reply -- I "guessed" properly.  The metadata that I am stuffing into 
the outlinks is, indeed, coming back to me in the CrawlDatum, so I am now 
successfully building my index with the crawled/linked page content and the RSS 
feed item info (from metadata).

Of course this required your patch (NUTCH-1622).  Thank you!

Rich Bergmann

-----Original Message-----
From: Richard Bergmann [mailto:rbergm...@colsa.com]
Sent: Thursday, August 08, 2013 12:58 PM
To: dev@nutch.apache.org
Subject: RE: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata

Julien,

I am trying to save myself a bit of time here by asking you this question (and 
making all subscribers listen!) before digging into the code:

Based on this patch (which I have applied), where will the metadata show up 
when it gets to my IndexingFilter extension?  CrawlDatum.getMetaData()?  
Somewhere else?  Do I have to modify an Html parser to ensure the metadata gets 
to my IndexingFilter?

With the current "feed" Parser and IndexingFilter the metadata I am interested 
in is stuffed into the parse metadata: Parse.getData().getParseMeta().

Thank you!

Rich Bergmann

-----Original Message-----
From: Julien Nioche (JIRA) [mailto:j...@apache.org]
Sent: Thursday, August 08, 2013 11:07 AM
To: dev@nutch.apache.org
Subject: [jira] [Updated] (NUTCH-1622) Create Outlinks with metadata


     [ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1622:
---------------------------------

    Attachment: NUTCH-1622.patch

> Create Outlinks with metadata
> -----------------------------
>
>                 Key: NUTCH-1622
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1622
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.7, 2.2.1
>            Reporter: Julien Nioche
>         Attachments: NUTCH-1622.patch
>
>
> Having the possibility to specify metadata when creating an outlink is 
> extremely useful, as it allows information to be passed from a source page 
> to the pages it links to. We use that routinely within our custom parsers in 
> combination with the url-meta plugin.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
