[ https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1264:
---------------------------------

    Description: 
We currently have several plugins, already distributed or proposed, that do 
very similar things:
- parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and 
index them
- headings [NUTCH-1005] to generate headings fields in parse-metadata and index 
them
- index-extra [NUTCH-422] to index configurable fields 
- urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and 
index them
- index-static [NUTCH-940] to generate configurable static fields 

All these plugins extract information from various sources and generate 
fields from it, so they are largely redundant. Instead, this issue proposes 
to have a single plugin that generates configurable fields from:
- static values
- parse metadata
- content metadata
- crawldb metadata

and let the other plugins focus on parsing and extracting the values to 
index. This will make adding new fields simpler by relying on a stable 
common plugin instead of duplicating the code in various plugins.

This plugin will replace index-extra [NUTCH-422] and will serve as a basis for 
further improvements.
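
The idea of one configurable field generator can be illustrated with a 
minimal sketch. It is not the code from the attached patches and does not use 
the Nutch API: the function name build_fields, the source labels ("static", 
"parse", "content", "crawldb") and the shape of the configuration are all 
assumptions made for illustration, and the three metadata stores are modelled 
as plain dictionaries.

```python
from typing import Dict, List, Mapping, Tuple


def build_fields(
    field_config: Mapping[str, Tuple[str, str]],
    parse_meta: Mapping[str, str],
    content_meta: Mapping[str, str],
    crawldb_meta: Mapping[str, str],
) -> Dict[str, List[str]]:
    """Resolve configured index fields from the four kinds of sources.

    field_config maps an index field name to (source, key_or_value).
    All names here are hypothetical; this is not the plugin's real config.
    """
    metadata = {
        "parse": parse_meta,
        "content": content_meta,
        "crawldb": crawldb_meta,
    }
    fields: Dict[str, List[str]] = {}
    for field_name, (source, key_or_value) in field_config.items():
        if source == "static":
            # Static fields carry their value directly in the configuration.
            fields[field_name] = [key_or_value]
        elif source in metadata:
            # Metadata fields are looked up by key; missing keys are skipped.
            value = metadata[source].get(key_or_value)
            if value is not None:
                fields[field_name] = [value]
        else:
            raise ValueError(
                f"unknown source {source!r} for field {field_name!r}"
            )
    return fields


if __name__ == "__main__":
    config = {
        "collection": ("static", "news"),
        "description": ("parse", "metatag.description"),
        "mime": ("content", "Content-Type"),
        "seed": ("crawldb", "seed_label"),
    }
    print(build_fields(
        config,
        parse_meta={"metatag.description": "An example page"},
        content_meta={"Content-Type": "text/html"},
        crawldb_meta={},
    ))
```

In this sketch, adding a new field only means adding one entry to the 
configuration, while the parsing plugins stay responsible for putting values 
into the metadata in the first place.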




  was:
We currently have several plugins, already distributed or proposed, that do 
very similar things:
- parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and 
index them
- headings [NUTCH-1005] to generate headings fields in parse-metadata and index 
them
- index-extra [NUTCH-422] to index configurable fields 
- urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and 
index them
- index-static [NUTCH-940] to generate configurable static fields 

All these plugins extract information from various sources and generate 
fields from it, so they are largely redundant. Instead, this issue proposes 
to have a single plugin that generates configurable fields from:
- static values
- parse metadata
- content metadata
- crawldb metadata

and let the other plugins focus on parsing and extracting the values to 
index. This will make adding new fields simpler by relying on a stable 
common plugin instead of duplicating the code in various plugins.

This plugin will replace index-static [NUTCH-940] and index-extra [NUTCH-422] 
and will serve as a basis for further improvements.




        Summary: Configurable indexing plugin (index-metadata) (was: 
Configurable indexing plugin (index-extra))
    
> Configurable indexing plugin (index-metadata) 
> ----------------------------------------------
>
>                 Key: NUTCH-1264
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1264
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.5
>            Reporter: Julien Nioche
>         Attachments: NUTCH-1264-trunk-v2.patch, NUTCH-1264-trunk.patch
>
>
> We currently have several plugins, already distributed or proposed, that do 
> very similar things:
> - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and 
> index them
> - headings [NUTCH-1005] to generate headings fields in parse-metadata and 
> index them
> - index-extra [NUTCH-422] to index configurable fields 
> - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks 
> and index them
> - index-static [NUTCH-940] to generate configurable static fields 
>
> All these plugins extract information from various sources and generate 
> fields from it, so they are largely redundant. Instead, this issue proposes 
> to have a single plugin that generates configurable fields from:
> - static values
> - parse metadata
> - content metadata
> - crawldb metadata
>
> and let the other plugins focus on parsing and extracting the values to 
> index. This will make adding new fields simpler by relying on a stable 
> common plugin instead of duplicating the code in various plugins.
>
> This plugin will replace index-extra [NUTCH-422] and will serve as a basis 
> for further improvements.
