[ 
https://issues.apache.org/jira/browse/NUTCH-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2058:
----------------------------------------

    Assignee: Chris A. Mattmann

> Indexer plugin that allows RegEx replacements on the NutchDocument field 
> values
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-2058
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2058
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Peter Ciuffetti
>            Assignee: Chris A. Mattmann
>             Fix For: 1.11
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> This is the description of an IndexingFilter plugin I'm developing that allows 
> regex replacements on field values before they are indexed to your search engine.
> *Plugin name*: index-replace
> *Property name*: index.replace.regexp
> *Use case example:*
> I'm indexing Nutch-created documents to a pre-existing Solr core.  In this 
> case I need to coerce the documents into the schema and field formats 
> expected by the existing core.  The features of index-static and 
> solrindex-mapping.xml get me most of the way.  Among other things, I need to 
> generate identifiers from the web URLs.  So I need to do something like a 
> regex replace on the id provided and then (with solrindex-mapping.xml) move 
> this to the field name defined by the existing core.
> Another use case might be to refactor all URLs stored in the document so they 
> route through a redirector gateway.
> The following is from the draft description in nutch-default.xml
> *Description:*
> Allows indexing-time regexp replace manipulation of metadata fields. The 
> format of the property is a list of regexp replacements, one line per field 
> being modified.  To use this property, add index-replace to your list of 
> activated plugins.
>     
> *Example:*
> {code:xml}
> <property>
>   <name>index.replace.regexp</name>
>   <value>
>         fldname1=/regexp/replacement/flags
>         fldname2=/regexp/replacement/flags
>   </value>
> </property>
> {code}
> Field names would be one of those from 
> https://wiki.apache.org/nutch/IndexStructure. The replacements will happen in 
> the order listed. If a field needs multiple replacement operations they may 
> be listed more than once.
> The *field name* precedes the equal sign.  The first character after the 
> equal sign signifies the delimiter for the regexp, the replacement value and 
> the flags.
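A minimal sketch of how one line of the property could be split, assuming the format described above (field name, then a delimiter chosen by the first character after the equal sign). The class name ReplaceRule and its methods are hypothetical and are not taken from the plugin's source. The sketch also assumes the delimiter never occurs inside the regexp or the replacement.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the plugin's actual code.
final class ReplaceRule {
    final String field;
    final Pattern pattern;
    final String replacement;

    ReplaceRule(String field, Pattern pattern, String replacement) {
        this.field = field;
        this.pattern = pattern;
        this.replacement = replacement;
    }

    // Parses "field=<d>regexp<d>replacement<d>flags", where <d> is the
    // first character after the equal sign. Flags are optional (default 0).
    static ReplaceRule parse(String line) {
        String trimmed = line.trim();
        int eq = trimmed.indexOf('=');
        if (eq <= 0 || eq + 1 >= trimmed.length()) {
            throw new IllegalArgumentException("Malformed rule: " + line);
        }
        String field = trimmed.substring(0, eq).trim();
        String delim = String.valueOf(trimmed.charAt(eq + 1));
        // Limit -1 keeps trailing empty strings, so an empty replacement
        // or an empty flags part is preserved.
        String[] parts = trimmed.substring(eq + 2).split(Pattern.quote(delim), -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Missing replacement: " + line);
        }
        int flags = 0;
        if (parts.length > 2 && !parts[2].trim().isEmpty()) {
            flags = Integer.parseInt(parts[2].trim());
        }
        // Compile once here so the pattern can be reused for every document.
        return new ReplaceRule(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    String apply(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    public static void main(String[] args) {
        ReplaceRule rule = ReplaceRule.parse("title=#foo#bar#2");
        System.out.println(rule.field + " -> " + rule.apply("FOO x"));
    }
}
```

Because the delimiter is taken from the rule itself, a rule whose regexp contains a slash can pick another character, as the `#` in the usage line does.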
> The *regexp* and the optional *flags* should correspond to 
> Pattern.compile(String regexp, int flags) defined here: 
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile%28java.lang.String,%20int%29
> The *flags* value is the integer sum of the flag values defined in 
> http://docs.oracle.com/javase/7/docs/api/constant-values.html (Sec: 
> java.util.regex.Pattern)
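As an illustration of the integer sum, the sketch below adds two of the standard java.util.regex.Pattern constants and compiles a pattern with the result. The class name FlagSum and the sample pattern are made up for the example.

```java
import java.util.regex.Pattern;

// Illustrative only: shows how a flags integer is formed from Pattern constants.
final class FlagSum {
    // CASE_INSENSITIVE is 2 and MULTILINE is 8, so the sum is 10.
    static int caseInsensitiveMultiline() {
        return Pattern.CASE_INSENSITIVE + Pattern.MULTILINE;
    }

    public static void main(String[] args) {
        int flags = caseInsensitiveMultiline();
        System.out.println(flags); // prints 10
        // With both flags set, ^ and $ match at line boundaries and case is ignored.
        Pattern p = Pattern.compile("^nutch$", flags);
        System.out.println(p.matcher("first\nNUTCH\nlast").find()); // prints true
    }
}
```

Under this reading, a rule that wants both behaviours would end in the flags value 10.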
> Patterns are compiled when the plugin is initialized for efficiency.
> *Escaping*: since the regexp is read from a config file, any escaped 
> values must be double-escaped.  E.g.:  {code}
>   id=/\\s+//
> {code} will cause the escaped \s+ match pattern to be used.
> The *replacement* value should correspond to Java's 
> Pattern.matcher(CharSequence input).replaceAll(String replacement):  
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll%28java.lang.String%29
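A short sketch of the replaceAll semantics referred to above, loosely modelled on the redirector-gateway use case. The gateway host and the class name ReplacementDemo are invented for the example. In a replacement string, `$1` refers to the first capturing group. A literal dollar sign or backslash has to be escaped, for instance with Matcher.quoteReplacement.

```java
import java.util.regex.Pattern;

// Illustrative only: the gateway URL is a made-up placeholder.
final class ReplacementDemo {
    static String rewrite(String url) {
        // Capture everything after the scheme and re-emit it behind the gateway prefix.
        return Pattern.compile("^https?://(.*)$")
                .matcher(url)
                .replaceAll("http://gateway.example.org/r/$1");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("http://nutch.apache.org/a"));
        // prints http://gateway.example.org/r/nutch.apache.org/a
    }
}
```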
>     
> *Multi-valued Fields*
> If a field has multiple values, the replacement will be applied to each value 
> in turn.
> *Non-string Datatypes*
> Replacement is possible only on String field datatypes.  If the field you 
> name in the property is not a String datatype, it will be silently ignored.
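The two rules above (apply to each value in turn, skip anything that is not a String) could look roughly like the following. This is a sketch over a plain list of values, not the NutchDocument API. FieldValueReplacer is a hypothetical name, and skipping per value rather than per field is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; operates on a plain list rather than a NutchDocument field.
final class FieldValueReplacer {
    static List<Object> replaceValues(List<Object> values, Pattern pattern, String replacement) {
        List<Object> out = new ArrayList<Object>(values.size());
        for (Object value : values) {
            if (value instanceof String) {
                // String values are rewritten one at a time.
                out.add(pattern.matcher((String) value).replaceAll(replacement));
            } else {
                // Non-String values pass through unchanged, with no warning.
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> values = Arrays.<Object>asList("a b", 42, "c  d");
        System.out.println(replaceValues(values, Pattern.compile("\\s+"), "_"));
        // prints [a_b, 42, c_d]
    }
}
```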
> *Host and URL specific replacements*
> If the replacements should apply only to specific pages, then add a sequence 
> like
> {code}
>     hostmatch=hostmatchpattern
>     fld1=/regexp/replace/flags
>     fld2=/regexp/replace/flags
> {code}
>     or
> {code}
>     urlmatch=urlmatchpattern
>     fld1=/regexp/replace/flags
>     fld2=/regexp/replace/flags
> {code}
> When using Host and URL replacements, all replacements preceding the first 
> hostmatch or urlmatch will apply to all Nutch documents.  Replacements 
> following a hostmatch or urlmatch will be applied to Nutch documents that 
> match the host or url field (up to the next hostmatch or urlmatch line).  
> hostmatch and urlmatch patterns must be unique in this property.
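One way to read the scoping rule is that the property value is split into sections: a global section first, then one section per hostmatch or urlmatch line. The sketch below groups raw rule lines that way. The class name RuleSections and the "*" key for the global section are assumptions made for the example, and rejecting duplicates with an exception is one possible interpretation of "must be unique".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; groups rule lines by the hostmatch/urlmatch line they follow.
final class RuleSections {
    static final String GLOBAL = "*"; // made-up key for rules that apply to all documents

    static Map<String, List<String>> group(String propertyValue) {
        Map<String, List<String>> sections = new LinkedHashMap<String, List<String>>();
        String current = GLOBAL;
        sections.put(current, new ArrayList<String>());
        for (String raw : propertyValue.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("hostmatch=") || line.startsWith("urlmatch=")) {
                if (sections.containsKey(line)) {
                    throw new IllegalArgumentException("Duplicate section: " + line);
                }
                // Rules that follow belong to this section until the next match line.
                current = line;
                sections.put(current, new ArrayList<String>());
            } else {
                sections.get(current).add(line);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String value = "a=/x/y/\nhostmatch=.*\\.example\\.org\nb=/x/y/\nurlmatch=.*/docs/.*\nc=/x/y/";
        System.out.println(group(value));
    }
}
```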



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)