[ https://issues.apache.org/jira/browse/NUTCH-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann reassigned NUTCH-2058:
----------------------------------------

    Assignee: Chris A. Mattmann

> Indexer plugin that allows RegEx replacements on the NutchDocument field values
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-2058
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2058
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Peter Ciuffetti
>            Assignee: Chris A. Mattmann
>             Fix For: 1.11
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> This is the description of an IndexingFilter plugin I'm developing that
> allows regex replacements on field values prior to indexing to your search
> engine.
> *Plugin name*: index-replace
> *Property name*: index.replace.regexp
> *Use case example:*
> I'm indexing Nutch-created documents to a pre-existing SOLR core. In this
> case I need to coerce the documents into the schema and field formats
> expected by the existing core. The features of index-static and
> solrindex-mapping.xml get me most of the way. Among other things, I need to
> generate identifiers from the web URLs. So I need to do something like a
> regex replace on the id provided and then (with solrindex-mapping.xml) move
> this to the field name defined by the existing core.
> Another use case might be to rewrite all URLs stored in the document so they
> route through a redirector gateway.
> The following is from the draft description in nutch-default.xml.
> *Description:*
> Allows indexing-time regexp replace manipulation of metadata fields. The
> format of the property is a list of regexp replacements, one line per field
> being modified. To use this property, add index-replace to your list of
> activated plugins.
>
> *Example:*
> {code:xml}
> <property>
>   <name>index.replace.regexp</name>
>   <value>
>     fldname1=/regexp/replacement/flags
>     fldname2=/regexp/replacement/flags
>   </value>
> </property>
> {code}
> Field names would be one of those from
> https://wiki.apache.org/nutch/IndexStructure. The replacements will happen in
> the order listed. If a field needs multiple replacement operations, it may
> be listed more than once.
> The *field name* precedes the equal sign. The first character after the
> equal sign signifies the delimiter for the regexp, the replacement value and
> the flags.
> The *regexp* and the optional *flags* should correspond to
> Pattern.compile(String regexp, int flags) defined here:
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#compile%28java.lang.String,%20int%29
> The *flags* value is the integer sum of the flag values defined in
> http://docs.oracle.com/javase/7/docs/api/constant-values.html (Sec:
> java.util.regex.Pattern).
> For efficiency, patterns are compiled once, when the plugin is initialized.
> *Escaping*: since the regexp is being read from a config file, any escaped
> values must be double escaped. E.g.:
> {code}
> id=/\\s+//
> {code}
> will cause the escaped \s+ match pattern to be used.
> The *replacement* value should correspond to Java's
> Pattern.matcher(CharSequence input).replaceAll(String replacement):
> http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#replaceAll%28java.lang.String%29
>
> *Multi-valued Fields*
> If a field has multiple values, the replacement will be applied to each value
> in turn.
> *Non-string Datatypes*
> Replacement is possible only on String field datatypes. If the field you
> name in the property is not a String datatype, it will be silently ignored.
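To make the rule format above concrete, here is a minimal Java sketch that parses one `fldname=/regexp/replacement/flags` line and applies it to a single string value. It is an illustration only, not the plugin's code: the class `IndexReplaceSketch`, the nested `Rule` type, and the `parse` method are hypothetical names. The sketch assumes the pattern has already been unescaped by the config reader, does not support a delimiter character inside the regexp or replacement, and does not handle multi-valued or non-string fields.

```java
import java.util.regex.Pattern;

public class IndexReplaceSketch {

    /** One parsed rule: target field, precompiled pattern, replacement text. */
    public static final class Rule {
        public final String field;
        public final Pattern pattern;
        public final String replacement;

        Rule(String field, Pattern pattern, String replacement) {
            this.field = field;
            this.pattern = pattern;
            this.replacement = replacement;
        }

        /** Applies the rule to one value, as Matcher.replaceAll would. */
        public String apply(String value) {
            return pattern.matcher(value).replaceAll(replacement);
        }
    }

    /** Parses a line such as "id=/regexp/replacement/flags" (hypothetical helper). */
    public static Rule parse(String line) {
        String trimmed = line.trim(); // simplification: also drops trailing spaces
        int eq = trimmed.indexOf('=');
        if (eq < 1 || eq + 1 >= trimmed.length()) {
            throw new IllegalArgumentException("Expected field=<delim>regexp<delim>replacement<delim>flags");
        }
        String field = trimmed.substring(0, eq).trim();
        String spec = trimmed.substring(eq + 1);

        // The first character after '=' is the delimiter for the remaining parts.
        char delim = spec.charAt(0);
        String[] parts = spec.substring(1).split(Pattern.quote(String.valueOf(delim)), -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Missing replacement in: " + line);
        }

        // Flags are optional; when present they are the integer sum of Pattern flag constants.
        int flags = 0;
        if (parts.length > 2 && !parts[2].trim().isEmpty()) {
            flags = Integer.parseInt(parts[2].trim());
        }
        // Compile once here so the pattern can be reused for every document.
        return new Rule(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    public static void main(String[] args) {
        // "\\s+" is Java string-literal escaping for the regexp \s+
        System.out.println(parse("id=/\\s+//").apply("a b  c"));
        // 2 is the value of Pattern.CASE_INSENSITIVE
        System.out.println(parse("title=/nutch/Nutch/2").apply("apache NUTCH"));
        // A non-slash delimiter avoids clashing with the slashes in a URL
        System.out.println(parse("url=#^http://#https://#").apply("http://example.org/"));
    }
}
```

Running `main` prints `abc`, `apache Nutch`, and `https://example.org/`. The third rule shows why letting the first character choose the delimiter is convenient: URL patterns containing `/` can use `#` instead.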
> *Host and URL specific replacements*
> If the replacements should apply only to specific pages, then add a sequence
> like
> {code}
> hostmatch=hostmatchpattern
> fld1=/regexp/replace/flags
> fld2=/regexp/replace/flags
> {code}
> or
> {code}
> urlmatch=urlmatchpattern
> fld1=/regexp/replace/flags
> fld2=/regexp/replace/flags
> {code}
> When using Host and URL replacements, all replacements preceding the first
> hostmatch or urlmatch will apply to all Nutch documents. Replacements
> following a hostmatch or urlmatch will be applied to Nutch documents that
> match the host or url field (up to the next hostmatch or urlmatch line).
> hostmatch and urlmatch patterns must be unique in this property.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)