[ https://issues.apache.org/jira/browse/SOLR-4864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sunil Srinivasan updated SOLR-4864: ----------------------------------- Attachment: SOLR-4864.patch Thanks Steve for the review I've made the changes and attached the patch > RegexReplaceProcessorFactory should support pattern capture group > substitution in replacement string > ---------------------------------------------------------------------------------------------------- > > Key: SOLR-4864 > URL: https://issues.apache.org/jira/browse/SOLR-4864 > Project: Solr > Issue Type: Improvement > Components: update > Affects Versions: 4.3 > Reporter: Jack Krupansky > Attachments: SOLR-4864.patch, SOLR-4864.patch > > > It is unfortunate the the replacement string for RegexReplaceProcessorFactory > is a pure, "quoted" (escaped) literal and does not support pattern capture > group substitution. This processor should be enhanced to support full, > standard pattern capture group substitution. > The test case I used: > {code} > <updateRequestProcessorChain name="regex-mark-special-words"> > <processor class="solr.RegexReplaceProcessorFactory"> > <str name="fieldRegex">.*</str> > <str name="pattern">([^a-zA-Z]|^)(cat|dog|fox)([^a-zA-Z]|$)</str> > <str name="replacement">$1<<$2>>$3</str> > </processor> > <processor class="solr.LogUpdateProcessorFactory" /> > <processor class="solr.RunUpdateProcessorFactory" /> > </updateRequestProcessorChain> > {code} > Indexing with this command against the standard Solr example with the above > addition to solrconfig: > {code} > curl > "http://localhost:8983/solr/update?commit=true&update.chain=regex-mark-special-words" > \ > -H 'Content-type:application/json' -d ' > [{"id": "doc-1", > "title": "Hello World", > "content": "The cat and the dog jumped over the fox.", > "other_ss": ["cat","cat bird", "lazy dog", "red fox den"]}]' > {code} > Alas, the resulting document consists of: > {code} > "id":"doc-1", > "title":["Hello World"], > "content":["The$1<<$2>>$3and the$1<<$2>>$3jumped over the$1<<$2>>$3"], > "other_ss":["$1<<$2>>$3", > "$1<<$2>>$3bird", > "lazy$1<<$2>>$3", > "red$1<<$2>>$3den"], > {code} > The Javadoc for RegexReplaceProcessorFactory uses the exact same terminology > of "replacement string", as does Java's Matcher.replaceAll, but clearly the > semantics are distinct, with replaceAll supporting pattern capture group > substitution for its "replacement string", while RegexReplaceProcessorFactory > interprets "replacement string" as being a literal. At a minimum, the > RegexReplaceProcessorFactory Javadoc should explicitly state that the string > is a literal that does not support pattern capture group substitution. > The relevant code in RegexReplaceProcessorFactory#init: > {code} > replacement = Matcher.quoteReplacement(replacementParam.toString()); > {code} > Possible options for the enhancement: > 1. Simply skip the quoteReplacement and fully support pattern capture group > substitution with no additional changes. Does have a minor backcompat issue. > 2. Add an alternative to "replacement", say "nonQuotedReplacement" that is > not quoted as "replacement" is. > 3. Add an option, say "quotedReplacement" that defaults to "true" for > backcompat, but can be set to "false" to support full replaceAll pattern > capture group substitution. -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org