[ https://issues.apache.org/jira/browse/SOLR-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905044#comment-16905044 ]
Chongchen Chen commented on SOLR-13242:
---------------------------------------

I wrote a unit test, but I cannot reproduce your problem. RegexReplaceProcessorFactory is only a thin wrapper around java.util.regex.Pattern, so I don't think anything is wrong in it. The test below runs the four patterns from the issue, one after another, against the "mail" field and prints the result. My code is:

{code:java}
package org.apache.solr.update.processor;

import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.servlet.SolrRequestParsers;
import org.apache.solr.update.AddUpdateCommand;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

public class RegexReplaceProcessorFactoryTest extends SolrTestCaseJ4 {

  protected static FieldValueMutatingUpdateProcessor[] reProcessors;
  protected static SolrRequestParsers _parser;
  protected static ModifiableSolrParams parameters;
  private static RegexReplaceProcessorFactory[] factorys;
  private SolrInputDocument document;

  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    System.setProperty("enable.update.log", "false"); // schema12 doesn't support _version_
    initCore("solrconfig.xml", "schema12.xml");
    SolrCore core = h.getCore();
    _parser = new SolrRequestParsers( null );
    SolrQueryResponse resp = null;

    // Settings shared by all processors; only "pattern" changes per processor.
    parameters = new ModifiableSolrParams();
    parameters.set("fieldName", "mail");
    String[] patterns = {"(\\s*\\r?\\n){2,}", "([ \\s]*\\r?\\n){2,}", "(\\s*\\n){2,}", "(\\n\\s*){2,}"};
    parameters.set("replacement", "<br><br>");
    parameters.set("literalReplacement", "true");

    factorys = new RegexReplaceProcessorFactory[patterns.length];
    reProcessors = new FieldValueMutatingUpdateProcessor[patterns.length];
    SolrQueryRequest req = _parser.buildRequestFrom(core, new ModifiableSolrParams(), null);

    // Build one processor per pattern from the issue report.
    for (int i = 0; i < patterns.length; i++) {
      parameters.set("pattern", patterns[i]);
      factorys[i] = new RegexReplaceProcessorFactory();
      factorys[i].init(parameters.toNamedList());
      factorys[i].inform(core);
      reProcessors[i] = (FieldValueMutatingUpdateProcessor) factorys[i].getInstance(req, resp, null);
    }
  }

  @AfterClass
  public static void tearDownAfterClass() throws Exception {
    // null static members for gc
    reProcessors = null;
    _parser = null;
    parameters = null;
    factorys = null;
  }

  @Override
  @Before
  public void setUp() throws Exception {
    document = new SolrInputDocument();
    super.setUp();
  }

  @Test
  public void testSOLR13242() throws Exception {
    document.addField("id", "doc1");
    // "Original content" of Example 2 from the issue description.
    document.addField("mail", "exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore");
    processAdd(document);
    System.out.println(document.getFieldValue("mail"));
  }

  // Runs the document through every processor in turn; each one mutates the same field.
  private void processAdd(SolrInputDocument doc) throws Exception {
    AddUpdateCommand addCommand = new AddUpdateCommand(null);
    addCommand.solrDoc = doc;
    for (int i = 0; i < reProcessors.length; i++) {
      reProcessors[i].processAdd(addCommand);
    }
  }
}
{code}

> RegexReplaceProcessorFactory not making accurate replacement
> ------------------------------------------------------------
>
>                 Key: SOLR-13242
>                 URL: https://issues.apache.org/jira/browse/SOLR-13242
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 7.6, 7.7, 7.7.1
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: regex, solr
>
> We are using the RegexReplaceProcessorFactory, and have tried with all of the
> following configurations in solrconfig.xml:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\s*\r?\n)\{2,}</str>
>    <str name="replacement"><br><br></str>
>    <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">([ \s]*\r?\n)\{2,}</str>
>    <str name="replacement"><br><br></str>
>    <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\s*\n)\{2,}</str>
>    <str name="replacement"><br><br></str>
>    <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\n\s*)\{2,}</str>
>    <str name="replacement"><br><br></str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> The regex pattern of (\s*\r?\n)\{2,}, ([ \s]*\r?\n)\{2,}, (\s*\n)\{2,} and
> (\n\s*)\{2,} are working perfectly in [regex101.com|http://regex101.com/], in
> which all the \n will be replaced by only two <br>
> However, in Solr, there are cases (in Example 2 and 3 below) that has four
> <br> in a row. This should not be the case, as we have already set it to
> replace by two <br> regardless of how many \n are there in a row.
>
>
> *Example 1: The sentence that the above regex pattern is working correctly*
> *Original content in EML file:*
> Dear Sir,
>
> I am terminating
> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> *Index content:* Dear Sir, <br><br>I am terminating
>
> *Example 2: The sentence that the above regex pattern is partially working
> (as you can see, instead of 2 <br>, there are 4 <br>)*
> *Original content in EML file:*
> _exalted_
> _Psalm 89:17_
>
> 3 Choa Chu Kang Avenue 4
> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu
> Kang Avenue 4, Singapore
> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu
> Kang Avenue 4, Singapore
>
> *Example 3: The sentence that the above regex pattern is partially working
> (as you can see, instead of 2 <br>, there are 4 <br>)*
> *Original content in EML file:*
> [http://www.concordpri.moe.edu.sg/]
>
>
>
>
> On Tue, Dec 18, 2018 at 10:07 AM
> *Original content:* [http://www.concordpri.moe.edu.sg/] \n\n \n\n \n \n\n
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018
> at 10:07 AM
> *Index content:* [http://www.concordpri.moe.edu.sg/] <br><br> <br><br>On
> Tue, Dec 18, 2018 at 10:07 AM
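
For anyone who wants to check the regex behaviour without a Solr core: the sketch below applies the first pattern from the issue directly with java.util.regex, using the "Original content" string from Example 2. It is only an illustration, not the Solr code path. The class name RegexCollapseSketch and the method collapse are hypothetical, and it assumes that literalReplacement=true amounts to passing the replacement through Matcher.quoteReplacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone sketch; not part of Solr.
public class RegexCollapseSketch {

  // First pattern from the issue: two or more "optional whitespace + newline" groups.
  private static final Pattern BLANK_LINES = Pattern.compile("(\\s*\\r?\\n){2,}");

  // Replaces every run matched by BLANK_LINES with a literal "<br><br>".
  // quoteReplacement is used on the assumption that it mirrors literalReplacement=true.
  static String collapse(String input) {
    return BLANK_LINES.matcher(input).replaceAll(Matcher.quoteReplacement("<br><br>"));
  }

  public static void main(String[] args) {
    // "Original content" of Example 2 in the issue description.
    String mail = "exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore";
    System.out.println(collapse(mail));
  }
}
```

With plain spaces and \n characters as written here, each whitespace run is matched once, so no doubled "<br><br> <br><br>" appears. If the indexed field contains other characters between the newlines (for example non-breaking spaces, which \s does not match by default), the runs would be split and the output would differ; that is a guess worth checking against the real input, not something this sketch proves.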