[ https://issues.apache.org/jira/browse/SOLR-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edwin Yeo Zheng Lin updated SOLR-13242:
---------------------------------------
    Affects Version/s: 7.7.1

> RegexReplaceProcessorFactory not making accurate replacement
> ------------------------------------------------------------
>
>                 Key: SOLR-13242
>                 URL: https://issues.apache.org/jira/browse/SOLR-13242
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>    Affects Versions: 7.6, 7.7, 7.7.1
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: regex, solr
>
> We are using the RegexReplaceProcessorFactory and have tried all of the
> following configurations in solrconfig.xml:
>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">(\s*\r?\n)\{2,}</str>
>   <str name="replacement"><br><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">([ \s]*\r?\n)\{2,}</str>
>   <str name="replacement"><br><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">(\s*\n)\{2,}</str>
>   <str name="replacement"><br><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">(\n\s*)\{2,}</str>
>   <str name="replacement"><br><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
>
> The regex patterns (\s*\r?\n)\{2,}, ([ \s]*\r?\n)\{2,}, (\s*\n)\{2,} and
> (\n\s*)\{2,} all work as expected in [regex101.com|http://regex101.com/],
> where every run of \n is replaced by exactly two <br>.
> However, in Solr there are cases (Examples 2 and 3 below) that end up with
> four <br> in a row. This should not happen, because the replacement is set
> to two <br> regardless of how many \n appear in a row.
>
>
> *Example 1: A sentence where the above regex patterns work correctly*
> *Original content in EML file:*
> Dear Sir,
>
> I am terminating
> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> *Index content:* Dear Sir, <br><br>I am terminating
>
> *Example 2: A sentence where the above regex patterns only partially work
> (instead of 2 <br>, there are 4 <br>)*
> *Original content in EML file:*
> _exalted_
> _Psalm 89:17_
>
> 3 Choa Chu Kang Avenue 4
> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>
> *Example 3: A sentence where the above regex patterns only partially work
> (instead of 2 <br>, there are 4 <br>)*
> *Original content in EML file:*
> [http://www.concordpri.moe.edu.sg/]
>
>
>
>
> On Tue, Dec 18, 2018 at 10:07 AM
> *Original content:* [http://www.concordpri.moe.edu.sg/] \n\n \n\n \n \n\n
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018
> at 10:07 AM
> *Index content:* [http://www.concordpri.moe.edu.sg/] <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
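
The expectation in the report can be checked outside Solr. The following is a minimal sketch under stated assumptions, not Solr's implementation. It uses Python's `re` module in place of `java.util.regex`, with `re.ASCII` so that `\s` covers the same ASCII whitespace set as Java's default `\s`. It assumes the backslash before `{` in the reported patterns is markup escaping and that the configured quantifier is `{2,}`. The function name `collapse_blank_lines` and the third input are hypothetical: that input places a non-breaking space (U+00A0) inside a run of newlines, which is only one possible explanation for the doubled output and is not confirmed by the report.

```python
import re

# Fourth pattern from the report, written with a plain `{2,}` quantifier.
# Assumption: the backslash before `{` in the report is markup escaping,
# not part of the pattern configured in solrconfig.xml.
# re.ASCII restricts \s to [ \t\n\r\f\v], matching Java's default \s.
PATTERN = re.compile(r"(\n\s*){2,}", re.ASCII)
REPLACEMENT = "<br><br>"  # no backslashes or group references, so inserted as-is


def collapse_blank_lines(text: str) -> str:
    """Replace every run of two or more newline groups with one <br><br>."""
    return PATTERN.sub(REPLACEMENT, text)


# "Original content" strings from Examples 1 and 2, with \n as real newlines.
example_1 = "Dear Sir, \n\n \n \n\n I am terminating"
example_2 = (
    "exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore"
)

# Hypothetical variant of Example 2: the space between the two newline pairs
# is a non-breaking space (U+00A0), which ASCII \s does not match.
example_2_nbsp = (
    "exalted \n \n\n Psalm 89:17 \n\n\u00a0\n\n 3 Choa Chu Kang Avenue 4, Singapore"
)

for sample in (example_1, example_2, example_2_nbsp):
    print(repr(collapse_blank_lines(sample)))
```

With plain spaces, each run in Examples 1 and 2 collapses to a single `<br><br>`. In the hypothetical U+00A0 variant the run is split in two, giving two `<br><br>` separated by a character that displays as a space, which resembles the index content reported for Example 2. Inspecting the code points of the indexed field would show whether such a character is actually present.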