[ 
https://issues.apache.org/jira/browse/SOLR-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766236#comment-16766236
 ] 

Erick Erickson commented on SOLR-13242:
---------------------------------------

Have you tried running this in a Java program rather than an online emulator?

This pattern is suspect: "(\s*\n)\{2,}". "\s" _includes_ newlines, so I doubt 
it's doing what you expect.
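
For anyone who wants to try this outside an online emulator, here is a minimal sketch of such a Java program. It assumes the "\{" in the issue text is just JIRA markup escaping and that the intended pattern is (\s*\n){2,}. It also assumes the "\n" in the report are real line feeds and the spaces are plain ASCII spaces. The class and method names (RegexReplaceCheck, collapseBlankLines) are made up for illustration, and this calls java.util.regex directly rather than going through RegexReplaceProcessorFactory.

```java
import java.util.regex.Pattern;

public class RegexReplaceCheck {
    // Assumed pattern: the one from the issue with the JIRA "\{" escape removed.
    private static final Pattern BLANK_LINES = Pattern.compile("(\\s*\\n){2,}");

    // Replaces every match of the pattern with two <br> tags, the same
    // replacement string the reported configuration uses.
    public static String collapseBlankLines(String text) {
        return BLANK_LINES.matcher(text).replaceAll("<br><br>");
    }

    public static void main(String[] args) {
        // "\s" matches a line feed, so "\s*" can consume newlines by itself.
        System.out.println(Pattern.matches("\\s", "\n"));

        // Example 2 from the issue, transcribed with real line feeds (an assumption).
        String example2 = "exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  "
                + "3 Choa Chu Kang Avenue 4, Singapore";
        System.out.println(collapseBlankLines(example2));
    }
}
```

If the output of a program like this differs from what ends up in the index, it may be worth checking which characters the stored field really contains, since the transcription in the report may not match the raw input.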

> RegexReplaceProcessorFactory not making accurate replacement
> ------------------------------------------------------------
>
>                 Key: SOLR-13242
>                 URL: https://issues.apache.org/jira/browse/SOLR-13242
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.6
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: regex, solr
>
> We are using the RegexReplaceProcessorFactory with the following configuration:
>  
>  <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(\s*\n)\{2,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>  </processor>
>  
> The regex pattern (\s*\n)\{2,} works perfectly on 
> [regex101.com|http://regex101.com/], where every run of \n is replaced by 
> only two <br>.
> However, in Solr there are cases (Examples 2 and 3 below) that have four 
> <br> in a row. This should not happen, as the pattern is meant to replace 
> the run with two <br> regardless of how many \n there are in a row.
>  
>  
> Example 1: A sentence where the above regex pattern works correctly 
> *Original content in EML file:*  
> Dear Sir, 
>  
> I am terminating 
> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> *Index content:*     Dear Sir,  <br><br>I am terminating 
>  
> Example 2: A sentence where the above regex pattern only partially works (as 
> you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*    
> _exalted_
> _Psalm 89:17_
>  
> 3 Choa Chu Kang Avenue 4    
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu 
> Kang Avenue 4, Singapore
> *Index content:* exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu 
> Kang Avenue 4, Singapore
>  
> Example 3: A sentence where the above regex pattern only partially works (as 
> you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*    
> [http://www.concordpri.moe.edu.sg/]
>  
>  
>  
>  
> On Tue, Dec 18, 2018 at 10:07 AM    
> *Original content:* [http://www.concordpri.moe.edu.sg/]   \n\n   \n\n \n \n\n 
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 
> at 10:07 AM 
> *Index content:* [http://www.concordpri.moe.edu.sg/]   <br><br>  <br><br>On 
> Tue, Dec 18, 2018 at 10:07 AM



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
