[ 
https://issues.apache.org/jira/browse/SOLR-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905044#comment-16905044
 ] 

Chongchen Chen commented on SOLR-13242:
---------------------------------------

I wrote a unit test, but I cannot reproduce your problem. 
RegexReplaceProcessorFactory is only a wrapper around java.util.regex.Pattern, 
so I don't think the fault lies there. My code is:


{code:java}
package org.apache.solr.update.processor;

import org.apache.solr.SolrTestCaseJ4;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.servlet.SolrRequestParsers;
import org.apache.solr.update.AddUpdateCommand;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

public class RegexReplaceProcessorFactoryTest extends SolrTestCaseJ4 {
  protected static FieldValueMutatingUpdateProcessor[] reProcessors;
  protected static SolrRequestParsers _parser;
  protected static ModifiableSolrParams parameters;
  private static RegexReplaceProcessorFactory[] factorys;
  private SolrInputDocument document;

  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    // schema12 doesn't support _version_
    System.setProperty("enable.update.log", "false");
    initCore("solrconfig.xml", "schema12.xml");
    SolrCore core = h.getCore();
    _parser = new SolrRequestParsers( null );
    SolrQueryResponse resp = null;
    parameters = new ModifiableSolrParams();
    parameters.set("fieldName", "mail");
    String[] patterns = {"(\\s*\\r?\\n){2,}", "([ \\s]*\\r?\\n){2,}", 
"(\\s*\\n){2,}", "(\\n\\s*){2,}"};
    parameters.set("replacement", "<br><br>");
    parameters.set("literalReplacement", "true");
    factorys = new RegexReplaceProcessorFactory[patterns.length];
    reProcessors = new FieldValueMutatingUpdateProcessor[patterns.length];
    SolrQueryRequest req = _parser.buildRequestFrom(core, new 
ModifiableSolrParams(), null);
    for (int i = 0; i < patterns.length; i++){
      parameters.set("pattern",  patterns[i]);
      factorys[i] = new RegexReplaceProcessorFactory();
      factorys[i].init(parameters.toNamedList());
      factorys[i].inform(core);
      reProcessors[i] = (FieldValueMutatingUpdateProcessor) 
factorys[i].getInstance(req, resp, null);
    }
  }

  @AfterClass
  public static void tearDownAfterClass() throws Exception {
    // null static members for gc
    reProcessors = null;
    _parser = null;
    parameters = null;
    factorys = null;
  }

  @Before
  public void setUp() throws Exception {
    document = new SolrInputDocument();
    super.setUp();
  }

  @Test
  public void testSOLR13242() throws Exception {
    document.addField("id", "doc1");
    document.addField("mail",
        "exalted \n \n\n Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore");
    processAdd(document);
    System.out.println(document.getFieldValue("mail"));
  }

  private void processAdd(SolrInputDocument doc) throws Exception {
    AddUpdateCommand addCommand = new AddUpdateCommand(null);
    addCommand.solrDoc = doc;
    for (int i = 0; i < reProcessors.length; i++){
      reProcessors[i].processAdd(addCommand);
    }
  }

}

{code}
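
For comparison, here is a minimal standalone sketch that takes Solr out of the 
picture. It assumes the processor does nothing more than compile the pattern 
and call replaceAll with a quoted replacement when literalReplacement is true. 
The class and method names (PlainRegexCheck, collapse) are made up for 
illustration and are not part of Solr.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: applies the first pattern from the
// issue with plain java.util.regex, outside any update processor chain.
public class PlainRegexCheck {
  // First pattern from the issue description.
  static final Pattern BLANK_LINES = Pattern.compile("(\\s*\\r?\\n){2,}");

  public static String collapse(String value) {
    // Assumption: quoteReplacement mirrors literalReplacement=true,
    // so '$' and '\' in the replacement are taken literally.
    return BLANK_LINES.matcher(value).replaceAll(Matcher.quoteReplacement("<br><br>"));
  }

  public static void main(String[] args) {
    // Same "mail" value as in the unit test above (Example 2 of the issue).
    String mail =
        "exalted \n \n\n Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore";
    System.out.println(collapse(mail));
  }
}
```

If this prints one <br><br> per run of blank lines, the pattern behaves the 
same in plain Java as on regex101, and it would be worth checking what value 
actually reaches the processor in your setup.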

> RegexReplaceProcessorFactory not making accurate replacement
> ------------------------------------------------------------
>
>                 Key: SOLR-13242
>                 URL: https://issues.apache.org/jira/browse/SOLR-13242
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 7.6, 7.7, 7.7.1
>            Reporter: Edwin Yeo Zheng Lin
>            Priority: Major
>              Labels: regex, solr
>
> We are using the RegexReplaceProcessorFactory, and have tried with all of the 
> following configurations in solrconfig.xml:
>  
> <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">(\s*\r?\n){2,}</str>
>     <str name="replacement"><br><br></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">([ \s]*\r?\n){2,}</str>
>     <str name="replacement"><br><br></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
>  <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">(\s*\n){2,}</str>
>     <str name="replacement"><br><br></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
>  <processor class="solr.RegexReplaceProcessorFactory">
>     <str name="fieldName">content</str>
>     <str name="pattern">(\n\s*){2,}</str>
>     <str name="replacement"><br><br></str>
>     <bool name="literalReplacement">true</bool>
>   </processor>
>  
> The regex patterns (\s*\r?\n){2,}, ([ \s]*\r?\n){2,}, (\s*\n){2,} and 
> (\n\s*){2,} all work perfectly in [regex101.com|http://regex101.com/], where 
> every run of \n is replaced by only two <br>.
> However, in Solr there are cases (Examples 2 and 3 below) that have four 
> <br> in a row. This should not happen, as we have already set it to 
> replace with two <br> regardless of how many \n there are in a row.
>  
>  
> *Example 1: A sentence on which the above regex patterns work correctly* 
> *Original content in EML file:*  
> Dear Sir, 
>  
> I am terminating 
> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> *Index content:*     Dear Sir,  <br><br>I am terminating 
>  
> *Example 2: A sentence on which the above regex patterns only partially work 
> (as you can see, there are 4 <br> instead of 2)*
> *Original content in EML file:*    
> _exalted_
> _Psalm 89:17_
>  
> 3 Choa Chu Kang Avenue 4    
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu 
> Kang Avenue 4, Singapore
> *Index content:* exalted  <br><br>Psalm 89:17   <br><br>  <br><br>3 Choa Chu 
> Kang Avenue 4, Singapore
>  
> *Example 3: A sentence on which the above regex patterns only partially work 
> (as you can see, there are 4 <br> instead of 2)*
> *Original content in EML file:*    
> [http://www.concordpri.moe.edu.sg/]
>  
>  
>  
>  
> On Tue, Dec 18, 2018 at 10:07 AM    
> *Original content:* [http://www.concordpri.moe.edu.sg/]   \n\n   \n\n \n \n\n 
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 
> at 10:07 AM 
> *Index content:* [http://www.concordpri.moe.edu.sg/]   <br><br>  <br><br>On 
> Tue, Dec 18, 2018 at 10:07 AM



