Hi Paul,

I have changed the second pattern from (<br><br>){3,} to (<br>){3,}. Because <br> appears twice inside the group, (<br><br>){3,} only matches runs of 6 or more <br>, not 3 or more. Runs of fewer than 6 <br> were therefore not replaced, which is why we ended up with as many as 5 <br> in a row in the index.
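As a quick check of the counting above, here is a minimal Java sketch (Solr uses Java regex matching, as Jörn noted). The class BrCollapseDemo and its helpers are hypothetical names made up for this illustration, and the use of Matcher.quoteReplacement is only my assumption of how literalReplacement=true behaves. It is not the processor's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only; not part of Solr.
class BrCollapseDemo {

    // Build a run of n consecutive "<br>" tags.
    static String brs(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("<br>");
        }
        return sb.toString();
    }

    // Replace every match of regex with a literal "<br><br>".
    // quoteReplacement is an assumption standing in for literalReplacement=true.
    static String collapse(String regex, String input) {
        return Pattern.compile(regex)
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    // Count how many "<br>" tags remain in s.
    static int countBr(String s) {
        int count = 0;
        int from = 0;
        while ((from = s.indexOf("<br>", from)) != -1) {
            count++;
            from += 4; // skip past the tag just found
        }
        return count;
    }

    public static void main(String[] args) {
        // Compare both patterns on runs of 1 to 8 consecutive <br>.
        for (int n = 1; n <= 8; n++) {
            String input = "a" + brs(n) + "b";
            int paired = countBr(collapse("(<br><br>){3,}", input));
            int single = countBr(collapse("(<br>){3,}", input));
            System.out.println(n + " <br> in: " + paired
                    + " left with (<br><br>){3,}, " + single
                    + " left with (<br>){3,}");
        }
    }
}
```

With (<br><br>){3,}, runs of 3, 4 or 5 <br> are left as they are and only a run of 6 collapses to 2, while (<br>){3,} collapses every run of 3 or more to 2. Note that neither pattern touches <br><br> <br><br>: the space splits it into two runs of 2, which is consistent with the 4 <br> still showing up.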
Modified configuration:

<processor class="solr.RegexReplaceProcessorFactory">
  <str name="fieldName">content</str>
  <str name="pattern">(<br>){3,}</str>
  <str name="replacement"><br><br></str>
  <bool name="literalReplacement">true</bool>
</processor>

This brings us back to the previous index content, so the issue of having 4 <br> is still there.

Regards,
Edwin

On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi Paul,
>
> Further to my previous email, in which there was an extra "}" in the
> configuration, I have changed to the configuration below based on your
> suggestion.
>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">[ \t]*\r?\n</str>
>   <str name="replacement"><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">(<br><br>){3,}</str>
>   <str name="replacement"><br><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
>
> However, the result that I get still has more than 2 <br>. In fact, the
> result has become worse, as you can see from the comparison below.
>
> Example 1: A sentence for which the regex pattern used to work correctly.
> With the latest pattern, it has now changed from 2 <br> to 5 <br>,
> which is wrong.
> *Original content in EML file:*
> Dear Sir,
>
>
> I am terminating
> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> *Previous Index content:* Dear Sir, <br><br>I am terminating
> *Current Index content:* Dear Sir, <br><br><br><br><br> I am terminating
>
> Example 2: A sentence for which the above regex pattern works only partially
> (as you can see, instead of 2 <br>, there are 4 <br>)
> *Original content in EML file:*
>
> *exalted*
>
> *Psalm 89:17*
>
>
> 3 Choa Chu Kang Avenue 4
> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> *Previous Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> *Current Index content:* <br><br><br> Psalm 89:17<br><br> <br><br> 3 Choa Chu Kang Avenue 4, Singapore
>
> Example 3: A sentence for which the above regex pattern works only partially
> (as you can see, instead of 2 <br>, there are 4 <br>). With the latest
> configuration, there are now 5 <br>
> *Original content in EML file:*
>
> http://www.concorded.com/
>
>
>
>
>
>
>
>
> On Tue, Dec 18, 2018 at 10:07 AM
> *Original content:* http://www.concorded.com/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> *Previous Index content:* http://www.concorded.com/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> *Current Index content:* http://www.concorded.com/<br><br> <br><br><br> On Tue, Dec 18, 2018 at 10:07 AM
>
>
> Regards,
> Edwin
>
> On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> Thank you for the reply.
>>
>> I have tried to add the following configuration according to your
>> suggestion:
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>   <str name="fieldName">content</str>
>>   <str name="pattern">[ \t]*\r?\n}</str>
>>   <str name="replacement"><br></str>
>>   <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>   <str name="fieldName">content</str>
>>   <str name="pattern">(<br><br>){3,}</str>
>>   <str name="replacement"><br><br></str>
>>   <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> However, none of the \n is being removed this time round.
>> Is the order and/or the pattern correct?
>>
>> Regards,
>> Edwin
>>
>> On Tue, 5 Mar 2019 at 19:54, <paul.d...@ub.unibe.ch> wrote:
>>
>>> Hi Edwin
>>>
>>> Try for the first pattern/replacement
>>>
>>> <str name="pattern">[ \t]*\r?\n</str>
>>> <str name="replacement"><br></str>
>>>
>>> Now all line endings and preceding whitespace characters should be
>>> changed to ‘<br>’.
>>>
>>> The second pattern replacement should replace 3 or more ‘<br>’ sequences
>>> with 2 ‘<br>’ sequences:
>>>
>>> <str name="pattern">(<br><br>){3,}</str>
>>> <str name="replacement"><br><br></str>
>>>
>>> Hope this approach works. Sorry for not replying earlier and best
>>> regards,
>>>
>>> Paul
>>>
>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> Windows 10
>>>
>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>> Sent: Tuesday, 5 March 2019 03:35
>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>> Hi,
>>>
>>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>>> >>> Regards, >>> Edwin >>> >>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinye...@gmail.com> >>> wrote: >>> >>> > Hi, >>> > >>> > Anyone else has other suggestions or have faced the same problem? >>> > >>> > Regards, >>> > Edwin >>> > >>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo < >>> edwinye...@gmail.com> >>> > wrote: >>> > >>> >> Hi Paul, >>> >> >>> >> If I tried to execute the second step first, then I will only get a >>> >> single <br> for those with 2 <br>. >>> >> For those that we originally get 4 <br>, there will be 2 <br> with a >>> >> space in between. >>> >> >>> >> This is just changing the 2 <br> to be a single <br>, since the second >>> >> step is to replace with a single <br>. >>> >> But it has not solved the underlying problem yet. >>> >> >>> >> Regards, >>> >> Edwin >>> >> >>> >> >>> >> On Wed, 20 Feb 2019 at 16:41, <paul.d...@ub.unibe.ch> wrote: >>> >> >>> >>> If the second step is executed first, then you will get the unwanted >>> 4 >>> >>> <br> >>> >>> >>> >>> >>> >>> >>> >>> Gesendet von Mail<https://go.microsoft.com/fwlink/?LinkId=550986> >>> für >>> >>> Windows 10 >>> >>> >>> >>> >>> >>> >>> >>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> >>> >>> Gesendet: Mittwoch, 20. Februar 2019 09:29 >>> >>> An: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org> >>> >>> Betreff: Re: RegexReplaceProcessorFactory pattern to detect multiple >>> \n >>> >>> >>> >>> >>> >>> >>> >>> Hi Jörn , >>> >>> >>> >>> Do you mean the regex is not correct? >>> >>> >>> >>> We are already using two RegexReplaceProcessorFactory steps, like >>> the one >>> >>> shown below. The output that we get is still the same. 
>>> >>> >>> >>> <processor class="solr.RegexReplaceProcessorFactory"> >>> >>> <str name="fieldName">content</str> >>> >>> <str name="pattern">([ \t]*\r?\n){2,}</str> >>> >>> <str name="replacement"><br><br></str> >>> >>> <bool name="literalReplacement">true</bool> >>> >>> <processor> >>> >>> >>> >>> <processor class="solr.RegexReplaceProcessorFactory"> >>> >>> <str name="fieldName">content</str> >>> >>> <str name="pattern">([ \t]*\r?\n){1,}</str> >>> >>> <str name="replacement"><br></str> >>> >>> <bool name="literalReplacement">true</bool> >>> >>> <processor> >>> >>> >>> >>> Regards, >>> >>> Edwin >>> >>> >>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfra...@gmail.com> >>> wrote: >>> >>> >>> >>> > Then you need two regexprocessfactory steps >>> >>> > >>> >>> > > Am 20.02.2019 um 08:12 schrieb Zheng Lin Edwin Yeo < >>> >>> edwinye...@gmail.com >>> >>> > >: >>> >>> > > >>> >>> > > Hi, >>> >>> > > >>> >>> > > Thanks for the reply. >>> >>> > > >>> >>> > > Do you know of any regex online tool that works correctly for >>> Java >>> >>> regex? >>> >>> > > I tried to find some, but they are not working properly. >>> >>> > > >>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and >>> >>> single \n >>> >>> > > with single <br>. >>> >>> > > >>> >>> > > Regards, >>> >>> > > Edwin >>> >>> > > >>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfra...@gmail.com >>> > >>> >>> wrote: >>> >>> > >> >>> >>> > >> Solr uses Java regex matching, so i doubt there is a bug - it >>> would >>> >>> then >>> >>> > >> be in the JDK. Try out in a regex online Tool that supports Java >>> >>> regex >>> >>> > for >>> >>> > >> your solution. 
>>> >>> > >> >>> >>> > >> I believe you want to have 2 regex process factories: >>> >>> > >> One that deals with single \n and one that deals with more than >>> one >>> >>> \n >>> >>> > >> >>> >>> > >>> Am 20.02.2019 um 06:17 schrieb Zheng Lin Edwin Yeo < >>> >>> > edwinye...@gmail.com >>> >>> > >>> : >>> >>> > >>> >>> >>> > >>> Hi, >>> >>> > >>> >>> >>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and >>> >>> > >>> configuration: >>> >>> > >>> >>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory"> >>> >>> > >>> <str name="fieldName">content</str> >>> >>> > >>> <str name="pattern">([ \t]*\r?\n){2,}</str> >>> >>> > >>> <str name="replacement"><br><br></str> >>> >>> > >>> <bool name="literalReplacement">true</bool> >>> >>> > >>> </processor> >>> >>> > >>> >>> >>> > >>> However, the issue is still occurring. >>> >>> > >>> >>> >>> > >>> Anyone else is able to help? >>> >>> > >>> >>> >>> > >>> Regards, >>> >>> > >>> Edwin >>> >>> > >>> >>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo < >>> >>> > edwinye...@gmail.com> >>> >>> > >>> wrote: >>> >>> > >>> >>> >>> > >>>> Hi, >>> >>> > >>>> >>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well. >>> >>> > >>>> >>> >>> > >>>> Regards, >>> >>> > >>>> Edwin >>> >>> > >>>> >>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo < >>> >>> > edwinye...@gmail.com >>> >>> > >>> >>> >>> > >>>> wrote: >>> >>> > >>>> >>> >>> > >>>>> Hi, >>> >>> > >>>>> >>> >>> > >>>>> Should we report this as a bug in Solr? 
>>> >>> > >>>>> >>> >>> > >>>>> Regards, >>> >>> > >>>>> Edwin >>> >>> > >>>>> >>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo < >>> >>> > edwinye...@gmail.com >>> >>> > >>> >>> >>> > >>>>> wrote: >>> >>> > >>>>> >>> >>> > >>>>>> Hi Paul, >>> >>> > >>>>>> >>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we >>> try >>> >>> in on >>> >>> > >>>>>> https://regex101.com/, it is able to give us the correct >>> >>> result for >>> >>> > >> all >>> >>> > >>>>>> the examples (ie: All of them will only have <br><br>, and >>> not >>> >>> more >>> >>> > >> than >>> >>> > >>>>>> that like what we are getting in Solr in our earlier >>> examples). >>> >>> > >>>>>> >>> >>> > >>>>>> Could there be a possibility of a bug in Solr? >>> >>> > >>>>>> >>> >>> > >>>>>> Regards, >>> >>> > >>>>>> Edwin >>> >>> > >>>>>> >>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo < >>> >>> > >> edwinye...@gmail.com> >>> >>> > >>>>>> wrote: >>> >>> > >>>>>> >>> >>> > >>>>>>> Hi Paul, >>> >>> > >>>>>>> >>> >>> > >>>>>>> We have tried it with the space preceeding the \n i.e. <str >>> >>> > >>>>>>> name="pattern">(\s*\n){2,}</str>, with the following regex >>> >>> pattern: >>> >>> > >>>>>>> >>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory"> >>> >>> > >>>>>>> <str name="fieldName">content</str> >>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str> >>> >>> > >>>>>>> <str name="replacement"><br><br></str> >>> >>> > >>>>>>> </processor> >>> >>> > >>>>>>> >>> >>> > >>>>>>> However, we are also getting the exact same results as the >>> >>> earlier >>> >>> > >>>>>>> Example 1, 2 and 3. >>> >>> > >>>>>>> >>> >>> > >>>>>>> As for your point 2 on perhaps in the data you have other >>> (non >>> >>> > >>>>>>> printing) characters than \n, we have find that there are >>> no >>> >>> non >>> >>> > >> printing >>> >>> > >>>>>>> characters. It is just next line with a space. 
You can >>> refer >>> >>> to the >>> >>> > >>>>>>> original content in the same examples below. >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> Example 1: The sentence that the above regex pattern is >>> working >>> >>> > >>>>>>> correctly >>> >>> > >>>>>>> *Original content in EML file:* >>> >>> > >>>>>>> Dear Sir, >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> I am terminating >>> >>> > >>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am >>> terminating >>> >>> > >>>>>>> *Index content: * Dear Sir, <br><br>I am terminating >>> >>> > >>>>>>> >>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern is >>> >>> partially >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 >>> <br>) >>> >>> > >>>>>>> *Original content in EML file:* >>> >>> > >>>>>>> >>> >>> > >>>>>>> *exalted* >>> >>> > >>>>>>> >>> >>> > >>>>>>> *Psalm 89:17* >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4 >>> >>> > >>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n >>> >>> \n\n 3 >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore >>> >>> > >>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> >>> >>> <br><br>3 >>> >>> > >>>>>>> Choa Chu Kang Avenue 4, Singapore >>> >>> > >>>>>>> >>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern is >>> >>> partially >>> >>> > >>>>>>> working (as you can see, instead of 2 <br>, there are 4 >>> <br>) >>> >>> > >>>>>>> *Original content in EML file:* >>> >>> > >>>>>>> >>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/ >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM >>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ >>> \n\n >>> >>> > \n\n >>> >>> > >> \n >>> >>> > >>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n >>> \n\n\n On >>> >>> Tue, >>> >>> > >> Dec 18, >>> >>> > 
>>>>>>> 2018 at 10:07 AM >>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ >>> <br><br> >>> >>> > >>>>>>> <br><br>On Tue, Dec 18, 2018 at 10:07 AM >>> >>> > >>>>>>> >>> >>> > >>>>>>> >>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may >>> have. >>> >>> > >>>>>>> >>> >>> > >>>>>>> Thank you. >>> >>> > >>>>>>> >>> >>> > >>>>>>> Regards, >>> >>> > >>>>>>> Edwin >>> >>> > >>>>>>> >>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> >>> wrote: >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> Hi Edwin >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> 1. Sorry, the pattern was wrong, the space should preceed >>> >>> the \n >>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str> >>> >>> > >>>>>>>> 2. Perhaps in the data you have other (non printing) >>> >>> characters >>> >>> > >>>>>>>> than \n? >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> Gesendet von Mail< >>> >>> https://go.microsoft.com/fwlink/?LinkId=550986> >>> >>> > >> für >>> >>> > >>>>>>>> Windows 10 >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> >>> >>> > >>>>>>>> Gesendet: Donnerstag, 7. 
Februar 2019 15:23 >>> >>> > >>>>>>>> An: solr-user@lucene.apache.org<mailto: >>> >>> > solr-user@lucene.apache.org> >>> >>> > >>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern to >>> detect >>> >>> > >> multiple \n >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> Hi Paul, >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> We have tried this suggested regex pattern as follow: >>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory"> >>> >>> > >>>>>>>> <str name="fieldName">content</str> >>> >>> > >>>>>>>> <str name="pattern">(\n\s*){2,}</str> >>> >>> > >>>>>>>> <str name="replacement"><br><br></str> >>> >>> > >>>>>>>> </processor> >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> But we still have exactly the same problem of Example 1,2 >>> and >>> >>> 3 >>> >>> > >> below. >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern is >>> >>> working >>> >>> > >>>>>>>> correctly >>> >>> > >>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am >>> >>> terminating >>> >>> > >>>>>>>> *Index content: * Dear Sir, <br><br>I am terminating >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern is >>> >>> partially >>> >>> > >>>>>>>> working >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>) >>> >>> > >>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n >>> >>> \n\n >>> >>> > 3 >>> >>> > >>>>>>>> Choa >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore >>> >>> > >>>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> >>> >>> > <br><br>3 >>> >>> > >>>>>>>> Choa >>> >>> > >>>>>>>> Chu Kang Avenue 4, Singapore >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern is >>> >>> partially >>> >>> > >>>>>>>> working >>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>) >>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ >>> \n\n >>> >>> > \n\n >>> >>> > >>>>>>>> \n \n\n >>> >>> 
> >>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On >>> >>> Tue, Dec >>> >>> > >> 18, >>> >>> > >>>>>>>> 2018 >>> >>> > >>>>>>>> at 10:07 AM >>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ >>> <br><br> >>> >>> > >>>>>>>> <br><br>On >>> >>> > >>>>>>>> Tue, Dec 18, 2018 at 10:07 AM >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> Any further suggestion? >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> Thank you. >>> >>> > >>>>>>>> >>> >>> > >>>>>>>> Regards, >>> >>> > >>>>>>>> Edwin >>> >>> > >>>>>>>> >>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> >>> wrote: >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then >>> failing >>> >>> on >>> >>> > the >>> >>> > >>>>>>>> {2,} >>> >>> > >>>>>>>>> part you could try >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> If you also want to match CRLF then >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> Gesendet von Mail< >>> >>> https://go.microsoft.com/fwlink/?LinkId=550986 >>> >>> > > >>> >>> > >>>>>>>> für >>> >>> > >>>>>>>>> Windows 10 >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> >>> >>> > >>>>>>>>> Gesendet: Donnerstag, 7. Februar 2019 15:10 >>> >>> > >>>>>>>>> An: solr-user@lucene.apache.org<mailto: >>> >>> > solr-user@lucene.apache.org >>> >>> > >>> >>> >>> > >>>>>>>>> Betreff: Re: RegexReplaceProcessorFactory pattern to >>> detect >>> >>> > >> multiple >>> >>> > >>>>>>>> \n >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> Hi Paul, >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> Thanks for your reply. 
>>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> When I use this pattern: >>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory"> >>> >>> > >>>>>>>>> <str name="fieldName">content</str> >>> >>> > >>>>>>>>> <str name="pattern">(\n+\s*){2,}</str> >>> >>> > >>>>>>>>> <str name="replacement"><br><br></str> >>> >>> > >>>>>>>>> </processor> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> It is working for some sentence within the same content >>> and >>> >>> not >>> >>> > >>>>>>>> working for >>> >>> > >>>>>>>>> some sentences. Please see below for the one that is >>> working >>> >>> and >>> >>> > >>>>>>>> another >>> >>> > >>>>>>>>> that is not working (partially working): >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is >>> >>> working >>> >>> > >>>>>>>> correctly >>> >>> > >>>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am >>> >>> terminating >>> >>> > >>>>>>>>> *Index content: * Dear Sir, <br><br>I am terminating >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is >>> >>> partially >>> >>> > >>>>>>>> working >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>) >>> >>> > >>>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n >>> >>> > \n\n 3 >>> >>> > >>>>>>>> Choa >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore >>> >>> > >>>>>>>>> *Index content: *exalted <br><br>Psalm 89:17 <br><br> >>> >>> > <br><br>3 >>> >>> > >>>>>>>> Choa >>> >>> > >>>>>>>>> Chu Kang Avenue 4, Singapore >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is >>> >>> partially >>> >>> > >>>>>>>> working >>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>) >>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ >>> \n\n >>> >>> > >> \n\n >>> >>> > >>>>>>>> \n >>> >>> > >>>>>>>>> \n\n >>> >>> > >>>>>>>>> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On >>> >>> Tue, >>> >>> > Dec >>> >>> 
> >>>>>>>> 18, 2018 >>> >>> > >>>>>>>>> at 10:07 AM >>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/ >>> >>> <br><br> >>> >>> > >>>>>>>> <br><br>On >>> >>> > >>>>>>>>> Tue, Dec 18, 2018 at 10:07 AM >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> We would appreciate your help to see what is wrong? >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> Thank you. >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>> Regards, >>> >>> > >>>>>>>>> Edwin >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> >>> wrote: >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not >>> working. I >>> >>> > assume >>> >>> > >>>>>>>> nothing >>> >>> > >>>>>>>>>> is replaced? Perhaps the pattern should be >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> <str name="pattern">"(\n\s*){2,}"</str> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> ?? >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> Gesendet von Mail< >>> >>> > https://go.microsoft.com/fwlink/?LinkId=550986> >>> >>> > >>>>>>>> für >>> >>> > >>>>>>>>>> Windows 10 >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> Von: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com> >>> >>> > >>>>>>>>>> Gesendet: Donnerstag, 7. 
Februar 2019 14:08 >>> >>> > >>>>>>>>>> An: solr-user@lucene.apache.org<mailto: >>> >>> > >> solr-user@lucene.apache.org >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>>>> Betreff: RegexReplaceProcessorFactory pattern to detect >>> >>> multiple >>> >>> > >> \n >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> Hi, >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to >>> >>> remove >>> >>> > more >>> >>> > >>>>>>>> than >>> >>> > >>>>>>>>> two >>> >>> > >>>>>>>>>> \n with any number of spaces between them (Eg: \n\n, \n >>> \n, >>> >>> \n >>> >>> > \n >>> >>> > >>>>>>>> \n >>> >>> > >>>>>>>>> \n), >>> >>> > >>>>>>>>>> and replace it with two <br>. >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> I use the following regex pattern and it is working >>> when I >>> >>> test >>> >>> > it >>> >>> > >>>>>>>> in >>> >>> > >>>>>>>>>> regex101.com. But it is not working when I put it >>> inside >>> >>> the >>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below: >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode"> >>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory"> >>> >>> > >>>>>>>>>> <str name="fieldName">content</str> >>> >>> > >>>>>>>>>> <str name="pattern">"(\\n\s*){2,}"</str> >>> >>> > >>>>>>>>>> <str name="replacement"><br><br></str> >>> >>> > >>>>>>>>>> </processor> >>> >>> > >>>>>>>>>> </updateRequestProcessorChain> >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> To explain further about my regex pattern, \s* is >>> >>> instructing >>> >>> > the >>> >>> > >>>>>>>> regex >>> >>> > >>>>>>>>> to >>> >>> > >>>>>>>>>> match any \n that have space after and {2,} is >>> instructing >>> >>> the >>> >>> > >>>>>>>> regex to >>> >>> > >>>>>>>>>> match 2 or more occurrence of such pattern (\n). >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how should >>> I do >>> >>> it? 
>>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> I am using Solr 7.6.0. >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>>> Regards, >>> >>> > >>>>>>>>>> Edwin >>> >>> > >>>>>>>>>> >>> >>> > >>>>>>>>> >>> >>> > >>>>>>>> >>> >>> > >>>>>>> >>> >>> > >> >>> >>> > >>> >>> >>> >> >>> >>