Hi Paul,

If I execute the second step first, the places that had 2 <br> get only a single <br>. The places where we originally got 4 <br> get 2 <br> with a space in between.

This merely changes the 2 <br> into a single <br>, since the second step replaces with a single <br>. It does not solve the underlying problem.

Regards,
Edwin

On Wed, 20 Feb 2019 at 16:41, <paul.d...@ub.unibe.ch> wrote:

> If the second step is executed first, then you will get the unwanted 4 <br>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>
> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> Sent: Wednesday, 20 February 2019 09:29
> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>
> Hi Jörn,
>
> Do you mean the regex is not correct?
>
> We are already using two RegexReplaceProcessorFactory steps, like the ones shown below. The output that we get is still the same.
>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">([ \t]*\r?\n){2,}</str>
>   <str name="replacement"><br><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
>
> <processor class="solr.RegexReplaceProcessorFactory">
>   <str name="fieldName">content</str>
>   <str name="pattern">([ \t]*\r?\n){1,}</str>
>   <str name="replacement"><br></str>
>   <bool name="literalReplacement">true</bool>
> </processor>
>
> Regards,
> Edwin
>
> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfra...@gmail.com> wrote:
>
> > Then you need two regexprocessfactory steps
> >
> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Thanks for the reply.
> > >
> > > Do you know of any online regex tool that works correctly for Java regex?
> > > I tried to find some, but they are not working properly.
> > >
> > > Yes, our plan is to replace more than one \n with <br><br>, and a single \n with a single <br>.
> > >
> > > Regards,
> > > Edwin
> > >
> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfra...@gmail.com> wrote:
> > >>
> > >> Solr uses Java regex matching, so I doubt there is a bug; it would then be in the JDK. Try your solution out in an online regex tool that supports Java regex.
> > >>
> > >> I believe you want to have 2 regex process factories:
> > >> one that deals with a single \n and one that deals with more than one \n.
> > >>
> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> We have tried the following pattern, ([ \t]*\r?\n){2,}, and configuration:
> > >>>
> > >>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>   <str name="fieldName">content</str>
> > >>>   <str name="pattern">([ \t]*\r?\n){2,}</str>
> > >>>   <str name="replacement"><br><br></str>
> > >>>   <bool name="literalReplacement">true</bool>
> > >>> </processor>
> > >>>
> > >>> However, the issue is still occurring.
> > >>>
> > >>> Is anyone else able to help?
> > >>>
> > >>> Regards,
> > >>> Edwin
> > >>>
> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
> > >>>>
> > >>>> Regards,
> > >>>> Edwin
> > >>>>
> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> Should we report this as a bug in Solr?
> > >>>>>
> > >>>>> Regards,
> > >>>>> Edwin
> > >>>>>
> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >>>>>
> > >>>>>> Hi Paul,
> > >>>>>>
> > >>>>>> Regarding the regex (\n\s*){2,} that we are using: when we try it on https://regex101.com/, it gives the correct result for all the examples (i.e. all of them have only <br><br>, and not more than that as we are getting in Solr in our earlier examples).
> > >>>>>>
> > >>>>>> Could there be a bug in Solr?
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> Edwin
> > >>>>>>
> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Hi Paul,
> > >>>>>>>
> > >>>>>>> We have tried it with the space preceding the \n, i.e. <str name="pattern">(\s*\n){2,}</str>, with the following configuration:
> > >>>>>>>
> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>   <str name="fieldName">content</str>
> > >>>>>>>   <str name="pattern">(\s*\n){2,}</str>
> > >>>>>>>   <str name="replacement"><br><br></str>
> > >>>>>>> </processor>
> > >>>>>>>
> > >>>>>>> However, we are still getting exactly the same results as in the earlier Examples 1, 2 and 3.
> > >>>>>>>
> > >>>>>>> As for your point 2, that the data may contain other (non-printing) characters than \n: we have found that there are no non-printing characters. It is just a new line with a space. You can refer to the original content in the same examples below.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Example 1: A sentence on which the above regex pattern works correctly
> > >>>>>>> *Original content in EML file:*
> > >>>>>>> Dear Sir,
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> I am terminating
> > >>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> > >>>>>>> *Index content:* Dear Sir, <br><br>I am terminating
> > >>>>>>>
> > >>>>>>> Example 2: A sentence on which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>> *Original content in EML file:*
> > >>>>>>>
> > >>>>>>> *exalted*
> > >>>>>>>
> > >>>>>>> *Psalm 89:17*
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> 3 Choa Chu Kang Avenue 4
> > >>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> > >>>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> > >>>>>>>
> > >>>>>>> Example 3: A sentence on which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>> *Original content in EML file:*
> > >>>>>>>
> > >>>>>>> http://www.concordpri.moe.edu.sg/
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> We would appreciate any other ideas or suggestions that you may have.
> > >>>>>>>
> > >>>>>>> Thank you.
> > >>>>>>>
> > >>>>>>> Regards,
> > >>>>>>> Edwin
> > >>>>>>>
> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi Edwin
> > >>>>>>>>
> > >>>>>>>> 1. Sorry, the pattern was wrong; the space should precede the \n, i.e. <str name="pattern">(\s*\n){2,}</str>
> > >>>>>>>> 2. Perhaps in the data you have other (non-printing) characters than \n?
> > >>>>>>>>
> > >>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> > >>>>>>>>
> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>>>>>>
> > >>>>>>>> Hi Paul,
> > >>>>>>>>
> > >>>>>>>> We have tried the suggested regex pattern as follows:
> > >>>>>>>>
> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>   <str name="fieldName">content</str>
> > >>>>>>>>   <str name="pattern">(\n\s*){2,}</str>
> > >>>>>>>>   <str name="replacement"><br><br></str>
> > >>>>>>>> </processor>
> > >>>>>>>>
> > >>>>>>>> But we still have exactly the same problem as in Examples 1, 2 and 3 below.
> > >>>>>>>>
> > >>>>>>>> Example 1: A sentence on which the above regex pattern works correctly
> > >>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> > >>>>>>>> *Index content:* Dear Sir, <br><br>I am terminating
> > >>>>>>>>
> > >>>>>>>> Example 2: A sentence on which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> > >>>>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> > >>>>>>>>
> > >>>>>>>> Example 3: A sentence on which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>>
> > >>>>>>>> Any further suggestions?
> > >>>>>>>>
> > >>>>>>>> Thank you.
> > >>>>>>>>
> > >>>>>>>> Regards,
> > >>>>>>>> Edwin
> > >>>>>>>>
> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
> > >>>>>>>>>
> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the {2,} part you could try
> > >>>>>>>>>
> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
> > >>>>>>>>>
> > >>>>>>>>> If you also want to match CRLF then
> > >>>>>>>>>
> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
> > >>>>>>>>>
> > >>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> > >>>>>>>>>
> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>>>>>>>
> > >>>>>>>>> Hi Paul,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for your reply.
> > >>>>>>>>>
> > >>>>>>>>> When I use this pattern:
> > >>>>>>>>>
> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>>   <str name="fieldName">content</str>
> > >>>>>>>>>   <str name="pattern">(\n+\s*){2,}</str>
> > >>>>>>>>>   <str name="replacement"><br><br></str>
> > >>>>>>>>> </processor>
> > >>>>>>>>>
> > >>>>>>>>> it works for some sentences within the same content and not for others. Please see below for one that works and others that work only partially:
> > >>>>>>>>>
> > >>>>>>>>> Example 1: A sentence on which the above regex pattern works correctly
> > >>>>>>>>> *Original content:* Dear Sir, \n\n \n \n\n I am terminating
> > >>>>>>>>> *Index content:* Dear Sir, <br><br>I am terminating
> > >>>>>>>>>
> > >>>>>>>>> Example 2: A sentence on which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>>> *Original content:* exalted \n \n\n Psalm 89:17 \n\n \n\n 3 Choa Chu Kang Avenue 4, Singapore
> > >>>>>>>>> *Index content:* exalted <br><br>Psalm 89:17 <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> > >>>>>>>>>
> > >>>>>>>>> Example 3: A sentence on which the above regex pattern works only partially (as you can see, instead of 2 <br>, there are 4 <br>)
> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/ \n\n \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>>> *Index content:* http://www.concordpri.moe.edu.sg/ <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> > >>>>>>>>>
> > >>>>>>>>> We would appreciate your help in finding out what is wrong.
> > >>>>>>>>>
> > >>>>>>>>> Thank you.
> > >>>>>>>>>
> > >>>>>>>>> Regards,
> > >>>>>>>>> Edwin
> > >>>>>>>>>
> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> You don’t say what happens, just that it is not working. I assume nothing is replaced? Perhaps the pattern should be
> > >>>>>>>>>>
> > >>>>>>>>>> <str name="pattern">"(\n\s*){2,}"</str>
> > >>>>>>>>>>
> > >>>>>>>>>> ??
> > >>>>>>>>>>
> > >>>>>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> > >>>>>>>>>>
> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
> > >>>>>>>>>>
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to detect two or more \n with any number of spaces between them (e.g. \n\n, \n \n, \n \n \n \n) and replace them with two <br>.
> > >>>>>>>>>>
> > >>>>>>>>>> The following regex pattern works when I test it on regex101.com, but it does not work when I put it inside the RegexReplaceProcessorFactory as below:
> > >>>>>>>>>>
> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
> > >>>>>>>>>>   <processor class="solr.RegexReplaceProcessorFactory">
> > >>>>>>>>>>     <str name="fieldName">content</str>
> > >>>>>>>>>>     <str name="pattern">"(\\n\s*){2,}"</str>
> > >>>>>>>>>>     <str name="replacement"><br><br></str>
> > >>>>>>>>>>   </processor>
> > >>>>>>>>>> </updateRequestProcessorChain>
> > >>>>>>>>>>
> > >>>>>>>>>> To explain my regex pattern further: \s* tells the regex to match any \n that is followed by whitespace, and {2,} tells it to match 2 or more occurrences of that group.
> > >>>>>>>>>>
> > >>>>>>>>>> Please let me know what is wrong and how I should do it.
> > >>>>>>>>>>
> > >>>>>>>>>> I am using Solr 7.6.0.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards,
> > >>>>>>>>>> Edwin
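
For anyone who wants to try the patterns from this thread outside Solr, here is a minimal sketch using plain java.util.regex (the thread notes that Solr uses Java regex matching). It is not Solr code: the class name, method names and input strings are made up for illustration. The first part replays the two-step configuration in both orders. The second part tests one possible explanation for the split <br><br> runs that nobody in the thread has confirmed: Java's default \s covers only [ \t\n\x0B\f\r], so a character such as U+00A0 (no-break space) between the newlines would end the match early. Whether the EML data actually contains such a character is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Solr: replays the thread's patterns with java.util.regex. */
final class NewlineToBrDemo {

    // The two patterns from the two-step configuration quoted in the thread.
    static final Pattern TWO_OR_MORE = Pattern.compile("([ \\t]*\\r?\\n){2,}");
    static final Pattern ONE_OR_MORE = Pattern.compile("([ \\t]*\\r?\\n){1,}");

    // Pattern suggested in the thread; Java's default \s is [ \t\n\x0B\f\r].
    static final Pattern THREAD_PATTERN = Pattern.compile("(\\n\\s*){2,}");

    // Assumption: a variant that also treats U+00A0 (no-break space) as blank.
    static final Pattern NBSP_AWARE = Pattern.compile("(\\n[\\s\\u00A0]*){2,}");

    /** Replaces every match with the replacement taken literally (no $1 group references). */
    static String replace(Pattern pattern, String text, String replacement) {
        return pattern.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    /** Multi-newline step first, then the single-newline step. */
    static String multiThenSingle(String text) {
        return replace(ONE_OR_MORE, replace(TWO_OR_MORE, text, "<br><br>"), "<br>");
    }

    /** Reversed order: the single-newline step consumes every newline run first. */
    static String singleThenMulti(String text) {
        return replace(TWO_OR_MORE, replace(ONE_OR_MORE, text, "<br>"), "<br><br>");
    }

    public static void main(String[] args) {
        String plain = "a\nb\n\n\nc";
        System.out.println(multiThenSingle(plain)); // a<br>b<br><br>c
        System.out.println(singleThenMulti(plain)); // a<br>b<br>c

        // Made-up input: a no-break space sits between two newline pairs.
        String withNbsp = "Psalm 89:17 \n\n\u00A0\n\n 3 Choa";
        // The thread's pattern stops at the no-break space, giving two <br><br> runs.
        System.out.println(replace(THREAD_PATTERN, withNbsp, "<br><br>"));
        // The variant spans the whole run, giving a single <br><br>.
        System.out.println(replace(NBSP_AWARE, withNbsp, "<br><br>"));
    }
}
```

If the third line printed shows two <br><br> runs separated by a blank while the fourth shows only one, the output has the same shape as Examples 2 and 3 above. Dumping the code points of the indexed field value would show whether the real data contains anything outside Java's \s.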