Hi Paul,

I have changed the second pattern from (<br><br>){3,} to (<br>){3,}.
Because the group in (<br><br>){3,} contains <br> twice, that pattern only
matches runs of 6 or more <br>, not 3 or more. Runs of fewer than 6 <br>
were therefore left unreplaced, which is why we ended up with up to 5 <br>
in the index.
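
To illustrate the difference between the two quantified groups, here is a
minimal standalone Java sketch. Solr uses java.util.regex, but the class and
method names below are made up for this example, and the code only mimics the
replacement step rather than calling Solr itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it only mimics what the processor's regex does.
class BrQuantifierDemo {
    // Earlier pattern: the group holds two <br>, so {3,} needs 3 x 2 = 6 of them.
    static final Pattern DOUBLED = Pattern.compile("(<br><br>){3,}");
    // Modified pattern: the group holds one <br>, so {3,} needs only 3.
    static final Pattern SINGLE = Pattern.compile("(<br>){3,}");

    // Replace every match with a literal "<br><br>", as literalReplacement=true would.
    static String collapse(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    // Build a run of n consecutive <br> tags.
    static String brRun(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("<br>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 7; n++) {
            String input = "A" + brRun(n) + "B";
            System.out.println(n + " <br> in input");
            System.out.println("  (<br><br>){3,} -> " + collapse(DOUBLED, input));
            System.out.println("  (<br>){3,}     -> " + collapse(SINGLE, input));
        }
    }
}
```

With 3, 4 or 5 <br> in the input, the doubled pattern leaves the text
unchanged, while the single pattern reduces it to 2 <br>.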

Modified configuration:
 <processor class="solr.RegexReplaceProcessorFactory">
   <str name="fieldName">content</str>
   <str name="pattern">(&lt;br&gt;){3,}</str>
   <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
   <bool name="literalReplacement">true</bool>
 </processor>

With this change, the index content is back to the previous result, so the
issue of getting 4 <br> is still there.
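
One explanation that would produce exactly this shape is the possibility
raised earlier in the thread, that the gaps between the line breaks contain
something other than a plain space or tab. This is only an assumption and has
not been confirmed on the actual data. The Java sketch below uses a
non-breaking space (U+00A0) as the assumed character, with made-up class and
method names and a shortened version of Example 2 as input. In Java, [ \t]
and the default \s do not match U+00A0, while \h (horizontal whitespace,
Java 8 and later) does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it mimics the two processors with plain java.util.regex.
class TwoStepBrDemo {
    // Step 1 as configured: optional spaces/tabs followed by a line ending.
    static final Pattern STEP1_CONFIGURED = Pattern.compile("[ \\t]*\\r?\\n");
    // Step 1 variant (assumption): \h also covers U+00A0 and other horizontal whitespace.
    static final Pattern STEP1_VARIANT = Pattern.compile("\\h*\\r?\\n");
    // Step 2 as in the modified configuration.
    static final Pattern STEP2 = Pattern.compile("(<br>){3,}");

    // Apply step 1 (line ending -> <br>), then step 2 (3 or more <br> -> 2 <br>).
    static String run(Pattern step1, String input) {
        String afterStep1 = step1.matcher(input).replaceAll(Matcher.quoteReplacement("<br>"));
        return STEP2.matcher(afterStep1).replaceAll(Matcher.quoteReplacement("<br><br>"));
    }

    // Make non-breaking spaces visible when printing.
    static String show(String s) {
        return s.replace("\u00A0", "[NBSP]");
    }

    public static void main(String[] args) {
        // Same text twice: once with plain spaces in the gap, once with U+00A0 (assumed).
        String plainSpaces = "Psalm 89:17   \n\n   \n\n  3 Choa";
        String withNbsp = "Psalm 89:17   \n\n\u00A0\u00A0\u00A0\n\n  3 Choa";

        System.out.println("plain spaces, configured: " + show(run(STEP1_CONFIGURED, plainSpaces)));
        System.out.println("NBSP gap, configured:     " + show(run(STEP1_CONFIGURED, withNbsp)));
        System.out.println("NBSP gap, \\h variant:     " + show(run(STEP1_VARIANT, withNbsp)));
    }
}
```

If the data really does contain such characters, the configured step 1 leaves
them between the <br> pairs and step 2 never sees a run of three, which would
match the 4 <br> we observe. This would need to be checked against the raw
field content.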

Regards,
Edwin



On Wed, 6 Mar 2019 at 11:37, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

> Hi Paul,
>
> Further to my previous email, in which there was an extra "}" in the
> configuration, I have now used the configuration below, based on your
> suggestion.
>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">[ \t]*\r?\n</str>
>    <str name="replacement">&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
> <processor class="solr.RegexReplaceProcessorFactory">
>    <str name="fieldName">content</str>
>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>    <bool name="literalReplacement">true</bool>
> </processor>
>
> However, the result that I get still has more than 2 <br>. In fact, the
> result has become worse, as you can see from the comparison below.
>
> Example 1: The sentence for which the regex pattern used to work correctly.
> With the latest pattern, it has changed from 2 <br> to 5 <br>, which is
> wrong.
> *Original content in EML file:*
> Dear Sir,
>
>
> I am terminating
> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
> *Previous Index content: *    Dear Sir,  <br><br>I am terminating
> *Current Index content*:   Dear Sir, <br><br><br><br><br> I am terminating
>
> Example 2: The sentence for which the above regex pattern is only partially
> working (as you can see, instead of 2 <br>, there are 4 <br>).
> *Original content in EML file:*
>
> *exalted*
>
> *Psalm 89:17*
>
>
> 3 Choa Chu Kang Avenue 4
> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa
> Chu Kang Avenue 4, Singapore
> *Previous Index content: *exalted  <br><br>Psalm 89:17   <br><br>
> <br><br>3 Choa Chu Kang Avenue 4, Singapore
> *Current Index content*: <br><br><br>   Psalm 89:17<br><br>  <br><br>  3
> Choa Chu Kang Avenue 4, Singapore
>
> Example 3: The sentence for which the above regex pattern is only partially
> working (as you can see, instead of 2 <br>, there are 4 <br>). With the
> latest configuration, there are now 5 <br>.
> *Original content in EML file:*
>
> http://www.concorded.com/
>
>
>
>
>
>
>
>
> On Tue, Dec 18, 2018 at 10:07 AM
> *Original content:* http://www.concorded.com/   \n\n   \n\n \n \n\n \n\n
> \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at
> 10:07 AM
> *Previous Index content: *http://www.concorded.com/   <br><br>
> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
> *Current Index content:* http://www.concorded.com/<br><br>  <br><br><br>
> On Tue, Dec 18, 2018 at 10:07 AM
>
>
> Regards,
> Edwin
>
> On Wed, 6 Mar 2019 at 00:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
>
>> Hi Paul,
>>
>> Thank you for the reply.
>>
>> I have tried to add the following configuration according to your
>> suggestion:
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">[ \t]*\r?\n}</str>
>>    <str name="replacement">&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> <processor class="solr.RegexReplaceProcessorFactory">
>>    <str name="fieldName">content</str>
>>    <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>    <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>    <bool name="literalReplacement">true</bool>
>> </processor>
>>
>> However, none of the \n are being replaced this time round.
>> Is the order and/or the pattern correct?
>>
>> Regards,
>> Edwin
>>
>> On Tue, 5 Mar 2019 at 19:54, <paul.d...@ub.unibe.ch> wrote:
>>
>>> Hi Edwin
>>>
>>>
>>>
>>> Try for the first pattern/replacement
>>>
>>>
>>>
>>> <str name="pattern">[ \t]*\r?\n</str>
>>>
>>> <str name="replacement">&lt;br&gt;</str>
>>>
>>>
>>>
>>> Now all line endings and preceding whitespace characters should be
>>> changed to ‘<br>’.
>>>
>>>
>>>
>>> The second pattern replacement should replace 3 or more ‘<br>’ sequences
>>> to 2 ‘<br>’ sequences:
>>>
>>>
>>>
>>> <str name="pattern">(&lt;br&gt;&lt;br&gt;){3,}</str>
>>>
>>> <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>>
>>>
>>>
>>> Hope this approach works. Sorry for not replying earlier and best
>>> regards,
>>>
>>> Paul
>>>
>>>
>>>
>>>
>>>
>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>> Sent: Tuesday, 5 March 2019 03:35
>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>>
>>>
>>>
>>> Hi,
>>>
>>> For your info, this issue is occurring in the new Solr 7.7.1 as well.
>>>
>>> Regards,
>>> Edwin
>>>
>>> On Mon, 25 Feb 2019 at 10:28, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > Does anyone else have other suggestions, or has anyone faced the same
>>> > problem?
>>> >
>>> > Regards,
>>> > Edwin
>>> >
>>> > On Wed, 20 Feb 2019 at 16:58, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi Paul,
>>> >>
>>> >> If I execute the second step first, I only get a single <br> where
>>> >> there used to be 2 <br>.
>>> >> Where we originally got 4 <br>, there are now 2 <br> with a space in
>>> >> between.
>>> >>
>>> >> This just changes the 2 <br> into a single <br>, since the second step
>>> >> replaces with a single <br>.
>>> >> But it has not solved the underlying problem yet.
>>> >>
>>> >> Regards,
>>> >> Edwin
>>> >>
>>> >>
>>> >> On Wed, 20 Feb 2019 at 16:41, <paul.d...@ub.unibe.ch> wrote:
>>> >>
>>> >>> If the second step is executed first, then you will get the unwanted
>>> >>> 4 <br>
>>> >>>
>>> >>>
>>> >>>
>>> >>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>> >>> Sent: Wednesday, 20 February 2019 09:29
>>> >>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> >>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>>
>>> >>>
>>> >>>
>>> >>> Hi Jörn ,
>>> >>>
>>> >>> Do you mean the regex is not correct?
>>> >>>
>>> >>> We are already using two RegexReplaceProcessorFactory steps, like the
>>> >>> ones shown below. The output that we get is still the same.
>>> >>>
>>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>>      <str name="fieldName">content</str>
>>> >>>      <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> >>>      <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>>      <bool name="literalReplacement">true</bool>
>>> >>> </processor>
>>> >>>
>>> >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>>      <str name="fieldName">content</str>
>>> >>>      <str name="pattern">([ \t]*\r?\n){1,}</str>
>>> >>>      <str name="replacement">&lt;br&gt;</str>
>>> >>>      <bool name="literalReplacement">true</bool>
>>> >>> </processor>
>>> >>>
>>> >>> Regards,
>>> >>> Edwin
>>> >>>
>>> >>> On Wed, 20 Feb 2019 at 16:03, Jörn Franke <jornfra...@gmail.com> wrote:
>>> >>>
>>> >>> > Then you need two RegexReplaceProcessorFactory steps
>>> >>> >
>>> >>> > > On 20.02.2019 at 08:12, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>> >>> > >
>>> >>> > > Hi,
>>> >>> > >
>>> >>> > > Thanks for the reply.
>>> >>> > >
>>> >>> > > Do you know of any online regex tool that works correctly for Java
>>> >>> > > regex? I tried to find some, but they are not working properly.
>>> >>> > >
>>> >>> > > Yes, our plan is to replace more than one \n with <br><br>, and a
>>> >>> > > single \n with a single <br>.
>>> >>> > >
>>> >>> > > Regards,
>>> >>> > > Edwin
>>> >>> > >
>>> >>> > >> On Wed, 20 Feb 2019 at 14:59, Jörn Franke <jornfra...@gmail.com> wrote:
>>> >>> > >>
>>> >>> > >> Solr uses Java regex matching, so I doubt there is a bug; it would
>>> >>> > >> then be in the JDK. Try out your solution in an online regex tool
>>> >>> > >> that supports Java regex.
>>> >>> > >>
>>> >>> > >> I believe you want to have 2 regex processor factories:
>>> >>> > >> one that deals with a single \n and one that deals with more than
>>> >>> > >> one \n.
>>> >>> > >>
>>> >>> > >>> On 20.02.2019 at 06:17, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>> >>> > >>>
>>> >>> > >>> Hi,
>>> >>> > >>>
>>> >>> > >>> We have tried with the following pattern ([ \t]*\r?\n){2,} and
>>> >>> > >>> configuration:
>>> >>> > >>>
>>> >>> > >>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>  <str name="fieldName">content</str>
>>> >>> > >>>  <str name="pattern">([ \t]*\r?\n){2,}</str>
>>> >>> > >>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>  <bool name="literalReplacement">true</bool>
>>> >>> > >>> </processor>
>>> >>> > >>>
>>> >>> > >>> However, the issue is still occurring.
>>> >>> > >>>
>>> >>> > >>> Is anyone else able to help?
>>> >>> > >>>
>>> >>> > >>> Regards,
>>> >>> > >>> Edwin
>>> >>> > >>>
>>> >>> > >>> On Fri, 15 Feb 2019 at 11:47, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>> >>> > >>> wrote:
>>> >>> > >>>
>>> >>> > >>>> Hi,
>>> >>> > >>>>
>>> >>> > >>>> For your info, this issue is occurring in Solr 7.7.0 as well.
>>> >>> > >>>>
>>> >>> > >>>> Regards,
>>> >>> > >>>> Edwin
>>> >>> > >>>>
>>> >>> > >>>> On Tue, 12 Feb 2019 at 00:10, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>> >>> > >>>> wrote:
>>> >>> > >>>>
>>> >>> > >>>>> Hi,
>>> >>> > >>>>>
>>> >>> > >>>>> Should we report this as a bug in Solr?
>>> >>> > >>>>>
>>> >>> > >>>>> Regards,
>>> >>> > >>>>> Edwin
>>> >>> > >>>>>
>>> >>> > >>>>> On Fri, 8 Feb 2019 at 22:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>> >>> > >>>>> wrote:
>>> >>> > >>>>>
>>> >>> > >>>>>> Hi Paul,
>>> >>> > >>>>>>
>>> >>> > >>>>>> Regarding the regex (\n\s*){2,} that we are using, when we try it on
>>> >>> > >>>>>> https://regex101.com/, it gives us the correct result for all the
>>> >>> > >>>>>> examples (i.e. all of them only have <br><br>, and not more than
>>> >>> > >>>>>> that, unlike what we are getting in Solr in our earlier examples).
>>> >>> > >>>>>>
>>> >>> > >>>>>> Could there be a possibility of a bug in Solr?
>>> >>> > >>>>>>
>>> >>> > >>>>>> Regards,
>>> >>> > >>>>>> Edwin
>>> >>> > >>>>>>
>>> >>> > >>>>>> On Fri, 8 Feb 2019 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>>> >>> > >>>>>> wrote:
>>> >>> > >>>>>>
>>> >>> > >>>>>>> Hi Paul,
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> We have tried it with the space preceding the \n, i.e.
>>> >>> > >>>>>>> <str name="pattern">(\s*\n){2,}</str>, using the following
>>> >>> > >>>>>>> configuration:
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>  <str name="pattern">(\s*\n){2,}</str>
>>> >>> > >>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>> </processor>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> However, we are getting exactly the same results as in the earlier
>>> >>> > >>>>>>> Examples 1, 2 and 3.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> As for your point 2, that the data may contain other (non-printing)
>>> >>> > >>>>>>> characters than \n, we have found that there are no non-printing
>>> >>> > >>>>>>> characters. It is just a new line with a space. You can refer to
>>> >>> > >>>>>>> the original content in the same examples below.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Example 1: The sentence that the above regex pattern is working correctly
>>> >>> > >>>>>>> *Original content in EML file:*
>>> >>> > >>>>>>> Dear Sir,
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> I am terminating
>>> >>> > >>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>> >>> > >>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Example 2: The sentence that the above regex pattern is partially working
>>> >>> > >>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>> *Original content in EML file:*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> *exalted*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> *Psalm 89:17*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> 3 Choa Chu Kang Avenue 4
>>> >>> > >>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Example 3: The sentence that the above regex pattern is partially working
>>> >>> > >>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>> *Original content in EML file:*
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> http://www.concordpri.moe.edu.sg/
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Appreciate any other ideas or suggestions that you may have.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Thank you.
>>> >>> > >>>>>>>
>>> >>> > >>>>>>> Regards,
>>> >>> > >>>>>>> Edwin
>>> >>> > >>>>>>>
>>> >>> > >>>>>>>> On Thu, 7 Feb 2019 at 22:49, <paul.d...@ub.unibe.ch> wrote:
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Hi Edwin
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> 1.  Sorry, the pattern was wrong; the space should precede the \n,
>>> >>> > >>>>>>>> i.e. <str name="pattern">(\s*\n){2,}</str>
>>> >>> > >>>>>>>> 2.  Perhaps in the data you have other (non-printing) characters
>>> >>> > >>>>>>>> than \n?
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>> >>> > >>>>>>>> Sent: Thursday, 7 February 2019 15:23
>>> >>> > >>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> >>> > >>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Hi Paul,
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> We have tried this suggested regex pattern as follows:
>>> >>> > >>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>>  <str name="pattern">(\n\s*){2,}</str>
>>> >>> > >>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>>> </processor>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> But we still have exactly the same problem with Examples 1, 2 and 3
>>> >>> > >>>>>>>> below.
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Example 1: The sentence that the above regex pattern is working correctly
>>> >>> > >>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>> >>> > >>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Example 2: The sentence that the above regex pattern is partially working
>>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Example 3: The sentence that the above regex pattern is partially working
>>> >>> > >>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Any further suggestion?
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Thank you.
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>> Regards,
>>> >>> > >>>>>>>> Edwin
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>>> On Thu, 7 Feb 2019 at 22:20, <paul.d...@ub.unibe.ch> wrote:
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> To avoid the «\n+\s*» matching too many \n and then failing on the
>>> >>> > >>>>>>>>> {2,} part you could try
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> <str name="pattern">(\n\s*){2,}</str>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> If you also want to match CRLF then
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> <str name="pattern">(\r?\n\s*){2,}</str>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>> >>> > >>>>>>>>> Sent: Thursday, 7 February 2019 15:10
>>> >>> > >>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> >>> > >>>>>>>>> Subject: Re: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Hi Paul,
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Thanks for your reply.
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> When I use this pattern:
>>> >>> > >>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>>>  <str name="pattern">(\n+\s*){2,}</str>
>>> >>> > >>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>>>> </processor>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> It works for some sentences within the same content and not for
>>> >>> > >>>>>>>>> others. Please see below for one that is working and others that
>>> >>> > >>>>>>>>> are only partially working:
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Example 1: The sentence that the above regex pattern is working correctly
>>> >>> > >>>>>>>>> *Original content:*    Dear Sir,  \n\n \n \n\n I am terminating
>>> >>> > >>>>>>>>> *Index content: *    Dear Sir,  <br><br>I am terminating
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Example 2: The sentence that the above regex pattern is partially working
>>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>>> *Original content:* exalted  \n \n\n   Psalm 89:17   \n\n   \n\n  3 Choa Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>>> *Index content: *exalted  <br><br>Psalm 89:17   <br><br> <br><br>3 Choa Chu Kang Avenue 4, Singapore
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Example 3: The sentence that the above regex pattern is partially working
>>> >>> > >>>>>>>>> (as you can see, instead of 2 <br>, there are 4 <br>)
>>> >>> > >>>>>>>>> *Original content:* http://www.concordpri.moe.edu.sg/   \n\n   \n\n \n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n\n \n\n\n  On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>>> *Index content: *http://www.concordpri.moe.edu.sg/   <br><br> <br><br>On Tue, Dec 18, 2018 at 10:07 AM
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> We would appreciate your help in seeing what is wrong.
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Thank you.
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>> Regards,
>>> >>> > >>>>>>>>> Edwin
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>>> On Thu, 7 Feb 2019 at 21:24, <paul.d...@ub.unibe.ch> wrote:
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> You don’t say what happens, just that it is not working. I assume
>>> >>> > >>>>>>>>>> nothing is replaced? Perhaps the pattern should be
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>  <str name="pattern">"(\n\s*){2,}"</str>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> ??
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> From: Zheng Lin Edwin Yeo<mailto:edwinye...@gmail.com>
>>> >>> > >>>>>>>>>> Sent: Thursday, 7 February 2019 14:08
>>> >>> > >>>>>>>>>> To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
>>> >>> > >>>>>>>>>> Subject: RegexReplaceProcessorFactory pattern to detect multiple \n
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> Hi,
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> I am trying to use the RegexReplaceProcessorFactory to find two or
>>> >>> > >>>>>>>>>> more \n with any number of spaces between them (e.g. \n\n, \n \n,
>>> >>> > >>>>>>>>>> \n \n \n \n), and replace them with two <br>.
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> I use the following regex pattern, and it works when I test it on
>>> >>> > >>>>>>>>>> regex101.com. But it does not work when I put it inside the
>>> >>> > >>>>>>>>>> RegexReplaceProcessorFactory as below:
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> <updateRequestProcessorChain name="removeCode">
>>> >>> > >>>>>>>>>> <processor class="solr.RegexReplaceProcessorFactory">
>>> >>> > >>>>>>>>>>  <str name="fieldName">content</str>
>>> >>> > >>>>>>>>>>  <str name="pattern">"(\\n\s*){2,}"</str>
>>> >>> > >>>>>>>>>>  <str name="replacement">&lt;br&gt;&lt;br&gt;</str>
>>> >>> > >>>>>>>>>> </processor>
>>> >>> > >>>>>>>>>>         </updateRequestProcessorChain>
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> To explain my regex pattern further: \s* instructs the regex to
>>> >>> > >>>>>>>>>> match any whitespace after a \n, and {2,} instructs the regex to
>>> >>> > >>>>>>>>>> match 2 or more occurrences of that group (\n followed by optional
>>> >>> > >>>>>>>>>> whitespace).
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> Please kindly let me know what is wrong and how I should do it.
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> I am using Solr 7.6.0.
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>> Regards,
>>> >>> > >>>>>>>>>> Edwin
>>> >>> > >>>>>>>>>>
>>> >>> > >>>>>>>>>
>>> >>> > >>>>>>>>
>>> >>> > >>>>>>>
>>> >>> > >>
>>> >>> >
>>> >>>
>>> >>
>>>
>>
