Hi all,

Apologies, I didn't realize my correspondence with Bruce hadn't been added 
to the whole thread! My use case for this grep search is that I am 
converting PDFs to .txt files for text mining. During the conversion, many 
of the files ended up with duplicated words. For example:

Here isis anan exampleexample of the way inin whichwhich somesome of the 
texttext wouldwould be transferred to .txt

I was trying to use a grep search to find the duplicated words and replace 
each with a single instance of the word, e.g. change "exampleexample" to 
"example".

Bruce's solution, '\b(\w+)\1\b', did end up working for me.
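
For anyone who wants to check the pattern outside BBEdit, here is a minimal sketch. It assumes Python's standard `re` module, which was not used in this thread, and the function name `collapse_doubled_words` is made up for illustration. One caveat: the pattern also collapses strings that are legitimately doubled, such as "11" or "murmur".

```python
import re

# Bruce's pattern: a run of word characters immediately followed by an
# identical run, with the pair bounded by word boundaries.
DOUBLED = re.compile(r"\b(\w+)\1\b")

def collapse_doubled_words(text: str) -> str:
    """Replace each doubled word (e.g. 'exampleexample') with one copy."""
    return DOUBLED.sub(r"\1", text)

# The sample sentence from this message.
sample = ("Here isis anan exampleexample of the way inin whichwhich "
          "somesome of the texttext wouldwould be transferred to .txt")
print(collapse_doubled_words(sample))
```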

I had also contacted BBEdit support, and they said that error code 
12247 is a "match limit exceeded" error and suspected it was due to the 
number of matches the original grep search was finding.

Thanks, all!

On Sunday, December 15, 2024 at 9:35:02 PM UTC-5 GP wrote:

> Yes, that's how the original '\b(\w+)+\1\b' works in practice. The 
> leading zero or more word characters aren't captured but are part of the 
> match, so they are deleted when the match is replaced with just the \1 
> capture group. Since \w* and (?:\w+)* match the same text in practice, 
> perhaps the expression '\b(?:\w+)*(\w+)\1\b' better shows that only the 
> last iteration of (\w+) in (\w+)+\1 is captured: any preceding \w+ 
> iterations are discarded as captures and aren't part of the one, final 
> capture group.
>
> Take for example the word facilisis. The regular expression engine ends up 
> with a leading match on - facil -, a group 1 capture of - is -, and a 
> backreference (\1) match of that captured - is -. The whole matched word 
> is then replaced by just the group 1 string 'is' (without the quotes). 
> Your guess is as good as mine as to how much string slicing and dicing, 
> capturing, and capture discarding the engine performs before arriving at 
> that match and capture group solution.
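
The captures described above can be checked with a small sketch. It assumes Python's standard `re` module rather than BBEdit's grep engine, so it shows the same capture behavior only on these sample words. The helper name `whole_and_group1` and the dictionary labels are made up for illustration.

```python
import re

# The three find patterns discussed in this thread.
PATTERNS = {
    "repeated group": r"\b(\w+)+\1\b",
    "leading \\w*": r"\b\w*(\w+)\1\b",
    "leading (?:\\w+)*": r"\b(?:\w+)*(\w+)\1\b",
}

def whole_and_group1(pattern: str, word: str):
    """Return (whole match, capture group 1), or None if nothing matches."""
    m = re.search(pattern, word)
    return (m.group(0), m.group(1)) if m else None

# Words taken from the Pattern Playground results quoted further down.
for word in ("facilisis", "Underhill", "11"):
    for label, pattern in PATTERNS.items():
        print(f"{word:10} {label:18} {whole_and_group1(pattern, word)}")
```

On these words all three patterns report the whole word as the match and only the trailing repeated piece as group 1, which is why replacing with \1 deletes the leading characters.
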
>
> That said, I flubbed the copy and paste at the end of that comment, where 
> I discuss backtracking. I intended to use '\b(\w+)+\1\b' for the 
> backtracking remark but instead copied and pasted '\b\w*(\w+)\1\b'. 
> As it turns out, both do a whole lot of backtracking, but '\b\w*(\w+)\1\b' 
> does slightly less than '\b(\w+)+\1\b' on the example search 
> text I was using.
>
> On Sunday, December 15, 2024 at 4:14:09 PM UTC-8 Bruce Van Allen wrote:
>
>> Thanks for digging into the regex meaning of that second ‘+’ in 
>> '\b(\w+)+\1\b'. 
>>
>> As it turned out, the OP needed to find repeated words, not characters, 
>> so inserting a spacebar space for the second plus sign totally works for 
>> them. 
>>
>> Also, I’m not sure whether you’re suggesting this, but at the end of your 
>> comment you’re talking about the pattern '\b\w*(\w+)\1\b'. Those first zero 
>> or more word characters - \w* - won’t be captured and so won’t be in the 
>> replacement. Is that what you meant? 
>>
>> Best, 
>>
>> — Bruce 
>>
>> _bruce__van_allen__santa_cruz_ca_ 
>>
>>
>> > On Dec 15, 2024, at 3:50 PM, GP <[email protected]> wrote: 
>> > 
>> > First with BBEdit 15.1.3 (15B62, Apple Silicon) I didn't get any error 
>> with ce gm's grep find and replace. 
>> > 
>> > That said, I found that the second + is doing something in the find 
>> and replace operation. 
>> > 
>> > For test text, I used Howard's posted sample records from "Sorting 
>> multiple records in a text file". In the Pattern Playground, with the find 
>> pattern '\b(\w+)+\1\b' (without the quotes) and the replace pattern \1, 7 
>> matches were found: 
>> > 0 -> facilisis 
>> > 1 -> is 
>> > replacement -> is 
>> > 
>> > 0 -> Underhill 
>> > 1 -> l 
>> > replacement -> l 
>> > 
>> > 0 -> 11 
>> > 1 -> 1 
>> > replacement -> 1 
>> > 
>> > 0 -> Afterall 
>> > 1 -> l 
>> > replacement -> l 
>> > 
>> > 0 -> 11 
>> > 1 -> 1 
>> > replacement -> 1 
>> > 
>> > 0 -> 22 
>> > 1 -> 2 
>> > replacement -> 2 
>> > 
>> > 0 -> Afterall 
>> > 1 -> l 
>> > replacement -> l 
>> > 
>> > whereas with the find pattern '\b(\w+)\1\b' (without the second + and 
>> without the quotes) and the same replace pattern, only 3 matches were found: 
>> > 0 -> 11 
>> > 1 -> 1 
>> > replacement -> 1 
>> > 
>> > 0 -> 11 
>> > 1 -> 1 
>> > replacement -> 1 
>> > 
>> > 0 -> 22 
>> > 1 -> 2 
>> > replacement -> 2 
>> > 
>> > According to https://regex101.com's explanation, the difference is due 
>> to how the capturing group in the (\w+)+ part of the regular expression 
>> works: "A repeated capturing group will only capture the last 
>> iteration." So, if I'm not mistaken, (\w+)+ behaves like \w*(\w+), and 
>> the equivalent find grep is \b\w*(\w+)\1\b. That would match any word 
>> consisting of zero or more word characters, followed by a capturing group 
>> of one or more word characters, followed by a single repeat of the 
>> captured characters. According to regex101.com's Regex Debugger, there's 
>> a whole lot of backtracking going on to find all the matches with the 
>> \b\w*(\w+)\1\b grep. 
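
The "last iteration" rule quoted above is easy to see on a tiny input. This is a sketch that assumes Python's standard `re` module rather than BBEdit or regex101. The variable names are only for illustration.

```python
import re

# Quantifier outside the group: each iteration overwrites the previous
# capture, so group 1 holds only the final iteration.
m = re.fullmatch(r"(\w)+", "abc")
last_only = m.group(1)

# Quantifier inside the group: a single iteration captures the whole run.
m2 = re.fullmatch(r"(\w+)", "abc")
whole_run = m2.group(1)

print(last_only, whole_run)  # c abc
```
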
>> > On Saturday, December 14, 2024 at 3:07:35 PM UTC-8 Bruce Van Allen 
>> wrote: 
>> > Hi, 
>> > 
>> > An example of the text and a description of what you’re trying to 
>> accomplish would help. 
>> > 
>> > From your find pattern, I’m guessing you’re trying to find cases where 
>> a string is followed by the same string, to be replaced by just one 
>> instance of the string. 
>> > 
>> > '\b(\w+)+\1\b’ (your original - without the quotes) 
>> > 
>> > Your find pattern’s second plus sign ‘+’ isn’t doing anything, because 
>> the first one, which quantifies the ‘\w’, is grabbing every consecutive 
>> word/alphanumeric character including any repetitions. 
>> > 
>> > Removing that second ‘+', the find pattern '\b(\w+)\1\b’ (without the 
>> quotes) will find a string of word characters followed immediately by the 
>> same string, as in ‘My sentence is abcabc for defdef.’ Using your 
>> replacement pattern of ‘\1’, this will become ‘My sentence is abc for def.’ 
>> > 
>> > Guessing that you’re actually looking for duplicated WORDS, if the 
>> find pattern has a spacebar space ‘ ‘ then it will find any word followed 
>> by a space and then the same exact word, and the replacement will eliminate 
>> the duplication. 
>> > 
>> > With find pattern '\b(\w+) \1\b’, your replacement pattern makes 'My 
>> sentence is abc abc for def def.’ into 'My sentence is abc for def.’ 
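
Both variants can be tried side by side on the example sentences above. This is a sketch that assumes Python's standard `re` module rather than BBEdit, and the helper name `dedupe` is made up for illustration.

```python
import re

def dedupe(pattern: str, text: str) -> str:
    """Replace every match of pattern with its first capture group."""
    return re.sub(pattern, r"\1", text)

# No space: the string is followed immediately by itself.
no_space = dedupe(r"\b(\w+)\1\b", "My sentence is abcabc for defdef.")

# With a space: the same word repeated after a single space.
with_space = dedupe(r"\b(\w+) \1\b", "My sentence is abc abc for def def.")

print(no_space)
print(with_space)
```
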
>> > 
>> > If you want to find a string of word characters that matches an earlier 
>> instance of the same string but separated by more than just a space, your 
>> pattern may be more complicated. 
>> > 
>> > HTH and please clarify if my guesses are wrong. 
>> > 
>> > — Bruce 
>> > 
>> > _bruce__van_allen__santa_cruz_ca_ 
>> > 
>> > 
>> > > On Dec 14, 2024, at 1:43 PM, ce gm <[email protected]> wrote: 
>> > > 
>> > > Hello there, 
>> > > 
>> > > I am doing a GREP search on a .txt file in BBEdit on my Mac. Here are 
>> the find/replace terms: 
>> > > Find: \b(\w+)+\1\b 
>> > > Replace: \1 
>> > > 
>> > > When I input the Find term, it correctly identifies the targets in 
>> the preview (highlights them in yellow). Then, when I push Replace All, I 
>> get a pop up with Application Error Code: 12247 and nothing else. 
>> > > 
>> > > Anyone know what this means? A cursory Google search was not helpful. 
>> > > 
>> > > Thanks! 
>> > > 
>> > > -- 
>> > > This is the BBEdit Talk public discussion group. If you have a 
>> feature request or believe that the application isn't working correctly, 
>> please email "[email protected]" rather than posting here. Follow 
>> @bbedit on Mastodon: <https://mastodon.social/@bbedit> 
>> > > --- 
>> > > You received this message because you are subscribed to the Google 
>> Groups "BBEdit Talk" group. 
>> > > To unsubscribe from this group and stop receiving emails from it, 
>> send an email to [email protected]. 
>> > > To view this discussion visit 
>> https://groups.google.com/d/msgid/bbedit/c9e18d6f-f5c4-467e-9c01-fa4ffbaa5485n%40googlegroups.com.
>>  
>>
>> > 
>> > 

