Hi all,

Apologies, I didn't realize my correspondence with Bruce hadn't been added to the whole thread!

My use case for this grep search: I am converting PDFs to .txt files for text mining. During the conversion, many of the files ended up with duplicated words. As an example:
Here isis anan exampleexample of the way inin whichwhich somesome of the texttext wouldwould be transferred to .txt

I was trying to use a grep search to find the duplicated words and replace each with a single instance of that word, that is, take "exampleexample" and change it to "example".

Bruce's solution, '\b(\w+)\1\b', did end up working for me. I had also contacted BBEdit's help service, and they said that error code 12247 is a "match limit exceeded" error and suspected it was due to the number of instances the original grep search was finding.

Thanks, all!

On Sunday, December 15, 2024 at 9:35:02 PM UTC-5 GP wrote:

> Yes, that's the way the original '\b(\w+)+\1\b' works in practice. The first zero or more word characters aren't captured but are included in the match, so they get deleted when the replacement uses just the \1 capture group. Since \w* and (?:\w+)* are match-equivalent in practice, perhaps the expression '\b(?:\w+)*(\w+)\1\b' will better explain how it is just the last iteration match of (\w+) in the (\w+)+\1 expression that is captured, and how any and all of the preceding \w+ matches, if any, are discarded as captures and aren't included in the one, final capture group.
>
> Take for example the word facilisis. The regular expression engine ends up finding a leading match on - facil -, a group 1 capturing match on - is -, and a non-capturing match of the backreference to capture group 1 on - is -. The whole matched word then gets replaced by just the capture group 1 string of 'is' (without the quotes). Your guess is as good as mine as to how much string slicing and dicing, capturing, and capture discarding the engine performs before arriving at that match and capture group solution.
>
> That said, I flubbed the copy and paste in the last part of that comment discussing backtracking. I intended to use '\b(\w+)+\1\b' for the backtracking comment but instead copied and pasted '\b\w*(\w+)\1\b'.
> As it turns out, both have a whole lot of backtracking, but '\b\w*(\w+)\1\b' has slightly less backtracking than '\b(\w+)+\1\b' on the example search text I was using.
>
> On Sunday, December 15, 2024 at 4:14:09 PM UTC-8 Bruce Van Allen wrote:
>
>> Thanks for digging into the regex meaning of that second ‘+’ in '\b(\w+)+\1\b'.
>>
>> As it turned out, the OP needed to find repeated words, not characters, so inserting a spacebar space for the second plus sign totally works for them.
>>
>> Also, I’m not sure you’re suggesting this, but at the end of your comment you’re talking about the pattern '\b\w*(\w+)\1\b'. That first zero or more word characters - \w* - won’t be captured and so won’t be in the replacement pattern. Is that what you meant?
>>
>> Best,
>>
>> -- Bruce
>>
>> _bruce__van_allen__santa_cruz_ca_
>>
>> > On Dec 15, 2024, at 3:50 PM, GP <[email protected]> wrote:
>> >
>> > First, with BBEdit 15.1.3 (15B62, Apple Silicon) I didn't get any error with ce gm's grep find and replace.
>> >
>> > That said, however, I found the second + is doing something in the find and replace operation.
>> >
>> > Using Howard's posted sample records test from the "Sorting multiple records in a text file" for testing text.
>> > Using the Pattern Playground with the find: '\b(\w+)+\1\b' (without the quotes) and replace: \1 pattern, 7 matches were found:
>> >
>> > 0 -> facilisis
>> > 1 -> is
>> > replacement -> is
>> >
>> > 0 -> Underhill
>> > 1 -> l
>> > replacement -> l
>> >
>> > 0 -> 11
>> > 1 -> 1
>> > replacement -> 1
>> >
>> > 0 -> Afterall
>> > 1 -> l
>> > replacement -> l
>> >
>> > 0 -> 11
>> > 1 -> 1
>> > replacement -> 1
>> >
>> > 0 -> 22
>> > 1 -> 2
>> > replacement -> 2
>> >
>> > 0 -> Afterall
>> > 1 -> l
>> > replacement -> l
>> >
>> > whereas, with the find: '\b(\w+)\1\b' (without the second + and without the quotes) and same replace pattern, only 3 matches were found:
>> >
>> > 0 -> 11
>> > 1 -> 1
>> > replacement -> 1
>> >
>> > 0 -> 11
>> > 1 -> 1
>> > replacement -> 1
>> >
>> > 0 -> 22
>> > 1 -> 2
>> > replacement -> 2
>> >
>> > According to https://regex101.com's explanation, the difference is due to the capturing group workings of the (\w+)+ part of the regular expression: "A repeated capturing group will only capture the last iteration." So, if I'm not mistaken, the workings of (\w+)+ is equivalent to \w*(\w+) and the equivalent find grep is \b\w*(\w+)\1\b . That would match any word string containing zero or more word characters followed by a capturing group of one or more word characters followed by a single repeat of the captured group of characters. According to regex101.com's Regex Debugger there's a whole lot of backtracking going on to find all the matches with the \b\w*(\w+)\1\b grep.
>> >
>> > On Saturday, December 14, 2024 at 3:07:35 PM UTC-8 Bruce Van Allen wrote:
>> >
>> > Hi,
>> >
>> > An example of the text and a description of what you’re trying to accomplish would help.
>> >
>> > From your find pattern, I’m guessing you’re trying to find cases where a string is followed by the same string, to be replaced by just one instance of the string.
>> >
>> > '\b(\w+)+\1\b' (your original - without the quotes)
>> >
>> > Your find pattern’s second plus sign ‘+’ isn’t doing anything, because the first one, which quantifies the ‘\w’, is grabbing every consecutive word/alphanumeric character including any repetitions.
>> >
>> > Removing that second ‘+’, the find pattern '\b(\w+)\1\b' (without the quotes) will find a string of word characters followed immediately by the same string, as in ‘My sentence is abcabc for defdef.’ Using your replacement pattern of ‘\1’, this will become ‘My sentence is abc for def.’
>> >
>> > Guessing that you’re actually looking for duplicated WORDS, if the find pattern has a spacebar space ‘ ’ then it will find any word followed by a space and then the same exact word, and the replacement will eliminate the duplication.
>> >
>> > With find pattern '\b(\w+) \1\b', your replacement pattern makes 'My sentence is abc abc for def def.' into 'My sentence is abc for def.'
>> >
>> > If you want to find a string of word characters that matches an earlier instance of the same string but separated by more than just a space, your pattern may be more complicated.
>> >
>> > HTH and please clarify if my guesses are wrong.
>> >
>> > -- Bruce
>> >
>> > _bruce__van_allen__santa_cruz_ca_
>> >
>> > > On Dec 14, 2024, at 1:43 PM, ce gm <[email protected]> wrote:
>> > >
>> > > Hello there,
>> > >
>> > > I am doing a GREP search on a .txt file in BBEdit on my Mac. Here are the find/replace terms:
>> > > Find: \b(\w+)+\1\b
>> > > Replace: \1
>> > >
>> > > When I input the Find term, it correctly identifies the targets in the preview (highlights them in yellow). Then, when I push Replace All, I get a pop-up with Application Error Code: 12247 and nothing else.
>> > >
>> > > Anyone know what this means? A cursory Google search was not helpful.
>> > >
>> > > Thanks!
>> > >
>> > > --
>> > > This is the BBEdit Talk public discussion group.
>> > > If you have a feature request or believe that the application isn't working correctly, please email "[email protected]" rather than posting here. Follow @bbedit on Mastodon: <https://mastodon.social/@bbedit>
>> > > ---
>> > > You received this message because you are subscribed to the Google Groups "BBEdit Talk" group.
>> > > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> > > To view this discussion visit https://groups.google.com/d/msgid/bbedit/c9e18d6f-f5c4-467e-9c01-fa4ffbaa5485n%40googlegroups.com.
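
For anyone who wants to try the two patterns from this thread outside BBEdit, here is a minimal sketch in Python. It assumes that Python's built-in re module treats these patterns the same way BBEdit's grep engine does on short inputs like these, which I have not checked against BBEdit itself. The helper name collapse_doubled_words is hypothetical and used only for illustration.

```python
import re

# Pattern from the thread: a run of word characters immediately followed
# by the same run, with word boundaries on both sides.
DOUBLED = re.compile(r"\b(\w+)\1\b")


def collapse_doubled_words(text: str) -> str:
    """Replace each doubled run (e.g. 'exampleexample') with one copy.

    Hypothetical helper for illustration only. Note that it also
    shortens legitimate strings such as '11' or 'mama'.
    """
    return DOUBLED.sub(r"\1", text)


sample = (
    "Here isis anan exampleexample of the way inin whichwhich "
    "somesome of the texttext wouldwould be transferred to .txt"
)
print(collapse_doubled_words(sample))

# The original pattern with the second '+': a repeated capturing group
# keeps only its last iteration, so group 1 can end up as a short suffix
# while the whole word is still part of the match.
m = re.search(r"\b(\w+)+\1\b", "facilisis")
if m:
    print(m.group(0), "->", m.group(1))
```

On the sample sentence this should print the de-duplicated sentence, and for facilisis it should report group 1 as is, in line with GP's Pattern Playground result. The nested quantifier in (\w+)+ can backtrack heavily on long words, so I would only run that second pattern on short test strings.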
