One way to accomplish that is to set up a text factory with three Canonize 
steps. 

The first Canonize step, say a DoubleWordSanitizerIgnores.txt file, holds the 
one or more patterns that match the double words you want to exclude from the 
double word finding-and-fixing step, and wraps each match in a fix-up, ignore 
add-on marker.

The second Canonize step, say a DoubleWordSanitizer.txt file, holds the one 
or more patterns that find the double words you want to replace with single 
words; the fixed-up excludes from the first Canonize step won't match these 
patterns.

The third and final Canonize step, say a DoubleWordSanitizerCleanup.txt 
file, holds the pattern that finds all the fixed-up ignores/excludes from 
the first Canonize step and removes the fix-up add-ons, restoring those bits 
of text to the original.

For the first Canonize list of ignore/exclude patterns, it's probably best 
to start with a fairly simple list of double word patterns; keeping it 
simple makes it easier to debug and to see exactly what you're 
ignoring/excluding. Something like:

(\s)(many,\smany)(\s) \1%%\2%%\3
(\s)(very,\svery)(\s) \1%%\2%%\3

where I'm using %% as the fix-up add-on that excludes those double word 
occurrences from the next step's double word sanitizing.

(I'm capturing the leading and trailing white space to handle edge cases 
like line feeds, and I'm not using word boundaries, \b, so I don't have to 
deal with non-word characters butting up against word characters.)
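
To see the first step in isolation, here's a quick sketch using Python's re 
module (its syntax matches BBEdit's grep closely enough for these patterns); 
the sample line is borrowed from the question below:

```python
import re

# One line from the sample transcript, with a "very, very" to protect.
line = "or even the greater distance away from zero, is very, very small. "

# Wrap the protected double word in %% fix-up add-ons.
marked = re.sub(r"(\s)(very,\svery)(\s)", r"\1%%\2%%\3", line)
print(marked)
```

The marked line comes out as "...is %%very, very%% small. ", which the 
step-two patterns can no longer match.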

For the second Canonize list of patterns, which find the double words and 
replace them with single words, one or more grep patterns like:

(\s)(\w+)\s\2(\s) \1\2\3
(\s)(\w+),\s\2(\s) \1\2\3

(The trailing white space is captured here, too; without it, \2 can match 
just the start of the following word, so "the theory" would collapse to 
"theory".)

For the final Canonize step, to clean up the fix-up add-ons added in the 
first ignore/exclude step, a grep pattern like:

%%(.+?)%% \1

(The +? keeps the match non-greedy, so two marked phrases on the same line 
get cleaned up separately.)
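
Put together, the three steps behave like this quick Python sketch (Python's 
re stands in for BBEdit's grep here; the patterns are essentially the ones 
above, with trailing white space captured in step two as well, and the 
sample text is from the question below):

```python
import re

# Sample text from the question: fix "what, what" but keep "very, very".
text = ("So when we talk about the structure of data that describes "
        "what, what identifies our columns, is very, very small.\n")

# Step 1 -- mark the doubles to keep with %% fix-up add-ons.
for pattern in (r"(\s)(many,\smany)(\s)", r"(\s)(very,\svery)(\s)"):
    text = re.sub(pattern, r"\1%%\2%%\3", text)

# Step 2 -- collapse the remaining doubles to single words.
text = re.sub(r"(\s)(\w+)\s\2(\s)", r"\1\2\3", text)
text = re.sub(r"(\s)(\w+),\s\2(\s)", r"\1\2\3", text)

# Step 3 -- strip the %% add-ons, restoring the protected phrases.
text = re.sub(r"%%(.+?)%%", r"\1", text)

print(text)
```

The result keeps "very, very" intact while "what, what" is collapsed to a 
single "what".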

I suggest you start with some sample text, and run each Canonize step on it 
individually (from Text -> Canonize...) to check that the patterns in each 
canon file do what they're supposed to do. Then combine the steps into a 
text factory and recheck the combined operation.

On Monday, November 10, 2025 at 9:58:50 AM UTC-8 GWied wrote:

> Another "occasional user" question - with not enough time to learn all the 
> cool tools bbedit and regex offer that would solve my problem. 
> context: I have about 500 KB of text (as 43 .txt files) video transcripts 
> to edit/refine. All files have timecode removed.
>
> I want to find all instances of doubled words, but omit/ignore a subset of 
> those matches, i.e., search for doubled words in a video transcript, but 
> *EXCLUDE 
> "many, many" and "very, very"*. In effect, this will reduce instances of 
> stuttering in a video transcript, but leave the intentional repeats intact. 
>
> This search string finds doubled words separated by a comma and a space, 
> which satisfies most of the instances of doubled words:
> (\b[A-Za-z]+\b),\s\1 
>
> replace with
> \1 
> e.g., find "*what, what*" and replace with "*what*" in the string:
> *So when we talk about the structure of data that describes what, what 
> identifies our columns*
>
> But do not replace "*very, very*" in the string:
> *or even the greater distance away from zero, is **very, very **small. *
>
>
> Thank you for any hints on doing this.
>
> Glenn
>
>  
>

-- 
This is the BBEdit Talk public discussion group. If you have a feature request 
or believe that the application isn't working correctly, please email 
"[email protected]" rather than posting here. Follow @bbedit on Mastodon: 
<https://mastodon.social/@bbedit>
--- 
You received this message because you are subscribed to the Google Groups 
"BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/d/msgid/bbedit/16e55a96-d19d-4f86-b8cc-17e77b192212n%40googlegroups.com.
