The online demo you linked to, Sam, does exactly what I need! I've found a 
few other duplicated passages in the file that clearly should not be there. 
Thank you all for your thoughts and input.

On Monday, April 25, 2022 at 8:38:09 PM UTC+2 Sam Hathaway wrote:

> This sounds like a case of the Longest repeated substring problem 
> <https://en.wikipedia.org/wiki/Longest_repeated_substring_problem>. 
> Regular expressions are not the right tool for the job, unfortunately.
>
> There’s an online demo 
> <https://daniel-hug.github.io/longest-repeated-substring/> that might do 
> what you need.
>
> If you want to find all long repeated substrings, you can take an 
> iterative approach: find the longest, remove the duplicates from the source 
> text, and again find the longest.
>
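
For anyone who would rather script the iterative approach described above, 
here is a rough Python sketch. It is not what the online demo does 
internally; the function names (find_repeat, longest_repeat, 
all_long_repeats), the sample text, and the choice to replace later copies 
with a line break are all my own assumptions for illustration. It keeps 
every window of text as a dictionary key, so it is meant for experimenting 
rather than as a tuned tool.

```python
def find_repeat(text, length):
    """Return (first, second) offsets of a passage of `length` characters
    that occurs twice without overlapping, or None if there is none."""
    seen = {}  # window text -> offset of its earliest occurrence
    for i in range(len(text) - length + 1):
        first = seen.setdefault(text[i:i + length], i)
        if i - first >= length:
            return first, i
    return None


def longest_repeat(text, min_length):
    """Return (length, first, second) for the longest non-overlapping repeat
    of at least `min_length` characters, or None."""
    if find_repeat(text, min_length) is None:
        return None
    # If a repeat of some length exists, repeats of every shorter length
    # exist too, so double the length until it fails, then bisect.
    lo, hi = min_length, 2 * min_length
    limit = len(text) // 2  # two non-overlapping copies must both fit
    while hi <= limit and find_repeat(text, hi):
        lo, hi = hi, 2 * hi
    hi = min(hi, limit + 1)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if find_repeat(text, mid):
            lo = mid
        else:
            hi = mid
    return (lo,) + find_repeat(text, lo)


def all_long_repeats(text, min_length):
    """Find the longest repeat, drop its later copies, and search again.
    Assumes min_length >= 2 so that every round shortens the text."""
    found = []
    while True:
        hit = longest_repeat(text, min_length)
        if hit is None:
            return found
        length, first, _ = hit
        passage = text[first:first + length]
        found.append(passage)
        # Keep the first copy; replace later copies with a line break.
        head, tail = text[:first + length], text[first + length:]
        text = head + tail.replace(passage, "\n")


sample = ("the quick brown fox. lorem ipsum. "
          "the quick brown fox. dolor.")
for passage in all_long_repeats(sample, 10):
    print(repr(passage))
```

Called on the text of a book with min_length set to something like 200, 
this would list each long duplicated passage once, longest first.
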
> Hope this helps,
> -sam
>
> On 25 Apr 2022, at 11:42, samar wrote:
>
> Hi all
>
> While copyediting a text for a scholarly book (500+ pages when printed), I 
> noticed that the author wrote exactly the same long sentence (= an 
> identical string of 337 characters) once on page 23 and once on page 326. 
> No doubt this happened because the author copied and pasted some text from 
> his notes, unaware that he had already copied and pasted the same text 
> earlier. I thought it would be a good idea to find out whether this has 
> happened to the author more than one time in his 1,000,000-character book, 
> so that I can alert him (to give him a chance to omit the repetition).
>
> And so I turned to BBEdit. The text of the whole book is now in a txt 
> file. When I search for the sentence that in the Word document is on page 
> 23, I can find it in BBEdit both in paragraph 117 and in paragraph 7831. 
> What regular expression can I use to find other such repetitions?
>
> I tried using the following string:
>
> (?s)(.{200}).*?\1
>
> This is what I understand it to mean (roughly):
>
> (?s): let the dot match line breaks too, so that a match can span 
> paragraphs
> (.{200}).*?: capture a string of 200 characters, then skip ahead over as 
> few characters as possible
> \1: match a second instance of the captured string
>
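
To see the three pieces of the pattern at work on something small, here is 
a short Python sketch. It assumes Python's re module rather than BBEdit's 
own grep engine (the two treat (?s), lazy quantifiers, and \1 the same way 
as far as I know), and both the sample text and the window of 20 
characters are made up for the example.

```python
import re

# A made-up text in which one sentence was pasted twice.
text = (
    "First paragraph with some unique words!\n"
    "A sentence that was pasted twice by accident.\n"
    "Unrelated middle paragraph.\n"
    "A sentence that was pasted twice by accident.\n"
)

# (?s)      the dot also matches line breaks
# (.{20})   capture any 20 characters as group 1
# .*?       then skip ahead over as few characters as possible
# \1        until the text captured in group 1 appears again
pattern = re.compile(r"(?s)(.{20}).*?\1")

match = pattern.search(text)
repeated = match.group(1)
print(repr(repeated))        # the 20 captured characters
print(text.count(repeated))  # how many times they occur in the text
```

Note that the match starts at the first position where any 20 characters 
recur, which need not be the start of the sentence.
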
> The string does what I need if I replace 200 with a smaller number, such 
> as 10 (but in this case BBEdit finds a lot of unproblematic repetitions, of 
> course). Given that the sentence I have in mind is more than 300 characters 
> long, I should even have been able to use 300 instead of just 200.
>
> Unfortunately, however, something seems to be amiss: BBEdit kept on 
> searching and searching without finding anything, my notebook's fan 
> started spinning, and after about 20 minutes it became clear that nothing 
> would happen and that I had no choice but to Force Quit BBEdit.
>
> So my question is, what's wrong with the above string? How else can I find 
> a repeated 200-character sentence in a large text file?
>
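
A likely explanation for the hang: at every starting position that has no 
later copy, a backtracking regex engine has to try the backreference at 
every later position before giving up, so on a 1,000,000-character file 
the work is on the order of 1,000,000 x 1,000,000 / 2 attempts in the 
worst case. With 10 instead of 200 a repeat turns up almost immediately, 
which is why the short version returns quickly.

A script can do the same job in a single pass by remembering every window 
it has already seen. Below is a minimal Python sketch under my own 
assumptions: find_repeats and the sample text are hypothetical, and the 
window length would be 200 for the book. It stores each window as a 
dictionary key, so memory use grows roughly with text length times window 
length; storing hashes of the windows instead would reduce that.

```python
def find_repeats(text, length=200):
    """Yield (first, second, n): a passage of n >= length characters at
    offset `first` that occurs again, without overlapping, at `second`."""
    seen = {}  # window text -> offset of its earliest occurrence
    i = 0
    last = len(text) - length
    while i <= last:
        first = seen.setdefault(text[i:i + length], i)
        if i - first >= length:
            # Extend the match for as long as both copies keep agreeing.
            n = length
            while (i + n < len(text) and first + n < i
                   and text[first + n] == text[i + n]):
                n += 1
            yield first, i, n
            i += n  # continue after the repeated passage
        else:
            i += 1


sample = ("Intro. This sentence was pasted twice. "
          "Something else entirely! This sentence was pasted twice. "
          "The end.")
for first, second, n in find_repeats(sample, length=20):
    print(first, second, repr(sample[first:first + n]))
```

For the real file, one would read the text with open(...).read() and call 
find_repeats(text, 200); the offsets can then be used to locate each 
passage in BBEdit.
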
> Thanks
> Sam
>
> --
> This is the BBEdit Talk public discussion group. If you have a feature 
> request or need technical support, please email "sup...@barebones.com" 
> rather than posting here. Follow @bbedit on Twitter: 
> <https://twitter.com/bbedit>
> ---
> You received this message because you are subscribed to the Google Groups 
> "BBEdit Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to bbedit+un...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
>
