That search pattern would start at the beginning of the text, grab the first 
200 characters and search the rest of the text for that, starting with the very 
next (201st) character, and trying every 200-char sequence from there to the 
end. Then it would progress one character forward to character 2, grab it and 
the next 199, and repeat that search. 

Moving ahead one character at a time through the million characters until it 
finds a match or until the text no longer has 200 characters left would 
certainly take some processing time!
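To make the cost concrete, here is a minimal Python sketch of the procedure described above. It is only an illustration, not what BBEdit's regex engine literally runs: the function name first_repeat_naive and the parameter n are made up for this example, and Python's str.find is far quicker than regex backtracking. The shape of the work is the same, though: one full scan of the remaining text for every starting position.

```python
def first_repeat_naive(text: str, n: int = 200):
    """Return (start, later_start) for the first n-char window that
    occurs again after its own end, or None if there is none."""
    # Try every starting position that still leaves room for a second copy.
    for start in range(len(text) - 2 * n + 1):
        window = text[start:start + n]
        # Scan the rest of the text for an identical window.
        later = text.find(window, start + n)
        if later != -1:
            return start, later
    return None


# Tiny demonstration with n=3 instead of 200.
print(first_repeat_naive("abcXYZabc", 3))  # (0, 6)
```

With a million characters and no repetition near the start, the outer loop runs close to a million times, and each pass rescans most of the file.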

Are the long strings always single sentences? If so, your pattern would be 
slightly more efficient if it didn’t accept the end-of-sentence character 
(“period”, “full stop”, “dot”).

(?s)([^.]{200}).*\1

(Inside the character class brackets ‘.’ just means dot, not “any character”.)
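A quick way to check that point is with Python's re module, used here only as a stand-in: BBEdit uses a PCRE-style engine, but as far as I know the two agree on how a dot behaves inside brackets. The sample string is made up.

```python
import re

sample = "a.b"

# Outside brackets, '.' matches any character (newline excepted by default).
any_char = re.findall(r".", sample)

# Inside brackets, '.' is just a literal dot.
literal_dot = re.findall(r"[.]", sample)

# A negated class matches every character except a literal dot.
not_dot = re.findall(r"[^.]", sample)

print(any_char)     # ['a', '.', 'b']
print(literal_dot)  # ['.']
print(not_dot)      # ['a', 'b']
```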

Given that the 200 is an arbitrary parameter (that is, you’re not looking for a 
string you already know to be exactly that length), note that the string 
captured above does NOT include the end-of-sentence character.

Assuming standard English/European writing practice, the sentences could 
probably also be expected to start with an upper-case alpha character after a 
whitespace character, so the pattern would be faster as:

(?s)(\s[A-Z][^.]{200}).*\1

But the above suggestions won’t help much if you’re searching for strings with 
multiple sentences.
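If the regex route stays too slow, one alternative outside BBEdit (not a regex, and not something the patterns above do) is to remember every window already seen and look each new one up. The sketch below is only an illustration under that assumption: the name repeated_windows is hypothetical, and it stores a hash per window rather than the window itself to keep memory down, then rechecks the actual text to rule out hash collisions.

```python
def repeated_windows(text: str, n: int = 200):
    """Yield (first_start, repeat_start) for each n-char window that was
    already seen earlier in the text without overlapping it."""
    seen = {}  # hash of window -> offset where that hash was first seen
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        first = seen.setdefault(hash(window), start)
        # Require a real, non-overlapping match (guards against collisions).
        if first + n <= start and text[first:first + n] == window:
            yield first, start


# Tiny demonstration with n=3 instead of 200.
print(list(repeated_windows("abcXYZabc", 3)))  # [(0, 6)]
```

This makes a single pass over the text. A repeated passage longer than n is reported once per overlapping window, so a 337-character repeat with n=200 would show up as a run of consecutive hits.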

HTH

_bruce__van_allen__santa_cruz_ca_
_831_429_1688_p_
_831_332_3649_c_

> On Apr 25, 2022, at 8:42 AM, samar <arnet...@bluewin.ch> wrote:
> 
> Hi all
> 
> While copyediting a text for a scholarly book (500+ pages when printed), I 
> noticed that the author wrote exactly the same long sentence (= an identical 
> string of 337 characters) once on page 23 and once on page 326. No doubt this 
> happened because the author copied and pasted some text from his notes, 
> unaware that he had already copied and pasted the same text earlier. I 
> thought it would be a good idea to find out whether this has happened to the 
> author more than one time in his 1,000,000-character book, so that I can 
> alert him (to give him a chance to omit the repetition).
> 
> And so I turned to BBEdit. The text of the whole book is now in a txt file. 
> When I search for the sentence that in the Word document is on page 23, I can 
> find it in BBEdit both in paragraph 117 and in paragraph 7831. What regular 
> expression can I use to find other such repetitions?
> 
> I tried using the following string:
> 
> (?s)(.{200}).*?\1
> 
> This is what I understand it to mean (roughly):
> 
> (?s): search across paragraphs
> (.{200}).*?: search for, and capture, a string of 200 characters, optionally 
> followed by any characters
> \1: stop the search as soon as you reach a second instance of the captured 
> string
> 
> The string does what I need if I replace 200 with a shorter number, such as 
> 10 (but in this case BBEdit finds a lot of unproblematic repetitions, of 
> course). Given that the sentence I have in mind is more than 300 characters 
> long I should even have been able to use 300 instead of just 200.
> 
> Unfortunately, however, something seems to be amiss: BBEdit kept on searching 
> and searching, without finding anything, and my notebook started fanning, and 
> after about 20 minutes it became clear that nothing would happen, and that I 
> cannot do anything else but to Force Quit BBEdit.
> 
> So my question is, what's wrong with the above string? How else can I find a 
> repeated 200-character sentence in a large text file?
> 
> Thanks
> Sam
> -- 
> This is the BBEdit Talk public discussion group. If you have a feature 
> request or need technical support, please email "supp...@barebones.com" 
> rather than posting here. Follow @bbedit on Twitter: 
> <https://twitter.com/bbedit>
> --- 
> You received this message because you are subscribed to the Google Groups 
> "BBEdit Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to bbedit+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com.
