Yeah. I haven't tried out your code, but an API idea might be to have the 
caller instruct the iterator what to do with any improper/undelimited record at 
the end -- includeUnterminated=False or something. Only the caller really knows 
what is desired.
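
A minimal sketch of that flag, in Python purely for illustration (I have not looked at your code, so the function name `records`, the `delim` parameter, and the snake_case spelling `include_unterminated` are all hypothetical):

```python
from typing import Iterator


def records(data: bytes, delim: bytes = b"\n",
            include_unterminated: bool = False) -> Iterator[bytes]:
    """Yield each delimiter-terminated record in data, delimiter stripped."""
    if not delim:
        raise ValueError("delim must be non-empty")
    start = 0
    while True:
        end = data.find(delim, start)
        if end < 0:
            break
        yield data[start:end]
        start = end + len(delim)
    # Anything left over was never terminated. Only the caller knows
    # whether it is a valid final record or a truncated one.
    if include_unterminated and start < len(data):
        yield data[start:]
```

With `data = b"a\nb\nc"`, the default yields only `a` and `b`; passing `include_unterminated=True` also yields the trailing `c`.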

One observation, algorithmic-performance-wise, is that the fastest substring
searches are usually achieved by using a fast libc SSE/AVX-vectorized `memchr`
to find the _least likely_ character in the substring, then checking for the
full substring only around each hit. Sometimes one knows which character that
is from vague properties of the distribution. For example, when splitting
email messages in a UUCP-style mailbox (mbox) file delimited by '^From ' lines,
`memchr` for the capital 'F' skips over most text most of the time. This idea
may not work in a corpus of emails in ALL CAPS ALL THE TIME, but such a corpus
should be rare. Anyway, this idea is just sort of the simplest/first step into
a wider world of efficient substring search algorithms that you might enjoy
reading about some day.
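
To make the rare-character idea concrete, here is a rough sketch in Python. The name `find_via_rare_byte` and the `rare_index` parameter are made up for this example, and a single-byte `bytes.find` stands in for a direct `memchr` call, so treat it as an illustration of the control flow rather than a tuned implementation. It also assumes the '^From ' delimiter can be approximated by the byte string `b"\nFrom "`.

```python
def find_via_rare_byte(haystack: bytes, needle: bytes,
                       rare_index: int, start: int = 0) -> int:
    """Return the first index >= start where needle occurs, or -1.

    rare_index is the position within needle of the byte the caller
    believes is least likely to appear in haystack.
    """
    if not needle:
        raise ValueError("needle must be non-empty")
    if not 0 <= rare_index < len(needle):
        raise ValueError("rare_index out of range")
    rare = needle[rare_index:rare_index + 1]
    pos = start + rare_index  # earliest place the rare byte could sit
    while True:
        # Cheap scan: skip ahead to the next occurrence of the rare byte.
        hit = haystack.find(rare, pos)
        if hit < 0:
            return -1
        # Costlier check: does the whole needle line up around that hit?
        candidate = hit - rare_index
        if haystack.startswith(needle, candidate):
            return candidate
        pos = hit + 1
```

For the mailbox case one would call something like `find_via_rare_byte(mbox, b"\nFrom ", rare_index=1)`, so the scan jumps from one capital 'F' to the next and only compares the full delimiter at those spots.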
