Re: Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?

Terry Reedy Mon, 31 Oct 2011 16:15:50 -0700

On 10/31/2011 3:54 PM, [email protected] wrote:

Wondering if there's a fast/efficient built-in way to determine if a
string has non-ASCII chars outside the range ASCII 32-127, CR, LF, or Tab?


I presume you also want to disallow the other ASCII control chars?

I know I can look at the chars of a string individually and compare them
against a set of legal chars using standard Python code (and this works

If, by 'string', you mean a string of bytes 0-255, then I would, in Python 3, where bytes contain ints in [0,255], make a byte mask of 256 0s and 1s (not '0's and '1's). Example:


mask = b'\1\0'*128  # 256 entries: 1 at even byte values, 0 at odd ones
for c in b'\0\1help': print(mask[c])

1
0
1
0
1
1

In your case, use \1 for forbidden and replace the print with "if mask[c]: <found illegal>; break"
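
For the case asked about, a minimal sketch of such a mask might look like the following. It assumes the legal bytes are Tab, LF, CR, and 32-127 inclusive, as stated in the question (127 is DEL, so the upper bound may really want to be 126). The names ALLOWED, MASK, and first_illegal are made up for illustration.

```python
# Legal byte values, assumed from the question: Tab, LF, CR, and 32..127.
ALLOWED = {9, 10, 13} | set(range(32, 128))

# 256-entry mask: 0 for a legal byte value, 1 for a forbidden one.
MASK = bytes(0 if i in ALLOWED else 1 for i in range(256))

def first_illegal(data):
    """Return the index of the first forbidden byte in data, or -1 if none."""
    for i, c in enumerate(data):  # in Python 3, c is an int in [0, 255]
        if MASK[c]:
            return i
    return -1
```

With this, first_illegal(b'help\r\n') returns -1 and first_illegal(b'he\x00lp') returns 2.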

In 2.x, where iterating byte strings gives length 1 byte strings, you would need ord(c) as the index, which is much slower.

fine), but I will be working with some very large files in the 100's Gb
to several Tb size range so I'd thought I'd check to see if there was a
built-in in C that might handle this type of check more efficiently.
Does this sound like a use case for cython or pypy?

Cython should get close to C speed, especially with hints. Make sure you compile something like the above as Py 3 code.
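
If a compiled extension is more than you want, one pure-Python alternative is to let bytes.translate do the filtering in C, one chunk at a time. This is a different technique from the mask loop, not a Cython version of it. A minimal sketch, assuming the same legal set as in the question (Tab, LF, CR, 32-127); the names LEGAL and has_illegal_bytes and the 1 MiB chunk size are hypothetical choices, and I have not timed this on files of the size you mention.

```python
# Assumption: legal bytes are Tab, LF, CR, and 32..127, as in the question.
LEGAL = bytes([9, 10, 13]) + bytes(range(32, 128))

def has_illegal_bytes(path, chunk_size=1 << 20):
    """Return True if the file at path contains any byte outside LEGAL."""
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file reached without finding anything
                return False
            # translate(None, LEGAL) deletes every legal byte in C code;
            # anything left over is a forbidden byte.
            if chunk.translate(None, LEGAL):
                return True
```

Reading in fixed-size chunks keeps memory use flat regardless of file size, and the function stops at the first chunk that contains a forbidden byte.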


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
