On 10/31/2011 3:54 PM, pyt...@bdurham.com wrote:
Wondering if there's a fast/efficient built-in way to determine if a
string has non-ASCII chars outside the range ASCII 32-127, CR, LF, or Tab?

I presume you also want to disallow the other ASCII control chars?

I know I can look at the chars of a string individually and compare them
against a set of legal chars using standard Python code (and this works

If by 'string' you mean a string of bytes 0-255, then in Python 3, where iterating bytes yields ints in [0,255], I would make a byte mask of 256 0s and 1s (the byte values 0 and 1, not the characters '0' and '1'). Example:

mask = b'\1\0'*128
for c in b'\0\1help': print(mask[c])

1
0
1
0
1
1

In your case, use \1 for forbidden and replace the print with "if mask[c]: <found illegal>; break"
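
As a minimal sketch of such a mask for the case in the question, assuming the legal bytes are 32-127 plus Tab, LF, and CR (the names ALLOWED, MASK, and first_illegal are only illustrative):

```python
# Legal bytes, as assumed from the question: 32-127 plus Tab, LF, CR.
ALLOWED = set(range(32, 128)) | {9, 10, 13}

# 256-entry mask: byte value 1 marks a forbidden byte, 0 a legal one.
MASK = bytes(0 if i in ALLOWED else 1 for i in range(256))

def first_illegal(data):
    """Return the index of the first forbidden byte in data, or -1."""
    for i, c in enumerate(data):  # Python 3: c is an int in [0, 255]
        if MASK[c]:
            return i
    return -1

print(first_illegal(b'plain text\r\n\tok'))  # -1
print(first_illegal(b'caf\xe9'))             # 3
```

Building the mask from a set of allowed values keeps the rule readable; the lookup itself is still a single index into a bytes object.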

In 2.x, where iterating byte strings gives length 1 byte strings, you would need ord(c) as the index, which is much slower.

fine), but I will be working with some very large files in the 100's Gb
to several Tb size range so I'd thought I'd check to see if there was a
built-in in C that might handle this type of check more efficiently.
Does this sound like a use case for cython or pypy?

Cython should get close to C speed, especially with type hints. Make sure you compile something like the above as Py 3 code.
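
For files that large you would read in chunks rather than all at once. A rough sketch of the kind of function one might hand to Cython, again assuming the legal set is 32-127 plus Tab, LF, and CR; scan_stream and chunk_size are hypothetical names, and the 1 MiB default is an untuned guess:

```python
import io

# Assumed legal bytes: 32-127 plus Tab, LF, CR; 1 in MASK means forbidden.
ALLOWED = set(range(32, 128)) | {9, 10, 13}
MASK = bytes(0 if i in ALLOWED else 1 for i in range(256))

def scan_stream(stream, chunk_size=1 << 20):
    """Return the offset of the first forbidden byte in a binary stream, or -1."""
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:              # end of stream, nothing illegal found
            return -1
        for i, c in enumerate(chunk):
            if MASK[c]:
                return offset + i  # absolute position within the stream
        offset += len(chunk)

print(scan_stream(io.BytesIO(b'line one\nline two\n')))     # -1
print(scan_stream(io.BytesIO(b'ok\x00bad'), chunk_size=2))  # 2
```

With a real file, pass an object opened in binary mode, e.g. open(path, 'rb'), so that each chunk is a bytes object.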

--
Terry Jan Reedy
