On 10/31/2011 7:02 PM, Steven D'Aprano wrote:
On Mon, 31 Oct 2011 17:47:06 -0400, Dave Angel wrote:

On 10/31/2011 03:54 PM, pyt...@bdurham.com wrote:
Wondering if there's a fast/efficient built-in way to determine if a
string has non-ASCII chars outside the range ASCII 32-127, CR, LF, or
Tab?

I know I can look at the chars of a string individually and compare
them against a set of legal chars using standard Python code (and this
works fine), but I will be working with some very large files in the
100s of GB to several TB size range, so I thought I'd check to see if
there was a built-in in C that might handle this type of check more
efficiently.

Does this sound like a use case for cython or pypy?

Thanks,
Malcolm

How about doing a .replace() method call, with all those characters
turning into '', and then see if there's anything left?
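
A minimal sketch of that delete-and-check idea, assuming Python 3 bytes input
and using bytes.translate with a delete argument rather than chained
.replace() calls (the names LEGAL_BYTES and is_ascii_by_deletion are mine,
not from the thread):

LEGAL_BYTES = bytes(range(32, 128)) + b'\r\n\t\f'

def is_ascii_by_deletion(data):
    # translate(None, delete) strips every legal byte in one C-level pass;
    # anything left over must be an illegal character
    return not data.translate(None, LEGAL_BYTES)

print(is_ascii_by_deletion(b'plain text\n'))   # True
print(is_ascii_by_deletion(b'caf\xc3\xa9'))    # False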


No offense Dave, but do you really think that making a copy of as much as
a terabyte of data is *more* efficient than merely scanning the data and
stopping on the first non-ASCII character you see?


There is no way of telling whether a string includes non-ASCII characters
without actually inspecting each character. So in the event that the
string *is* fully ASCII text, you have to check every character; there
can be no shortcuts. However, there is a shortcut if the string isn't
fully ASCII text: once you've found a single non-text character, stop. So
the absolute least amount of work you can do is:

# Define legal characters:
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
     # everybody forgets about formfeed... \f
     # and are you sure you want to include chr(127) as a text char?

def is_ascii_text(text):
     for c in text:
         if c not in LEGAL:
             return False
     return True

If text is a 3.x bytes object, this does not work ;-). Iterating bytes yields
ints, not 1-character strings, so the membership test against a str raises
TypeError. The OP did not specify bytes or unicode, or the Python version.
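
A byte-oriented variant would need a set of ints instead. A sketch, assuming
Python 3 bytes input (the names here are mine):

LEGAL_INTS = frozenset(range(32, 128)) | frozenset(b'\r\n\t\f')

def is_ascii_bytes(data):
    # iterating a bytes object yields ints, so membership is tested
    # against a set of int code points
    for b in data:
        if b not in LEGAL_INTS:
            return False
    return True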


Algorithmically, that's as efficient as possible:

This is a bit strange, since you go on to explain that it is inefficient: O(n*k), where n is the text length and k the number of legal characters, whereas the set version below is O(n).

there's no faster way
of performing the test, although one implementation may be faster or
slower than another. (PyPy is likely to be faster than CPython, for
example.)

But can we get better in Python? Yes, I think so. First off, CPython is
optimized for local variable lookups over globals, and since you are
looking up the global LEGAL potentially 1000000000000 times, even a 1%
saving per lookup will help a lot. So the first step is to make a local
reference, by adding this line just above the for loop:

     legal = LEGAL
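
With that change, the loop version would read (my transcription of the step
Steven describes, using the LEGAL string defined above):

def is_ascii_text(text):
    legal = LEGAL          # bind the global to a fast local once per call
    for c in text:
        if c not in legal:
            return False
    return True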

But we can do even better still. Each time we test for "c not in legal",
we do a linear search of 100 characters: about 50 equality comparisons on
average when the character is legal, and all 100 when it is not. We can do
better by using a set or frozenset, which gives us approximately constant
time lookups:

     legal = frozenset(LEGAL)

Finally, we can try to do as much work as possible in fast C code and as
little as necessary in relatively slow Python:

def is_ascii_text(text):
     legal = frozenset(LEGAL)
     return all(c in legal for c in text)

Since all() is guaranteed to keep short-cut semantics, that will be as
fast as possible in Python,

A dangerous statement to make. 'c in legal' has to get hash(c) and look that up in the hash table, possibly skipping around a bit on collisions. If text is a byte string rather than unicode, a simple lookup 'mask[c]', where mask is a 0-1 byte array, should be faster (see my other post). On my new Pentium Win 7 machine, it is, by about 5%. For 100,000,000 legal bytes, a minimum of 8.69 versus 9.17 seconds.

from time import clock
legal_set = frozenset(range(32, 128))
legal_ray = 128 * b'\1'   # byte-indexed 0-1 table; all of 0-127 marked legal,
                          # which is fine here since the text is only b'a'
illegal = 128 * b'\0'     # inverse table; only testing legal char 'a'
text = b'a' * 100000000
print(clock())
print(all(c in legal_set for c in text), clock())  # set lookup, min 9.17 sec
t = clock()
print(all(legal_ray[c] for c in text), clock()-t)  # table lookup, min 8.69 sec
##for c in text:
##    if illegal[c]: print(False); break  # explicit loop, slower, about 9.7 sec
##print(True, clock())

The explicit loop took about 9.7 seconds. It is more flexible as it could detect the position of the first bad character, or of all of them.
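
For example (my sketch, not part of the timing run above):

# 256-entry mask marking the legal byte values
MASK = bytearray(256)
for b in bytes(range(32, 128)) + b'\r\n\t\f':
    MASK[b] = 1

def first_bad_byte(data):
    for i, c in enumerate(data):
        if not MASK[c]:
            return i       # offset of the first illegal byte
    return -1              # no illegal bytes found

print(first_bad_byte(b'fine text\n'))    # -1
print(first_bad_byte(b'bad \x00 byte'))  # 4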

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
