Re: Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?

Tim Chase Mon, 31 Oct 2011 17:04:42 -0700

On 10/31/11 18:02, Steven D'Aprano wrote:

# Define legal characters:
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
     # everybody forgets about formfeed... \f
     # and are you sure you want to include chr(127) as a text char?


def is_ascii_text(text):
     for c in text:
         if c not in LEGAL:
             return False
     return True


Algorithmically, that's as efficient as possible: there's no faster way
of performing the test, although one implementation may be faster or
slower than another. (PyPy is likely to be faster than CPython, for
example.)

Additionally, if one has some foreknowledge of the character distribution, one might be able to tweak your

def is_ascii_text(text):
     legal = frozenset(LEGAL)
     return all(c in legal for c in text)

with some if/else chain that might be faster than the hashing involved in a set lookup (emphasis on the *might*, not being an expert on CPython internals), such as


def is_ascii_text(text):
     return all(
         (' ' <= c <= '\x7f') or  # same span as range(32, 128) in LEGAL
         c == '\n' or
         c == '\r' or
         c == '\t' or
         c == '\f'
         for c in text)
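
Whether the comparison chain actually beats the set lookup is something to measure rather than guess. The following is only a rough sketch of such a measurement using the standard timeit module. The function names is_ascii_text_set and is_ascii_text_chain and the sample string are made up for illustration, the chain is written to accept exactly the characters in LEGAL, and the numbers will differ between machines and interpreters.

```python
import timeit

# Same legal set as quoted above: chr(32)..chr(127) plus LF, CR, tab, formfeed.
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
LEGAL_SET = frozenset(LEGAL)


def is_ascii_text_set(text):
    # Hash-based membership test for each character.
    return all(c in LEGAL_SET for c in text)


def is_ascii_text_chain(text):
    # Comparison chain; the most common case (printable range) is tested first.
    return all(
        (' ' <= c <= '\x7f') or
        c == '\n' or
        c == '\r' or
        c == '\t' or
        c == '\f'
        for c in text)


if __name__ == '__main__':
    # Hypothetical sample: all-legal text, so neither version can exit early.
    sample = 'The quick brown fox\tjumps over the lazy dog.\r\n' * 100
    for func in (is_ascii_text_set, is_ascii_text_chain):
        elapsed = timeit.timeit(lambda: func(sample), number=100)
        print('%-22s %.3f s' % (func.__name__, elapsed))
```

An all-legal sample is the worst case for both versions; text that fails early would mostly measure call overhead instead.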

But Steven's main points are all spot on: (1) use an O(1) lookup; (2) return at the first sign of trouble; and (3) push it into the C implementation rather than a for-loop. (And the "locals are faster in CPython" tip is something I didn't know.)
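
One other way to get points (2) and (3) together, not proposed in the thread itself, might be a precompiled regular expression: the scan runs inside the re module's C code and search() stops at the first illegal character. This is a minimal sketch assuming Python 3 str input and the same legal set as LEGAL; the name _ILLEGAL is made up for illustration, and no claim is made here about how its speed compares with the versions above.

```python
import re

# Matches any single character that is NOT in chr(32)..chr(127)
# and is not LF, CR, tab, or formfeed.
_ILLEGAL = re.compile(r'[^\x20-\x7f\n\r\t\f]')


def is_ascii_text(text):
    # search() returns None when no illegal character exists,
    # and returns as soon as it finds the first one otherwise.
    return _ILLEGAL.search(text) is None


if __name__ == '__main__':
    print(is_ascii_text('plain text\r\n'))   # True
    print(is_ascii_text('na\xefve'))          # False
```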


-tkc

--
http://mail.python.org/mailman/listinfo/python-list