On Mon, Oct 8, 2012 at 4:11 PM, Prasad, Ramit <ramit.pra...@jpmorgan.com> wrote:
>
>> for ch in text:
>>      if '0' <= ch <= '9':
>>          doSomething(ch)
>
> I am not sure that will work very well with Unicode numbers. I would
> assume (you know what they say about assuming) that str.isdigit()
> works better with international characters/numbers.

In my tests below, isdigit() matches both decimal digits (category
'Nd') and other digits (category 'No'). None of the 'No' digits work
with int(), so passing isdigit() does not guarantee that a character
can be converted.
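
If the goal is to keep only the characters that int() will accept,
one option is str.isdecimal() instead of isdigit(). The following is
a minimal sketch, assuming Python 3; the helper name decimal_digits
and the sample string are made up for illustration.

```python
def decimal_digits(text):
    """Return the int value of each character in text that isdecimal() accepts."""
    # isdecimal() is true only for decimal-digit characters, which are
    # the ones int() can convert.
    return [int(ch) for ch in text if ch.isdecimal()]

# 'a', ASCII 1, ARABIC-INDIC DIGIT THREE, SUPERSCRIPT TWO,
# DINGBAT NEGATIVE CIRCLED DIGIT ONE
sample = "a1\u0663\u00b2\u2776"

print([ch.isdigit() for ch in sample])  # [False, True, True, True, True]
print(decimal_digits(sample))           # [1, 3]
```

The superscript and the dingbat pass isdigit() but are dropped here,
so int() is never called on a character it would reject.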

Python 2.7.3

    >>> import sys
    >>> from unicodedata import category
    >>> chars = [unichr(i) for i in xrange(sys.maxunicode + 1)]
    >>> digits = [c for c in chars if c.isdigit()]
    >>> digits_d = [d for d in digits if category(d) == 'Nd']
    >>> digits_o = [d for d in digits if category(d) == 'No']
    >>> len(digits), len(digits_d), len(digits_o)
    (529, 411, 118)

Decimal

    >>> nums = [int(d) for d in digits_d]
    >>> [nums.count(i) for i in range(10)]
    [41, 42, 41, 41, 41, 41, 41, 41, 41, 41]

Other

    >>> print u''.join(digits_o[:3] + digits_o[12:56])
    ²³¹⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉①②③④⑤⑥⑦⑧⑨⑴⑵⑶⑷⑸⑹⑺⑻⑼⒈⒉⒊⒋⒌⒍⒎⒏⒐
    >>> print u''.join(digits_o[67:94])
    ❶❷❸❹❺❻❼❽❾➀➁➂➃➄➅➆➇➈➊➋➌➍➎➏➐➑➒
    >>> print u''.join(digits_o[3:12])
    ፩፪፫፬፭፮፯፰፱

    >>> int(digits_o[67])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'decimal' codec can't encode character
    u'\u2776' in position 0: invalid decimal Unicode string


Python 3.2.3

    >>> import sys
    >>> from unicodedata import category
    >>> chars = [chr(i) for i in range(sys.maxunicode + 1)]
    >>> digits = [c for c in chars if c.isdigit()]
    >>> digits_d = [d for d in digits if category(d) == 'Nd']
    >>> digits_o = [d for d in digits if category(d) == 'No']
    >>> len(digits), len(digits_d), len(digits_o)
    (548, 420, 128)

Decimal

    >>> nums = [int(d) for d in digits_d]
    >>> [nums.count(i) for i in range(10)]
    [42, 42, 42, 42, 42, 42, 42, 42, 42, 42]

Other

    >>> print(*(digits_o[:3] + digits_o[13:57]), sep='')
    ²³¹⁰⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉①②③④⑤⑥⑦⑧⑨⑴⑵⑶⑷⑸⑹⑺⑻⑼⒈⒉⒊⒋⒌⒍⒎⒏⒐
    >>> print(*digits_o[68:95], sep='')
    ❶❷❸❹❺❻❼❽❾➀➁➂➃➄➅➆➇➈➊➋➌➍➎➏➐➑➒
    >>> print(*digits_o[3:12], sep='')
    ፩፪፫፬፭፮፯፰፱

    >>> int(digits_o[68])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: invalid literal for int() with base 10: '❶'
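
If the numeric value of one of these 'No' digits is actually needed,
unicodedata.digit() may help, since it looks up the character's digit
property rather than parsing a decimal literal. This is a rough
sketch, assuming Python 3; the wrapper name digit_value is
hypothetical.

```python
import unicodedata

def digit_value(ch):
    """Return the digit value of a single character, or None if it has none."""
    # The second argument is the default returned when ch has no digit
    # value; without it, unicodedata.digit() raises ValueError.
    return unicodedata.digit(ch, None)

print(digit_value("\u2776"))  # DINGBAT NEGATIVE CIRCLED DIGIT ONE -> 1
print(digit_value("\u00b2"))  # SUPERSCRIPT TWO -> 2
print(digit_value("x"))       # no digit value -> None
```

This only handles one character at a time, so it does not replace
int() for multi-digit strings.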