[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-04 Thread Eric Smith

Eric Smith  added the comment:

See the discussion on python-dev, in particular Martin's comment at
http://mail.python.org/pipermail/python-dev/2009-December/094412.html

The solutions to this seem too complex for 2.x. It is not a problem in 3.x.

--
resolution:  -> wont fix
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-03 Thread Eric Smith

Eric Smith  added the comment:

I've raised the issue with unicode and locale on python-dev:
http://mail.python.org/pipermail/python-dev/2009-December/094408.html

Pending the outcome of that decision, I'll move forward on this issue.

--




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-03 Thread Mark Dickinson

Mark Dickinson  added the comment:

Reassigning to Eric.

--
assignee: mark.dickinson -> eric.smith




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-02 Thread Stefan Krah

Stefan Krah  added the comment:

Googling "multi-byte thousands separator" gives better results. From
those results, it is clear to me that decimal_point and thousands_sep
are strings that may be interpreted as multi-byte characters. The Czech
separator appears to be a no-break space multi-byte character.


http://sourceware.org/ml/libc-hacker/2007-01/msg5.html
http://drupal.org/node/353897


My point is that if a multi-byte character appears, it should be
counted as a single character for the purposes of calculating
min-width. Otherwise, the printed representation is too short.
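
A minimal Python 3 sketch of the mismatch described above, using the byte string reported in this thread. The helper name `width_deficit` is hypothetical and not part of any Python API, and the sketch assumes the separator bytes are UTF-8:

```python
def width_deficit(formatted: bytes, min_width: int, encoding: str = "utf-8") -> int:
    """Return how many characters the decoded text falls short of min_width."""
    return max(0, min_width - len(formatted.decode(encoding)))


# Output reported for format(Decimal("-1.5"), ' 019.18n') under cs_CZ.UTF-8.
s = b"-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5"

byte_length = len(s)                  # width as counted in bytes
char_length = len(s.decode("utf-8"))  # width as seen on a UTF-8 terminal
missing = width_deficit(s, 19)        # characters still needed to reach 19

print(byte_length, char_length, missing)
```

Each two-byte separator counts twice toward the byte length, so three separators account for the three missing characters.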

--




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-02 Thread Eric Smith

Eric Smith  added the comment:

In trunk, Modules/_localemodule.c also treats these as "string of char",
so at least we're consistent.

In py3k, mbstowcs is used and the result passed to PyUnicode_FromWideChar.

I'm not sure how you'd address this in locale in trunk, or if we want to
do something similar in localeutil.h in trunk (for the Unicode case).
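
As a rough pure-Python illustration of the py3k approach: `bytes.decode` stands in here for the C-level mbstowcs step, and `decode_lconv_field` is a hypothetical helper name. The encoding is assumed to be known in advance (UTF-8 for cs_CZ.UTF-8):

```python
def decode_lconv_field(raw: bytes, encoding: str) -> str:
    """Decode one localeconv() field from the locale's encoding into text."""
    return raw.decode(encoding)


# The Czech thousands separator as reported in this thread.
raw_sep = b"\xc2\xa0"
sep = decode_lconv_field(raw_sep, "utf-8")

print(len(raw_sep))     # number of bytes in the C string
print(len(sep))         # number of characters after decoding
print(sep == "\u00a0")  # whether the result is a single NO-BREAK SPACE
```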

--




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-02 Thread Eric Smith

Eric Smith  added the comment:

I don't see any documentation that a struct lconv should be interpreted
as UTF-8. In fact Googling "struct lconv utf-8" gives this bug report as
the first hit.

lconv.thousands_sep is char*. It's never been clear to me if this means
"pointer to a single char", or "pointer to a null terminated string of
char". In Objects/stringlib/localeutil.h I treat it as a string of char.

--




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-02 Thread Mark Dickinson

Mark Dickinson  added the comment:

So when the format string has type 'str' (as in Stefan's original example) 
rather than type 'unicode', I'd say Python is doing the right thing 
already:  everything in sight, including the separators coming from 
localeconv(), has type 'str', so trying to interpret things as unicode 
seems a bit of a stretch.

If the '\xc2\xa0' from localeconv()['thousands_sep'] is to be interpreted 
as a single unicode character, shouldn't it be a unicode
string already?

However, if localeconv()['thousands_sep'] *were* to give a unicode string, 
then I suppose Decimal.__format__ should be returning a unicode result;  I 
don't think it currently does this.  (Should this be true even if the 
number being formatted is so short that no thousands separators actually 
appear in it?)

--




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-02 Thread Stefan Krah

Stefan Krah  added the comment:

In python3.2, the output of decimal looks good. With float, the
separator is printed as two spaces on my Unicode terminal (export
LC_ALL=cs_CZ.UTF-8).

So decimal (3.2) interprets the separator string as a single UTF-8 char
and the final output is a UTF-8 string. I'd say that in C, this is the
intended way of using struct lconv.

If there is an agreement that the final output should be a UTF-8 string,
this looks correct to me.



Python 3.2a0 (py3k:76081M, Nov  6 2009, 15:23:48) 
[GCC 4.1.3 20080623 (prerelease) (Ubuntu 4.1.2-23ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale, decimal
>>> locale.setlocale(locale.LC_NUMERIC, 'cs_CZ.UTF-8')
'cs_CZ.UTF-8'
>>> x = format(decimal.Decimal("-1.5"),  '019.18n')
>>> y = format(float("-1.5"),  '019.18n')
>>> x
'-0\xa0000\xa0000\xa0000\xa0001,5'
>>> y
'-0Â 000Â 000Â 001,5'
>>> print(x)
-0 000 000 000 001,5
>>> print(y)
-0Â 000Â 000Â 001,5
>>>
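
One possible reading of the difference between x and y, sketched in Python 3: if each byte of the separator is widened to its own code point (which is what Latin-1 decoding does), the result is two characters instead of one. This is an assumption about the mechanism, inferred from the output above and not confirmed against the float formatting code:

```python
sep_bytes = b"\xc2\xa0"  # thousands_sep for cs_CZ.UTF-8, as reported in this thread

as_utf8 = sep_bytes.decode("utf-8")      # one NO-BREAK SPACE
as_latin1 = sep_bytes.decode("latin-1")  # one code point per byte

print(len(as_utf8), len(as_latin1))
print(ascii(as_utf8), ascii(as_latin1))
```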

--




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-01 Thread Eric Smith

Eric Smith  added the comment:

I can duplicate this on Linux. The difference is the values in the
locale for the separators, specifically,
locale.localeconv()['thousands_sep'].

>>> locale.localeconv()['thousands_sep']
'\xc2\xa0'

The question is: since a struct lconv contains char*s, how to interpret
them? The code in decimal interprets them as ascii, apparently. floats
do the same thing, so this isn't strictly a decimal problem. I'll have
to give it some thought.

--




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-01 Thread R. David Murray

R. David Murray  added the comment:

Interesting.  My regular locale is LC_CTYPE=en_US.UTF-8, and here is
what I get:

Python 2.7a0 (trunk:76501, Nov 24 2009, 13:59:01) 
[GCC 4.4.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> from decimal import Decimal
>>> s = format(Decimal("-1.5"),  ' 019.18n')
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> len(s)
19
>>> print s
-0 000 000 001,5

sys.stdout.encoding gives 'UTF-8'.

And here's the traceback from trying to use unicode:

>>> s = format(Decimal("-1.5"),  u' 019.18n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 3609, in __format__
    return _format_number(self._sign, intpart, fracpart, exp, spec)
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 5704, in _format_number
    return _format_align(sign, intpart+fracpart, spec)
  File "/home/rdmurray/python/trunk/Lib/decimal.py", line 5595, in _format_align
    result = unicode(result)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
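
The same failure can be reproduced in Python 3 without any locale setup. This sketch assumes that `unicode(result)` in Python 2 amounts to an implicit ASCII decode of the byte string, and it hard-codes the formatted value shown earlier in this message:

```python
raw = b"-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5"

try:
    raw.decode("ascii")       # rough equivalent of Python 2's unicode(result)
    position = None
except UnicodeDecodeError as exc:
    position = exc.start      # index of the first undecodable byte
    print(exc)

# Decoding with the locale's actual encoding succeeds.
text = raw.decode("utf-8")
print(position, len(text))
```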

--




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-01 Thread Eric Smith

Eric Smith  added the comment:

In 2.7, I get:

$ ./python.exe 
Python 2.7a0 (trunk:76501, Nov 24 2009, 14:57:21) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> from decimal import Decimal
>>> s = format(Decimal("-1.5"),  ' 019.18n')
>>> s
'-0 000 000 000 001,5'
>>> len(s)
20
>>> s = format(Decimal("-1.5"),  u' 019.18n')   
>>> s
u'-0 000 000 000 001,5'
>>> len(s)
20
>>> 

Could you give more details on the UnicodeDecodeError you get? Any
traceback?

--




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-12-01 Thread R. David Murray

R. David Murray  added the comment:

In python3:

>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> s = format(Decimal("-1.5"),  ' 019.18n')
>>> len(s)
20
>>> print(s)
-0 000 000 000 001,5

Python3 uses unicode for strings.  Python2 uses bytes.  To format
unicode in python2, you do:

>>> s2 = locale.format("% 019.18g", Decimal("-1.5"))
>>> len(s2)
19
>>> print s2
-0000000000000001,5

Not quite the same thing, clearly.  So, is there a way to access the
python3 unicode format semantics in python2?  Just passing format a
unicode format string results in a UnicodeDecodeError.

--
nosy: +r.david.murray
priority:  -> normal
type:  -> behavior
versions: +Python 2.6, Python 2.7




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-11-30 Thread Stefan Krah

Stefan Krah  added the comment:

What do you mean by "working with bytestrings"? The UTF-8 separators or
decimal points come directly from struct lconv (man localeconv). The
logical way to reach a minimum width of 19 is to have 19 UTF-8
characters, which can subsequently be converted to other formats.

--




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-11-28 Thread Matthew Barnett

Matthew Barnett  added the comment:

Surely this is to be expected when working with bytestrings. You should
be working in Unicode and using UTF-8 only for input and output.
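
A small Python 3 sketch of that workflow: decode at the input boundary, pad in text, encode at the output boundary. The sample value and the width of 12 are made up for illustration:

```python
raw = b"1\xc2\xa0234,5"        # hypothetical UTF-8 input with a no-break space separator

text = raw.decode("utf-8")     # decode once, on input
padded = text.rjust(12)        # pad by characters, not bytes
out = padded.encode("utf-8")   # encode once, on output

print(len(raw), len(text))     # byte length and character length of the input
print(len(padded), len(out))   # padded width in characters and in bytes
```

The padded text has exactly the requested number of characters, while the encoded output is one byte longer because the separator occupies two bytes.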

--
nosy: +mrabarnett




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-11-28 Thread Mark Dickinson

Changes by Mark Dickinson :


--
assignee:  -> mark.dickinson




[issue7327] format: minimum width: UTF-8 separators and decimal points

2009-11-15 Thread Stefan Krah

New submission from Stefan Krah :

This issue affects the format functions of float and decimal.

When calculating the padding necessary to reach the minimum width,
UTF-8 separators and decimal points are calculated by their byte
lengths. This can lead to printed representations that are too short.


Real world example (separator):

>>> import locale
>>> from decimal import *
>>> locale.setlocale(locale.LC_NUMERIC, "cs_CZ.UTF-8")
'cs_CZ.UTF-8'
>>> s = format(Decimal("-1.5"),  ' 019.18n')
>>> len(s)
19
>>> len(s.decode('utf-8'))
16
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> 
>>> 
>>> s = format(-1.5,  ' 019.18n')
>>> s
'-0\xc2\xa0000\xc2\xa0000\xc2\xa0001,5'
>>> len(s.decode('utf-8'))
16
>>> 


Constructed example (separator and decimal point):

>>> u = {'decimal_point' : "\xc2\xbf",  'grouping' : [3, 3, 0], 'thousands_sep': "\xc2\xb4"}
>>> def get_fmt(x, locale, fmt='n'):
...     return Decimal.__format__(Decimal(x), fmt, _localeconv=locale)
... 
>>> s = get_fmt(Decimal("1.5"), u, "020n")
>>> s
'00\xc2\xb4000\xc2\xb4000\xc2\xb4001\xc2\xbf5'
>>> len(s.decode('utf-8'))
16

--
messages: 95283
nosy: eric.smith, mark.dickinson, skrah
severity: normal
status: open
title: format: minimum width: UTF-8 separators and decimal points
