[issue10581] Review and document string format accepted in numeric data type constructors

2014-10-14 Thread Stefan Krah

Changes by Stefan Krah stefan-use...@bytereef.org:


--
nosy:  -skrah

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10581
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-23 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

Martin v. Löwis wrote at #18236 (msg191687):
> int conversion ultimately uses Py_ISSPACE, which conceptually could
> deviate from the Unicode properties (as it is byte-based). This is not
> really an issue, since they indeed match.

Py_ISSPACE matches the Unicode White_Space property in the ASCII range (the
first 128 code points), but it differs for byte (code point) values from 128
through 255.  This leads to the following discrepancy:

>>> int('123\xa0')
123

but

>>> int(b'123\xa0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 3: invalid start byte
>>> int('123\xa0'.encode())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\xa0'
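
The mismatch in the 128 through 255 range can be listed directly. This is a
minimal sketch that assumes bytes.isspace() accepts the same ASCII-only set as
Py_ISSPACE and that str.isspace() stands in for the Unicode side; neither
equivalence is stated in this thread:

```python
# Code points 128..255 that str.isspace() treats as whitespace but that
# bytes.isspace() (ASCII-only, assumed here to mirror Py_ISSPACE) does not.
mismatches = [
    cp for cp in range(128, 256)
    if chr(cp).isspace() and not bytes([cp]).isspace()
]
print([hex(cp) for cp in mismatches])
```

NO-BREAK SPACE (0xa0), the character used in the session above, is among the
values reported.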

--




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-23 Thread Alexander Belopolsky

Changes by Alexander Belopolsky alexander.belopol...@gmail.com:


--
nosy: +loewis




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-23 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

For the last discrepancy see issue16741. It has a patch that should fix this.

--
nosy: +serhiy.storchaka




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-16 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

I took another look at the library reference, and when it comes to support for
non-ASCII digits, the reference contradicts itself.  On one hand,


int(x, base=10)

If x is not a number or if base is given, then x must be a string, bytes, or 
bytearray instance representing an integer literal in radix base. Optionally, 
the literal can be preceded by + or - (with no space in between) and surrounded 
by whitespace.
 http://docs.python.org/3/library/functions.html#int

.. suggests that only an integer literal will be accepted by int(), but on
the other hand, a note in the Numeric Types section says: "The numeric
literals accepted include the digits 0 to 9 or any Unicode equivalent (code
points with the Nd property)."
http://docs.python.org/3/library/stdtypes.html#typesnumeric

It also appears that the "surrounded by whitespace" part is not entirely correct:

>>> '\N{RS}'.isspace()
True
>>> int('123\N{RS}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '123\x1e'

This is probably a bug in the current implementation and I will open a separate 
issue for that.

--
versions: +Python 3.4




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-16 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

I opened issue18236 to address the issue of surrounding whitespace.

--




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-16 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

I have started a rough prototype for what I plan to eventually reimplement in C 
and propose as a patch here.

https://bitbucket.org/alexander_belopolsky/misc/src/c175171cc76e/utoi.py?at=master

Comments welcome.

--




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-14 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 14.06.2013 03:43, Alexander Belopolsky wrote:
>
> Alexander Belopolsky added the comment:
>
> PEP 393 implementation has already added the fast path to decimal encoding:
>
> http://hg.python.org/cpython/diff/8beaa9a37387/Objects/unicodeobject.c#l1.3735
>
> What we can do, however, is improve performance of converting non-ascii
> numerals by looking up only the first digit's value and converting the rest
> using simple:
>
> value = code - (first_code - first_value)
> if not 0 <= value < 10:
>    raise or fall back to UCD lookup

I'm not sure whether just relying on PEP 393 is good enough.

Of course, you can special case the conversion based on the
kind, but that's only one form of optimization.

Slicing operations don't recheck the max code point
used in the substring. As a result, a slice may very well
be of the UCS2 kind, even though the text itself is ASCII.

Apart from the fast-path based on the string kind,
I think the decimal encoder would also have to scan the
string for non-ASCII code points. If it finds non-ASCII
code points, it would have to call the normalizer and
restart the scan based on the normalized string.

-- 
Marc-Andre Lemburg
eGenix.com


--




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-13 Thread Nick Coghlan

Nick Coghlan added the comment:

I think PEP 393 gives us a quick way to fast parsing: if the max char is < 128,
just roll straight into normal processing, otherwise do the normalisation and
"all decimal digits are from the same script" steps.

There are almost certainly better ways to do the script translation, but the
example below tries to just do the "convert to ASCII" step to avoid duplicating
the +/- and decimal point processing logic:

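# max_char(), toNFKC() and parse_ascii_number() are placeholder helpers.
if max_char(arg) >= 128:
    arg = toNFKC(arg)
    originals = set()
    converted = []
    for c in arg:
        try:
            d = str(unicodedata.decimal(c))
        except ValueError:
            # Not a decimal digit (sign, decimal point, ...): keep as-is.
            d = c
        else:
            # Record the code point so the range check below can subtract.
            originals.add(ord(c))
        converted.append(d)
    # Digits from a single script span at most 10 consecutive code points.
    if originals and (max(originals) - min(originals)) >= 10:
        raise ValueError("%s mixes digits from multiple scripts" % arg)
    arg = "".join(converted)
result = parse_ascii_number(arg)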


P.S. I don't think the base argument is especially applicable ('0x' is rejected 
because 'x' is not a base 10 digit and we allow a base of '0' to request use 
int literal base markers).

--
nosy: +ncoghlan




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-13 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

PEP 393 implementation has already added the fast path to decimal encoding:

http://hg.python.org/cpython/diff/8beaa9a37387/Objects/unicodeobject.c#l1.3735

What we can do, however, is improve performance of converting non-ascii 
numerals by looking up only the first digit's value and converting the rest 
using simple:

value = code - (first_code - first_value)
if not 0 <= value < 10:
   raise or fall back to UCD lookup
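
The offset trick above can be sketched in runnable Python. This is only an
illustration of the idea, not the planned C implementation: the function name
digits_to_int is hypothetical, and the sketch always raises instead of falling
back to a UCD lookup:

```python
import unicodedata

def digits_to_int(text):
    """Convert a run of decimal digits from a single script to an int."""
    if not text:
        raise ValueError("empty digit string")
    # One database lookup, for the first digit only; unicodedata.decimal()
    # raises ValueError if the character is not a decimal digit.
    first_value = unicodedata.decimal(text[0])
    zero = ord(text[0]) - first_value  # code point of this script's zero
    result = 0
    for ch in text:
        value = ord(ch) - zero
        if not 0 <= value < 10:
            # Hypothetical policy: reject rather than fall back to the UCD.
            raise ValueError("digit %r is outside the expected range" % ch)
        result = result * 10 + value
    return result

print(digits_to_int('\u0663\u0661\u0664'))  # Arabic-Indic digits 3, 1, 4
```

Because every later digit is checked against the range of the first one, a
string that mixes scripts is rejected as a side effect.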

--




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-12 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 12.06.2013 07:32, Alexander Belopolsky wrote:
>
> Alexander Belopolsky added the comment:
>
> It looks like we are approaching consensus on some points:
>
> 1. Mixed script numerals should be disallowed.
> 2. '\N{MINUS SIGN}' should be accepted as an alternative to '\N{HYPHEN-MINUS}'
>
> Open question: should we accept fullwidth + and -, sub/superscript variants
> etc.?  I believe rather than debating variant codepoints one by one, we
> should consider applying NFKC (compatibility) normalization to unicode
> strings to be interpreted as numbers.  This would allow parsing strings like
> this:
>
> >>> float(normalize('NFKC', '\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
> -1.2

While it would solve these cases, I think that would cause a
significant performance hit.

Perhaps we could do this in two phases:
1. detect whether the string uses non-ASCII digits and symbols
2. if it does, apply normalization and then use the decimal codec
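
A rough Python sketch of the two phases, for illustration only. The name
parse_float is hypothetical, and the built-in float() stands in for the
decimal codec step here, which is an assumption about where that step would
live:

```python
import unicodedata

def parse_float(text):
    """Two-phase float parsing: normalize only when non-ASCII is present."""
    # Phase 1: a cheap scan; pure-ASCII input goes straight to float().
    if all(ord(ch) < 128 for ch in text):
        return float(text)
    # Phase 2: fold compatibility variants (fullwidth signs and digits,
    # "DIGIT ONE FULL STOP", ...) to their plain forms, then convert.
    return float(unicodedata.normalize('NFKC', text))

print(parse_float('\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
```

With this split, ASCII-only input never pays for normalization.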

--




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-11 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

I've changed my mind :-)

Restricting the decimal encoder to only accept code points from one of the 
possible decimal digit ranges is a good idea. Let's do that.

--




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-11 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

It looks like we are approaching consensus on some points:

1. Mixed script numerals should be disallowed.
2. '\N{MINUS SIGN}' should be accepted as an alternative to '\N{HYPHEN-MINUS}'

Open question: should we accept fullwidth + and -, sub/superscript variants 
etc.?  I believe rather than debating variant codepoints one by one, we should 
consider applying NFKC (compatibility) normalization to unicode strings to be 
interpreted as numbers.  This would allow parsing strings like this:

>>> float(normalize('NFKC', '\N{FULLWIDTH HYPHEN-MINUS}\N{DIGIT ONE FULL STOP}\N{FULLWIDTH DIGIT TWO}'))
-1.2

--




[issue10581] Review and document string format accepted in numeric data type constructors

2013-06-10 Thread Chris Rebert

Changes by Chris Rebert pyb...@rebertia.com:


--
nosy: +cvrebert




[issue10581] Review and document string format accepted in numeric data type constructors

2011-05-14 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

> I find it convenient to use int(), float() etc. for data validation.

Me too.  This is why I'd still be happiest with int and float not accepting 
non-ASCII digits at all.  (And also why the recent suggestions to allow extra 
underscores in int and float input make me uneasy.)

--




[issue10581] Review and document string format accepted in numeric data type constructors

2011-05-08 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Sat, May 7, 2011 at 11:25 AM, Éric Araujo rep...@bugs.python.org wrote:
> .. On one hand, I tend to agree that mixing Hindi/Arab numerals with Bengali
> does not make sense; on the other hand, rejecting it means that the int code
> does know about Unicode, which you argued against.

In order to flag use of mixed scripts in numerals, the code does not
require access to any additional unicode data.  Since Unicode 6.0.0,
programmers can rely on the following stability promise:


Characters with the property value Numeric_Type=de (Decimal) only
occur in contiguous ranges of 10 characters, with ascending numeric
values from 0 to 9 (Numeric_Value=0..9).
  -- http://www.unicode.org/policies/stability_policy.html

Therefore, the validation code can simply check that for all digits in
the number, ord(d) - unicodedata.numeric(d) is the same.
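
A minimal sketch of that check. The function name same_script is
hypothetical, and unicodedata.decimal() is swapped in for
unicodedata.numeric() so that the offsets stay integers and non-decimal
numerics such as fractions are rejected with ValueError:

```python
import unicodedata

def same_script(digits):
    """Return True if all decimal digits share one contiguous digit range."""
    # ord(d) - value is the code point of the range's zero digit, so it is
    # identical for every digit drawn from the same range.
    zeros = {ord(d) - unicodedata.decimal(d) for d in digits}
    return len(zeros) <= 1

print(same_script('\u0663\u0661\u0664'))  # Arabic-Indic digits only
print(same_script('\u0663\u09e7'))        # Arabic-Indic mixed with Bengali
```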

--




[issue10581] Review and document string format accepted in numeric data type constructors

2011-05-07 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

> I may be in minority, but I find it convenient to use int(), float()
> etc. for data validation.
A number of libraries agree: argparse, HTML form handling libs, etc.

> I may be too strict, but I don't think anyone would want to see
> columns with a mix of Bengali and Devanagari numerals. [...]
> On the other hand there is certain convenience in promiscuous
> parsers, but this is not an expectation that I have from int() and
> friends. [...] There are pros and cons in any approach.
Indeed, tough question.  On one hand, I tend to agree that mixing Hindi/Arab 
numerals with Bengali does not make sense; on the other hand, rejecting it 
means that the int code does know about Unicode, which you argued against.

[MAL]
>> The codecs, Unicode methods and other Unicode support features
>> happily work with all kinds of languages, mixed or not, without any
>> such specification.
> In my view int() and friends are only marginally related to Unicode
> and Unicode methods design is not directly relevant to their behavior.
I think I agree.  It’s perfectly fine that Unicode support features don’t care 
about the type of the characters but just encode and decode; however, int has a 
validation step.  It rejects numerals that don’t make sense with the given base 
for example, so rejecting nonsensical sequences of Unicode numerals makes sense 
IMO.

What do other languages that are able to convert Unicode numerals to integer
objects do?

--
nosy: +eric.araujo




[issue10581] Review and document string format accepted in numeric data type constructors

2010-11-29 Thread Alexander Belopolsky

New submission from Alexander Belopolsky belopol...@users.sourceforge.net:

I am opening a new report to continue work on the issues raised in #10557 that 
are either feature requests or documentation bugs.

The rest is my reply to the relevant portions of Marc's comment at msg122785.

On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg rep...@bugs.python.org 
wrote:
..
> Alexander Belopolsky wrote:
>>
>> Alexander Belopolsky belopol...@users.sourceforge.net added the comment:
>>
>> After a bit of svn archeology, it does appear that Arabic-Indic
>> digits' support was deliberate at least in the sense that the
>> feature was tested for when the code was first committed. See r15000.
>
> As I mentioned on python-dev
> (http://mail.python.org/pipermail/python-dev/2010-November/106077.html)
> this support was added intentionally.
>
>> The test migrated from file to file over the last 10 years, but it
>> is still present in test_float.py:
>>
>>         self.assertEqual(float(b"  \u0663.\u0661\u0664  ".decode('raw-unicode-escape')), 3.14)
>>
>> (It should probably be now rewritten using a string literal.)
>
..
>> For the future, I note that starting with Unicode 6.0.0,
>> the Unicode Consortium promises that
>>
>> Characters with the property value Numeric_Type=de (Decimal) only
>> occur in contiguous ranges of 10 characters, with ascending numeric
>> values from 0 to 9 (Numeric_Value=0..9).
>>
>> This makes it very easy to check a numeric string does not contain
>> a mix of digits from different scripts.
>
> I'm not sure why you'd want to check for such ranges.


In order to disallow a mix of say Arabic-Indic and Bengali digits.  Such 
combinations cannot be defended as possibly valid numbers in any script.

>> I still believe that proper API should require explicit choice of
>> language or locale before allowing digits other than 0-9 just as
>> int() would not accept hexadecimal digits without explicit choice of
>> base = 16.  But this would be a subject of a feature request.
>
> Since when do we require a locale or language to be specified when
> using Unicode ?


This is a valid question.  I may be in minority, but I find it convenient to 
use int(), float() etc. for data validation.  If my program gets a CSV file 
with Arabic-Indic digits, I want to fire the guy who prepared it before it 
causes real issues. :-)  I may be too strict, but I don't think anyone would 
want to see columns with a mix of Bengali and Devanagari numerals.

On the other hand there is certain convenience in promiscuous parsers, but this 
is not an expectation that I have from int() and friends.  int('0xFF') requires 
me to specify base even though 0xFF is a perfectly valid notation.
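
For reference, the current behavior for '0xFF' can be checked with a small
sketch; the helper name parse is hypothetical and only wraps int() so that a
rejected string shows up as None:

```python
def parse(text, base=10):
    """Return int(text, base), or None if int() rejects the input."""
    try:
        return int(text, base)
    except ValueError:
        return None

default_base = parse('0xFF')       # rejected: 'x' is not a base 10 digit
explicit_base = parse('0xFF', 16)  # accepted once the base is given
literal_base = parse('0xFF', 0)    # base 0 honours the literal prefix
print(default_base, explicit_base, literal_base)
```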

There are pros and cons in any approach.  Let's figure out what is better 
before we fix the documentation.

> The codecs, Unicode methods and other Unicode support features
> happily work with all kinds of languages, mixed or not, without any
> such specification.

In my view int() and friends are only marginally related to Unicode and Unicode 
methods design is not directly relevant to their behavior.  If we were 
designing str.todigits(), by all means, I would argue that it must be 
consistent with str.isdigit().  For numeric data, however, I think we should 
follow the logic that rejected int('0xFF').

This is my opinion.  We can consider allowing int('0xFF') as well.  Let's 
discuss.

--
assignee: belopolsky
components: Documentation, Interpreter Core
messages: 122834
nosy: belopolsky, eric.smith, ezio.melotti, haypo, lemburg, mark.dickinson, 
skrah
priority: normal
severity: normal
status: open
title: Review and document string format accepted in numeric data type 
constructors
type: feature request
versions: Python 3.3




[issue10581] Review and document string format accepted in numeric data type constructors

2010-11-29 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

See also issue 9574 for a somewhat related discussion.

--
