Tom Christiansen tchr...@perl.com added the comment:
I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
which fully implements tr11. It includes Unicode::GCString, a class
that has a columns() method to determine the print columns. This is very
fancy in the case of Asian
Tom Christiansen tchr...@perl.com added the comment:
Martin v. Löwis mar...@v.loewis.de added the comment:
Martin, I think you meant to write if w == 'A':.
Some very common characters have ambiguous widths though (e.g. the Greek
alphabet), so you can't just raise an error for them
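
A minimal sketch of the width question discussed above, using only the standard unicodedata module. The function name columns and the ambiguous_width parameter are hypothetical, chosen for illustration; this is not the tr11 implementation from Unicode::LineBreak, and it ignores control characters and grapheme clustering.

```python
import unicodedata


def columns(text, ambiguous_width=1):
    """Estimate print columns from East_Asian_Width (a rough sketch only)."""
    total = 0
    for ch in text:
        # Combining marks attach to the previous character and take no column.
        if unicodedata.combining(ch):
            continue
        width_class = unicodedata.east_asian_width(ch)
        if width_class in ("W", "F"):
            # Wide and Fullwidth characters occupy two columns.
            total += 2
        elif width_class == "A":
            # Ambiguous characters (e.g. Greek letters) depend on context,
            # so the caller chooses the width instead of getting an error.
            total += ambiguous_width
        else:
            total += 1
    return total
```

Passing ambiguous_width=2 would model an East Asian legacy context; the default of 1 models everything else.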
Tom Christiansen tchr...@perl.com added the comment:
Martin v. Löwis mar...@v.loewis.de added the comment:
I would encourage you to look at the Perl CPAN module Unicode::LineBreak,
which fully implements tr11.
Thanks for the pointer!
If you'd like, I can show you a program that uses
Tom Christiansen tchr...@perl.com added the comment:
Yes, it looks good. Thank you very much.
-tom
--
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12753
Tom Christiansen tchr...@perl.com added the comment:
Martin v. Löwis mar...@v.loewis.de added the comment:
I think the WideCharToMultibyte approach is just incorrect.
I'm -1 on using wcswidth, though.
Like you, I too seriously question using wcswidth() for this at all:
The wcswidth
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote
on Sun, 09 Oct 2011 13:21:00 -:
Here is a new patch that stores the names of aliases and named
sequences in the Private Use Area.
Looks good! Thanks!
--tom
--
title: \N
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote
on Mon, 03 Oct 2011 04:15:51 -:
But it still has to happen at compile time, of course, so I don't know
what you could do in Python. Is there any way to change how the compiler
behaves even
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote
on Sun, 02 Oct 2011 06:46:26 -:
Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely
because that's a Unicode 1 name, and nowadays these codepoints are simply
marked
Tom Christiansen tchr...@perl.com added the comment:
Really? White space makes things harder to read? I thought Pythonistas
believed the opposite of that.
I was surprised at that too ;-). One person's opinion in a specific
context. Don't generalize.
The example I initially showed
Tom Christiansen tchr...@perl.com added the comment:
Martin v. Löwis rep...@bugs.python.org wrote
on Sat, 01 Oct 2011 10:59:48 -:
* Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.
Where did you get that definition from? UTS#18 defines
word_character, which is Alphabetic + U
Tom Christiansen tchr...@perl.com added the comment:
Perl does not provide the old 1.0 names at all. We don't have a Unicode
1.0 legacy to support, which makes this cleaner. However, we do provide
for the names of the C0 and C1 Control Codes, because apart from Unicode
1.0, they don't
Tom Christiansen tchr...@perl.com added the comment:
Martin v. Löwis mar...@v.loewis.de added the comment:
Split S into words. Change the first letter in a word to upper-case,
Except that I think you actually mean that the first letter is
changed into titlecase not uppercase.
One might
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti ezio.melo...@gmail.com added the comment:
Leaving named sequences for unicodedata.lookup() only (and not for
\N{}) makes sense.
There are certainly advantages to that strategy: you don't have to
deal with [\N{sequence}] issues
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote
on Mon, 19 Sep 2011 11:11:48 -:
We could also look at what other languages do and/or ask the
Unicode Consortium.
I will look at what Java does a bit later on this morning, which
Tom Christiansen tchr...@perl.com added the comment:
No good news on the Java front. They do all kinds of things wrong.
For example, they allow intermixed CESU-8 and UTF-8 in a real UTF-8
input stream, which is illegal. There's more they do wrong, including
in their documentation, but I
Tom Christiansen tchr...@perl.com added the comment:
It appears that I'm right about surrogates, but wrong about
noncharacters. I'm seeking a clarification there.
--tom
--
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12729
Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy rep...@bugs.python.org wrote
on Thu, 08 Sep 2011 18:56:11 -:
On 9/8/2011 4:32 AM, Ezio Melotti wrote:
So to summarize a bit, there are different possible levels of strictness:
1) all the possible encodable values
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote
on Sat, 03 Sep 2011 00:28:03 -:
Ezio Melotti ezio.melo...@gmail.com added the comment:
Or they are still called UTF-8 but used in combination with different error
handlers, like
Tom Christiansen tchr...@perl.com added the comment:
Antoine Pitrou rep...@bugs.python.org wrote
on Mon, 29 Aug 2011 13:21:06 -:
It's not only typographically speaking, it's really a spelling error,
even in hand-written text :-)
Sure, and so too is omitting an accent mark
Tom Christiansen tchr...@perl.com added the comment:
Antoine Pitrou rep...@bugs.python.org wrote on Sat, 27 Aug 2011 20:04:56
-:
Neither am I. Even in old-style English with ae and oe, one wrote
ÆGYPT and ÆSIR all caps but Ægypt and Æsir in titlecase, not *Aegypt or
*Aesir. Similarly
Tom Christiansen tchr...@perl.com added the comment:
Guido van Rossum rep...@bugs.python.org wrote
on Sat, 27 Aug 2011 03:26:21 -:
To me, making (default) iteration deviate from indexing is anathema.
So long as there's a way to iterate through a string some other way
than by code
Tom Christiansen tchr...@perl.com added the comment:
Guido van Rossum rep...@bugs.python.org wrote
on Fri, 26 Aug 2011 21:11:24 -:
Would this also affect .islower() and friends?
SHORT VERSION: (7 lines)
I don't believe so, but the relationship between lower() and islower
Tom Christiansen tchr...@perl.com added the comment:
Guido van Rossum rep...@bugs.python.org wrote
on Sat, 27 Aug 2011 16:15:33 -:
Although personally I don't have much of an intuition for what
titlecase means (and why it's important), perhaps because I'm not
familiar with any
Tom Christiansen tchr...@perl.com added the comment:
Sounds like a fair feature request for Python 3.3, as long as the
intention is that users must import some module from the standard
library and use functions defined in that module. The operations and
methods defined for str instances
Tom Christiansen tchr...@perl.com added the comment:
Guido van Rossum rep...@bugs.python.org wrote
on Fri, 26 Aug 2011 21:16:57 -:
Yeah, this should be fixed in 3.3 and probably backported to 3.2
and 2.7. (There is already no guarantee that len(s) ==
len(s.title()), right?)
Well
Tom Christiansen tchr...@perl.com added the comment:
Raymond Hettinger raymond.hettin...@gmail.com added the comment:
I would like to be involved in the design of the API for a UCA module
and its routines for loading Unicode Collation Element Tables (not
making the mistake of using global
Tom Christiansen tchr...@perl.com added the comment:
I should probably mention the importance in the design of a UCA module of
being able to specify which UCA version number you want it to behave like
in case you plan to override some of the DUCET entries. That way if you
run under a later UCA
Tom Christiansen tchr...@perl.com added the comment:
Guido van Rossum rep...@bugs.python.org wrote
on Fri, 26 Aug 2011 21:55:03 -:
I know I sound like NIH, but I'm always reluctant to add a big 3rd
party lib like ICU to the permanent dependencies of all future Python
distros
Tom Christiansen tchr...@perl.com added the comment:
Guido van Rossum rep...@bugs.python.org wrote
on Fri, 26 Aug 2011 21:11:24 -:
Guido van Rossum gu...@python.org added the comment:
I presume this applies to builtin str methods like .lower(), right? I
think it is a good thing
Tom Christiansen tchr...@perl.com added the comment:
Here’s my casing test suite; I thought I sent it in but the mux file here isn’t
the full thing.
It does several things, including letting you run it with regex vs re. It
also checks for the islower, etc functions. It has both simple
Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy rep...@bugs.python.org wrote
on Fri, 19 Aug 2011 22:50:58 -:
My current opinion is that adding the aliases might be done in current
releases. It certainly would serve any user who does not know to
misspell
Tom Christiansen tchr...@perl.com added the comment:
Matthew Barnett rep...@bugs.python.org wrote
on Fri, 19 Aug 2011 23:36:45 -:
For the Line_Break property, one of the possible values is
Inseparable, with 2 permitted aliases, the shorter IN (which
is reasonable) and Inseperable
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti ezio.melo...@gmail.com added the comment:
I think the 4 macros:
#define _Py_UNICODE_ISSURROGATE
#define _Py_UNICODE_ISHIGHSURROGATE
#define _Py_UNICODE_ISLOWSURROGATE
#define _Py_UNICODE_JOIN_SURROGATES
are quite
Tom Christiansen tchr...@perl.com added the comment:
I now see there are lots of good things in the BOM FAQ that have come up
lately regarding surrogates and other illegal characters, and about what
can go in data streams.
I quote a few of these from http://unicode.org/faq/utf_bom.html below
Tom Christiansen tchr...@perl.com added the comment:
Antoine Pitrou rep...@bugs.python.org wrote
on Tue, 16 Aug 2011 09:18:46 -:
I think the 4 macros:
#define _Py_UNICODE_ISSURROGATE
#define _Py_UNICODE_ISHIGHSURROGATE
#define _Py_UNICODE_ISLOWSURROGATE
#define
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote
on Tue, 16 Aug 2011 09:23:50 -:
All the other macros[0] follow the same convention, e.g. Py_UNICODE_ISLOWER
and Py_UNICODE_TOLOWER. I agree that keeping the words separate makes them
more
Tom Christiansen tchr...@perl.com added the comment:
Marc-Andre Lemburg rep...@bugs.python.org wrote
on Tue, 16 Aug 2011 12:11:22 -:
The reasoning behind e.g. ISSURROGATE is that those names originate
from and are consistent with the already existing ISLOWER/ISUPPER/ISTITLE
macros
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote on Mon, 15 Aug 2011 04:56:55 -:
Another thing I noticed is that (at least on wide builds) surrogate pairs are
not joined on the fly:
>>> p
'\ud800\udc00'
>>> len(p)
2
>>> p.encode('utf-16').decode
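
For readers on current Python: since 3.3 there are no narrow or wide builds, and a str holding two surrogate code points is never joined implicitly. The sketch below shows one way to join them explicitly; the helper name join_surrogates is hypothetical, and it assumes Python 3.4 or later, where the surrogatepass error handler works with the UTF-16 codecs.

```python
def join_surrogates(text):
    """Combine surrogate pairs in a str into single astral code points."""
    # Encoding with surrogatepass writes each surrogate as one UTF-16 code
    # unit; decoding then reads a high/low pair as one supplementary
    # character. Unpaired surrogates pass through unchanged.
    raw = text.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")


pair = "\ud800\udc00"
joined = join_surrogates(pair)
print(len(pair), len(joined))
```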
Changes by Tom Christiansen tchr...@perl.com:
--
nosy: +tchrist
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12746
New submission from Tom Christiansen tchr...@perl.com:
Unicode character names share a common namespace with formal aliases and with
named sequences, but Python recognizes only the original name. That means not
everything in the namespace is accessible from Python. (If this is construed
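
A short illustration of the namespace point, assuming Python 3.3 or later, where unicodedata.lookup() and \N{} accept formal aliases. U+01A2 is used because its alias is a published correction of its original name.

```python
import unicodedata

# U+01A2 carries the original (misleading) name LATIN CAPITAL LETTER OI.
original_name = unicodedata.name("\u01a2")

# Its formal alias LATIN CAPITAL LETTER GHA resolves to the same code point
# through lookup() and through the \N{} escape in a string literal.
by_alias = unicodedata.lookup("LATIN CAPITAL LETTER GHA")
by_escape = "\N{LATIN CAPITAL LETTER GHA}"

print(original_name, hex(ord(by_alias)), hex(ord(by_escape)))
```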
Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy tjre...@udel.edu added the comment:
My Firefox is already set at utf-8. More likely a font limitation. I
will look again after installing one of the fonts Tom suggested.
Symbola is best for exotic glyphs, especially astral
Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy tjre...@udel.edu added the comment:
You are right, FF switched on me without notice. Bad FF. Thank you! What
I now see makes much more sense.
[ мЯхШщЯл, мЯхШщЯл, ДЯхШщЯл, ДЇНЀСЇГ ],
and I now know to check on other
Tom Christiansen tchr...@perl.com added the comment:
Sorry I didn't include a test case. Hope this makes up for it. If not, please
tell me how to write better test cases. :(
Yeah ok, so I'm a bit persnickety or even unorthodox about my vertical
alignment, but it really helps to make what
Tom Christiansen tchr...@perl.com added the comment:
Oh whoops, that was the long ticket. Shall I reupload to the right number?
--
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12734
Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy tjre...@udel.edu added the comment:
Adding Symbola filled in the symbols and emoticons lines.
The gothic chars are still missing even with Alfios.
That's too bad, as the Gothic paternoster is kinda cute. :)
Hm, I wonder where
Tom Christiansen tchr...@perl.com added the comment:
Here’s the right test file for the right ticket.
--
Added file: http://bugs.python.org/file22903/nametests.py
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12753
Changes by Tom Christiansen tchr...@perl.com:
Removed file: http://bugs.python.org/file22902/nametests.py
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12734
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti ezio.melo...@gmail.com added the comment:
It is simply a design error to pretend that the number of characters
is the number of code units instead of code points. A terrible and
ugly one, but it does not mean you are UCS-2
New submission from Tom Christiansen tchr...@perl.com:
On neither narrow nor wide builds does this UTF8-encoded bit run without
raising an exception:
import re
if re.search("[𝒜-𝒵]", "𝒞", re.UNICODE):
    print("match 1 passed")
else:
    print("match 2 failed")
The best you can possibly do
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote
on Sun, 14 Aug 2011 07:15:09 -:
Unicode says you can't put surrogates or noncharacters in a
UTF-anything stream. It's a bug to do so and pretend it's a
UTF-whatever.
The UTF-8 codec
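
As a sketch of how this looks in Python 3 (the variable names are illustrative only): the strict UTF-8 codec refuses surrogates in both directions, the surrogatepass error handler opts out of that check, and noncharacters such as U+FFFE are encoded without complaint.

```python
lone_surrogate = "\ud800"

# Strict encoding of a surrogate code point is rejected.
try:
    lone_surrogate.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# surrogatepass produces the three-byte form that strict UTF-8 forbids.
raw = lone_surrogate.encode("utf-8", "surrogatepass")

# Strict decoding of those bytes is rejected as well.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# A noncharacter, by contrast, round-trips through the strict codec.
noncharacter_bytes = "\ufffe".encode("utf-8")

print(strict_encode_ok, raw, strict_decode_ok, noncharacter_bytes)
```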
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti ezio.melo...@gmail.com added the comment:
On wide 3.2 it passes too, so the failure is limited to narrow builds (are
you sure that it fails on wide builds for you?).
You're right: my wide build is not Python3, just Python2
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote
on Sun, 14 Aug 2011 07:15:09 -:
For example I don't think removing the 0x10 upper limit is going to
happen -- even if it might be useful for other things.
I agree entirely. That's why
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote on Sun, 14 Aug 2011 17:15:52 -:
You're right: my wide build is not Python3, just Python2.
And is it failing? Here the tests pass on the wide builds, on both Python 2
and 3.
Perhaps I am
Tom Christiansen tchr...@perl.com added the comment:
Ezio Melotti rep...@bugs.python.org wrote
on Sun, 14 Aug 2011 17:46:55 -:
I'm a bit confused on this. You no longer fix bugs in Python 2?
We do, but it's unlikely that we will introduce major changes in behavior.
Even if we had
Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy rep...@bugs.python.org wrote
on Mon, 15 Aug 2011 00:26:53 -:
PS: The OSCON link in msg142036 currently gives me 404 not found
Sorry, I wrote
http://training.perl.com/OSCON/index.html
but meant
http
Tom Christiansen tchr...@perl.com added the comment:
I wrote:
Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.
So I'm finding. Perhaps that's why I keep getting confused. I do have a
pretty firm notion of what UCS-2 and UTF-16 are, and so I get sometimes
self
Tom Christiansen tchr...@perl.com added the comment:
David Murray rep...@bugs.python.org wrote:
Tom, note that nobody is arguing that what you are requesting is a bad
thing :)
There looked to be some minor resistance, based on absolute backwards
compatibility even if wrong, regarding
Tom Christiansen tchr...@perl.com added the comment:
Matthew Barnett rep...@bugs.python.org wrote
on Sat, 13 Aug 2011 20:57:40 -:
There are occasions when you want to do string slicing, often of the form:
pos = my_str.index(x)
endpos = my_str.index(y)
substring = my_str[pos
Tom Christiansen tchr...@perl.com added the comment:
Antoine Pitrou rep...@bugs.python.org wrote
on Sat, 13 Aug 2011 21:09:52 -:
And/or a lookup table giving the byte offset of, say, every 16th
character. It gives you a O(1) lookup with a relatively reasonable
constant cost (you have
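
The lookup-table idea can be sketched roughly as follows. This is a toy illustration, not a proposal for CPython's internals; the class name IndexedUTF8 and the STRIDE constant are hypothetical. It stores UTF-8 bytes plus the byte offset of every 16th character, so indexing jumps to a checkpoint and scans forward over at most 15 characters.

```python
class IndexedUTF8:
    """UTF-8 storage with a checkpoint table for bounded-cost indexing."""

    STRIDE = 16  # record the byte offset of every 16th character

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.length = len(text)
        self.offsets = []
        count = 0
        for pos, byte in enumerate(self.data):
            # Any byte that is not a continuation byte (10xxxxxx) starts
            # a new character.
            if byte & 0xC0 != 0x80:
                if count % self.STRIDE == 0:
                    self.offsets.append(pos)
                count += 1

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError("index out of range")
        # Jump to the nearest checkpoint at or before the target.
        pos = self.offsets[index // self.STRIDE]
        # Walk forward over at most STRIDE - 1 characters.
        for _ in range(index % self.STRIDE):
            pos += 1
            while self.data[pos] & 0xC0 == 0x80:
                pos += 1
        # Find the end of the target character and decode just that slice.
        end = pos + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[pos:end].decode("utf-8")
```

The scan is bounded by the stride, so the lookup cost does not grow with string length; the table costs one integer per 16 characters.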
Tom Christiansen tchr...@perl.com added the comment:
Here's why I say that Python uses UTF-16 not UCS-2 on its narrow builds.
Perhaps someone could tell me why the Python documentation says it uses
UCS-2 on a narrow build.
There's a disagreement on that point between several developers
Tom Christiansen tchr...@perl.com added the comment:
Whoops, I meant that it appears that Python runs its identifiers through NFC.
How that gets along with a filesystem that has quasi-NFD filenames I'm not
sure, but it seems like it might be a variant of the case-insensitivity issue
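
A small check of the identifier behaviour, as a sketch. Note that PEP 3131 specifies NFKC rather than NFC for identifiers; for the plain accented letter used here the two forms give the same result.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as one precomposed code point
decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# The two spellings are different strings until they are normalized.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Compile an assignment whose identifier is written in decomposed form.
namespace = {}
exec(decomposed + " = 1", namespace)

# The parser normalizes identifiers, so the stored name is the composed one.
print(same_after_nfc, composed in namespace, decomposed in namespace)
```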
Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy tjre...@udel.edu added the comment:
I am not sure that everyone will agree that this is a bug, rather than a
feature request, or that if a bug, that it should be changed in existing
releases and possibly break running
Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy rep...@bugs.python.org wrote
on Fri, 12 Aug 2011 22:21:59 -:
Does the regex module handle these particular issues better?
No, it currently does not. One would have to ask Matthew directly, but I
believe
Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy tjre...@udel.edu added the comment:
However desirable it would be, I do not believe there is any claim in the
manual that the re module follows the evolving Unicode consortium r.e. stan
My from the hip thought
Tom Christiansen tchr...@perl.com added the comment:
Terry J. Reedy rep...@bugs.python.org wrote
on Fri, 12 Aug 2011 23:05:27 -:
Ouch!
Do the rejected characters qualify as identifier characters as defined
in Reference 2.3 Identifiers and keywords?
http://docs.python.org/py3k
New submission from Tom Christiansen tchr...@perl.com:
The Python re library is broken in its approach to case-insensitive matches. It
erroneously attempts to compare lowercase mappings. This is wrong. You must
compare the Unicode casefolds, not the Unicode casemaps. Otherwise you get
wrong
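
The difference between lowercase mappings and casefolds can be seen with plain strings. This is a sketch assuming Python 3.3 or later, which added str.casefold(); the helper name caseless_equal is hypothetical, and it skips the normalization step that full caseless matching also calls for.

```python
def caseless_equal(a, b):
    """Compare two strings by their Unicode casefolds, not their lowercase."""
    return a.casefold() == b.casefold()


# Comparing lowercase forms misses these pairs; comparing casefolds does not.
pairs = [("Straße", "STRASSE"), ("ΣΊΣΥΦΟΣ", "σίσυφος")]
for left, right in pairs:
    print(left, right, left.lower() == right.lower(), caseless_equal(left, right))
```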
New submission from Tom Christiansen tchr...@perl.com:
Python is in flagrant violation of the very most basic premises of Unicode
Technical Report #18 on Regular Expressions, which requires that a regex engine
support Unicode characters as basic logical units independent of serialization
like
New submission from Tom Christiansen tchr...@perl.com:
You cannot use Python's casemapping functions on Unicode data because they fail
on narrow builds. This makes it impossible to write portable code in Python
that can cope with full Unicode.
I've tried several times to submit this bug
New submission from Tom Christiansen tchr...@perl.com:
You cannot use Python's lib re for handling Unicode regular expressions because
it violates the standard set out for the same in UTS#18 on Unicode Regular
Expressions in RL1.2a on compatibility properties. What \w is allowed to match
New submission from Tom Christiansen tchr...@perl.com:
You cannot reliably use Unicode in Python identifiers because of the
narrow/wide build issue. The enclosed file is fine on wide builds but gets
compiler errors on narrow ones during compilation.
Go, Ruby, Java, and Perl all handle
Changes by Tom Christiansen tchr...@perl.com:
--
components: +Regular Expressions -Library (Lib)
type: - behavior
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12728
New submission from Tom Christiansen tchr...@perl.com:
Without proper grapheme support in the regular expression library, it is
impossible to correctly process Unicode. At the very least, one needs the \X
escape supported, which is an extended grapheme cluster per UTS#18. This escape
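
To make the grapheme idea concrete, here is a deliberately crude approximation using only the standard library. The function name approximate_clusters is hypothetical. It only attaches combining marks (nonzero canonical combining class) to the preceding character, so it is not the extended grapheme cluster that \X denotes: Hangul jamo sequences, ZWJ sequences, and spacing marks are not handled.

```python
import unicodedata


def approximate_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # A combining mark extends the cluster that is already open.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "re\u0301sume\u0301"
print(len(word), len(approximate_clusters(word)))
```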
New submission from Tom Christiansen tchr...@perl.com:
Python supports no Unicode properties in its re library, making it unsuitable
for work with Unicode. This is therefore a formal request for the Python re
library to support Unicode properties.
The eleven properties required by Unicode
New submission from Tom Christiansen tchr...@perl.com:
Python has no standard support for the Unicode Collation Algorithm as explained
in UTS #10. This is a request that a UCA library be added to the standard Python
distribution.
Collation underlies virtually everything we do with text, not just
New submission from Tom Christiansen tchr...@perl.com:
Python's casemapping functions only use what Unicode calls simple casemaps.
These are only appropriate for functions that operate on single characters
alone, not for those that operate on strings. The reason for this is that you
get much
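
For comparison, a sketch of what full casemaps do, assuming Python 3.3 or later, where the str methods apply them. A simple (one-to-one) casemap cannot produce any of the length-changing results below.

```python
# Each sample has a full case mapping that differs from its simple one.
sharp_s = "\u00df"      # LATIN SMALL LETTER SHARP S
ligature_fi = "\ufb01"  # LATIN SMALL LIGATURE FI
digraph_dz = "\u01c6"   # LATIN SMALL LETTER DZ WITH CARON

upper_sharp_s = sharp_s.upper()      # two letters, not one
title_ligature = ligature_fi.title()  # the ligature is split on titlecasing
title_digraph = digraph_dz.title()    # titlecase differs from uppercase here
upper_digraph = digraph_dz.upper()

print(upper_sharp_s, title_ligature, title_digraph, upper_digraph)
```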
New submission from Tom Christiansen tchr...@perl.com:
Python's string.title() function claims it titlecases the first letter in each
word and lowercases the rest. However, this is not true. It is not using
either of the two word detection algorithms that Unicode provides. One allows
you
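
The word-detection problem is easy to reproduce. The sketch below shows the apostrophe case and the regex workaround given in the Python documentation for str.title(); the helper name titlecase is illustrative, and its ASCII-only pattern is a simplification, not either of the Unicode word-boundary algorithms.

```python
import re


def titlecase(text):
    """Capitalize words, treating an apostrophe as part of the word."""
    return re.sub(
        r"[A-Za-z]+('[A-Za-z]+)?",
        lambda match: match.group(0).capitalize(),
        text,
    )


naive = "they're bill's friends.".title()
patched = titlecase("they're bill's friends.")
print(naive)
print(patched)
```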
Tom Christiansen tchr...@perl.com added the comment:
I've been doing a lot of testing of Matthew's regex library against UTS#18 issues,
but only somewhat incidentally testing re. To use regex, one has to accept that
certain things will work differently than they work in re, because he is
following
Tom Christiansen tchr...@perl.com added the comment:
I can attest that being able to get the columns of a grapheme cluster is very
important for printing, because you need this to do correct linebreaking.
There might be something you can steal from
http://search.cpan.org/perldoc?Unicode
Tom Christiansen tchr...@perl.com added the comment:
How does this work for modules that have filesystem names different from the
one used for import? The issue I'm thinking about is that the Mac HFS+
filesystem keeps its Unicode filenames in (close to) NFD form. That means that
a module
Tom Christiansen tchr...@perl.com added the comment:
Please do not call this utf-8-java. It is called cesu-8 per UTR#26 at:
http://unicode.org/reports/tr26/
CESU-8 is *not* a valid Unicode Transformation Format and should not be called
UTF-8. It is a real pain in the butt, caused by people who
80 matches