Martin v. Löwis wrote:
> David Hopwood schrieb:
>
>>>If you have access to "German Windows XP", "Japanese Windows XP",
>>
>>Since Win2K there is actually no such thing, from a technical point of view --
>>just Win2K or WinXP with a G
Paul Prescod wrote:
> On 9/25/06, Jim Jewett <[EMAIL PROTECTED]> wrote:
>
>> As David Hopwood pointed out, to be fully correct, you already have to
>> create a custom function even with bmp characters, because of
>> decomposed characters. (Example: Representi
Fredrik Lundh wrote:
> David Hopwood wrote:
>
>>For example, "ö" can be represented either as the precomposed character
>>U+00F6,
>>or as "o" followed by a combining diaeresis (U+006F U+0308).
>
> normalization is a good thing, though:
>
>
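Editorial illustration of the precomposed/decomposed point quoted above: a minimal sketch using Python's standard `unicodedata` module. The helper name `same_text` is hypothetical and is not taken from the thread.

```python
import unicodedata

precomposed = "\u00f6"   # "ö" as the single code point U+00F6
decomposed = "o\u0308"   # "o" followed by U+0308 COMBINING DIAERESIS

# Normalizing to a common form yields one canonical spelling of the text.
nfc = unicodedata.normalize("NFC", decomposed)    # composes to U+00F6
nfd = unicodedata.normalize("NFD", precomposed)   # decomposes to U+006F U+0308


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form.

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

Without normalization the two spellings compare unequal as code point sequences, which is why a plain `==` or `len()` can give surprising answers on such text.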
ional texts
necessarily differs from working with ASCII. There is no excuse for any
programmer doing text processing not to have read it.
Should we nevertheless try to avoid making the use of Unicode strings
unnecessarily difficult for people who have minimal knowledge
yond the wit of those editor
developers to talk to each other, or to just unilaterally support the
other editor's format as well as their own.
--
David Hopwood <[EMAIL PROTECTED]>
___
Python-3000 mailing list
Python-3000@python.org
http://ma
some locales
recently added in Windows 2000/XP, where there was no compatibility
constraint to use a non-Unicode encoding.
You're correct about the use of a BOM as a signature. All Unicode-conformant
applications should accept this use of a BOM in UTF-8 (although they need
not generate it); the s
ams, and this is
an advantage over algorithms that don't work for streams.
--
David Hopwood <[EMAIL PROTECTED]>
___
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
Unsubscribe:
http://mail
Paul Prescod wrote:
> On 9/10/06, David Hopwood <[EMAIL PROTECTED]> wrote:
>
>> ... if you think that guessing based on content is a good idea -- I
>> don't. In any case, such guessing necessarily depends on the expected file
>> format, so it should be done
Paul Prescod wrote:
> Maybe the guessing algorithm should read the WHOLE FILE.
That wouldn't work for streams (e.g. stdin). The algorithm I gave
does work for streams, provided that they have a push-back buffer of
at least 4 bytes.
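To illustrate why a 4-byte look-ahead is enough for signature checks on a stream, here is a minimal sketch that inspects only a BOM. It is not the algorithm referred to above (which is not reproduced in this message); the names `sniff_bom` and `open_sniffed` and the UTF-8 fallback are assumptions made for the example.

```python
import codecs
import io

# Longer signatures come first, so UTF-32-LE (FF FE 00 00) is not
# mistaken for UTF-16-LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(stream, default="utf-8"):
    """Return an encoding name based on a leading BOM, consuming the BOM.

    `stream` must be an io.BufferedReader. peek() does not advance the
    position, so it plays the role of the push-back buffer. Caveat: on a
    slow pipe peek() may return fewer than 4 bytes; this sketch does not
    retry. The `default` fallback is an assumption, not part of the thread.
    """
    head = stream.peek(4)[:4]
    for bom, name in _BOMS:
        if head.startswith(bom):
            stream.read(len(bom))  # skip the signature itself
            return name
    return default


def open_sniffed(raw):
    """Wrap a raw binary stream in a text reader using the sniffed encoding."""
    stream = io.BufferedReader(raw)
    return io.TextIOWrapper(stream, encoding=sniff_bom(stream))
```

Because nothing beyond the first 4 bytes is examined, the same code can be applied to `sys.stdin.buffer` as well as to regular files.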
--
David Hopwood <[EMAI
Josiah Carlson wrote:
> David Hopwood <[EMAIL PROTECTED]> wrote:
>
>>Here is a very simple, reasonably (although not completely) safe, and much
>>more predictable guessing algorithm, based on a generalization of
>><http://www.w3.org/TR/REC-xml/#sec-guessing>
g.
>>The 'additional symbolic values' should be implemented as true
>>encodings (i.e., it should be possible to look up 'site', 'guess' and
>>'locale' in the codecs registry, and replace them there as well).
>
> Treating different thing
icode -> Shift-JIS -> Unicode; the issue is whether it is encoded as
0x5C, or something else like 0x815F. It may very well not round-trip if you
use different implementations for encoding and decoding.
--
David Hopwood <[EMAIL PROTECTED]>
___
<http://wakaba-web.hp.infoseek.co.jp/table/sjis-0208-1997-std.txt>
although there is quite a bit of variation in mappings:
<http://www.haible.de/bruno/charsets/conversion-tables/Shift_JIS.html>
--
David Hopwood <[EMAIL PROTECTED]>
___
Michael Urman wrote:
> On 9/7/06, David Hopwood <[EMAIL PROTECTED]> wrote:
>
>>Yes. However, this is not a good idea for precisely the reason described
>>on that page (false detection of Unicode), and so any Unicode detection
>>algorithm in Python should only be
changes. It uses BOMs to mark all unicode encodings, but doesn't
> require them to be present in order to detect "Unicode."
> http://blogs.msdn.com/michkap/archive/2006/06/14/631016.aspx
Yes. However, this is not a good idea for precisely
David Hopwood wrote:
> Paul Prescod wrote:
>
>>Guido has asked me to do some research in aid of a file encoding
>>detection/defaulting PEP.
>>
>>I only have access to a small number of operating systems and language
>>variants so I need help.
>>
See <http://www.microsoft.com/globaldev/DrIntl/faqs/Locales.mspx>,
<http://www.microsoft.com/globaldev/reference/WinCP.mspx>, and
<http://www.microsoft.com/globaldev/reference/win2k/setup/localsupport.mspx>.
Each "language group" maps to a similarly named "ANSI" code page (a
Jim Jewett wrote:
> On 9/4/06, David Hopwood <[EMAIL PROTECTED]> wrote:
>
>> The issue is not simplicity of implementation; it is what will provide
>> the simplest usage model in the long term. If new files are encoded in X
>> just because most of a user's ex
Guido van Rossum wrote:
> On 9/4/06, David Hopwood <[EMAIL PROTECTED]> wrote:
>> Guido van Rossum wrote:
>>
>> > I've always said (can someone find a quote perhaps?) that there ought
>> > to be a sensible default encoding for files (including but
Paul Prescod wrote:
> On 9/5/06, David Hopwood <[EMAIL PROTECTED]> wrote:
>> Guido van Rossum wrote:
>> > On 9/5/06, Brian Quinlan <[EMAIL PROTECTED]> wrote:
>> > [...]
>> >
>> > That would not be doing what the user wants. We have extensi
Guido van Rossum wrote:
> On 9/5/06, David Hopwood <[EMAIL PROTECTED]> wrote:
>> Guido van Rossum wrote:
>> > On 9/5/06, Paul Prescod <[EMAIL PROTECTED]> wrote:
>> >
>> >> Beyond all of that: It just seems wrong to me that I could send
>> >
he system ("ANSI")
encoding will be Cp1252-with-Euro (which is similar enough to ISO-8859-1
if C1 control characters are not used).
--
David Hopwood <[EMAIL PROTECTED]>
David Hopwood wrote:
> I don't know about vi, but notepad will open and save files that are not in
> the system ("ANSI") encoding just fine. On opening it checks for a BOM and
> auto-detects UTF-8 and UTF-16; on saving it will write a BOM if you choose
> "Unicode"
ding and writing files in charsets that are
not the system default. So in practice the locale has to be set to the "old"
charset during a migration to UTF-8.
(Setting different locales for different applications is far too much hassle.
On Windows, although I believe it is technically possible to
as a character count.
For charsets like ISCII and ISO 2022, which are stateful and/or have
a different encoding model to Unicode, I don't believe this approach
would work very well. But it is fine to support this for some charsets
and not others.
--
David Hopwood <[EMAIL PROTECTED]>
te sequence. note it's a *byte* sequence, not chars,
> since this passes down to layer 1 transparently.
That isn't what is required; for big-endian UCS-2 or UTF-16, "\x00\x0a"
should only be recognized as LF if it is at an even byte position.
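A minimal sketch of the alignment check described above, assuming the byte string starts on a code unit boundary. The function name `find_lf_utf16be` is hypothetical.

```python
def find_lf_utf16be(data, start=0):
    """Return the byte offset of the first LF code unit (00 0A) in
    big-endian UTF-16/UCS-2 data, or -1 if there is none.

    Only matches at even offsets count; a 00 0A pair at an odd offset
    straddles two code units and is not a line feed.
    """
    pos = data.find(b"\x00\x0a", start)
    while pos != -1:
        if pos % 2 == 0:
            return pos
        pos = data.find(b"\x00\x0a", pos + 1)
    return -1


# U+0100 U+0A41 encodes as 01 00 0A 41, so a naive byte search finds
# "\x00\x0a" at offset 1 even though neither character is a line feed.
sample = "\u0100\u0a41\n".encode("utf-16-be")
naive_offset = sample.find(b"\x00\x0a")
aligned_offset = find_lf_utf16be(sample)
```

Here the naive search reports offset 1, while the aligned search reports offset 4, where the real LF begins.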
--
David Hopwood <[EMAIL PRO
eclared
as PyObject *.)
The 'operation' string is sometimes a gerund ("slicing", etc.) and sometimes
the name of a method. This should be more consistent.
> + WARN_LIST_USAGE(a, PY_REMAIN_LIST, "repitition");
"repetition"
--
David Hopwood
d = s.find("}", posarg)
> except ValueError:
> break
try:
posstart = s.index("{", pos)
posarg = s.index(" ", posstart)
posend = s.find("}", posarg)
except ValueEr