Re: Python 2.6 StreamReader.readline()

2012-07-25 Thread Walter Dörwald

On 25.07.12 08:09, Ulrich Eckhardt wrote:


Am 24.07.2012 17:01, schrieb cpppw...@gmail.com:

 reader = codecs.getreader(encoding)
 lines = []
 with open(filename, 'rb') as f:
     lines = reader(f, 'strict').readlines(keepends=False)

where encoding == 'utf-16-be'
Everything works fine, except that lines[0] is equal to
codecs.BOM_UTF16_BE
Is this behaviour correct, that the BOM is still present?


Yes, assuming the first line only contains that BOM. Technically it's a
space character, and why should those be removed?


If the first "character" in the file is a BOM, the file encoding is 
probably not utf-16-be but utf-16 (which detects and consumes the BOM).
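
A quick way to see the difference (Python 3 spelling; Python 2.6's unicode/str split changes the literals, but the codecs behave the same way in this respect):

```python
import codecs

# A UTF-16-BE stream that starts with a BOM.
data = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

# 'utf-16-be' treats the BOM as an ordinary character (U+FEFF)...
assert data.decode("utf-16-be") == "\ufeffhi"
# ...while 'utf-16' uses it to detect byte order and strips it.
assert data.decode("utf-16") == "hi"
```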


Servus,
   Walter

--
http://mail.python.org/mailman/listinfo/python-list


Re: Issues with `codecs.register` and `codecs.CodecInfo` objects

2012-07-10 Thread Walter Dörwald

On 07.07.12 04:56, Steven D'Aprano wrote:


On Fri, 06 Jul 2012 12:55:31 -0400, Karl Knechtel wrote:


Hello all,

While attempting to make a wrapper for opening multiple types of
UTF-encoded files (more on that later, in a separate post, I guess), I
ran into some oddities with the `codecs` module, specifically to do with
`.register` ing `CodecInfo` objects. I'd like to report a bug or
something, but there are several intertangled issues here and I'm not
really sure how to report it so I thought I'd open the discussion.
Apologies in advance if I get a bit rant-y, and a warning that this is
fairly long.

[...]

Yes, it's a strangely indirect API, and yes it looks like you have
identified a whole bucket full of problems with it. And no, I don't know
why that API was chosen.


This API was chosen for backwards compatibility reasons when incremental 
encoders/decoders were introduced (in 2006).


And yes: We missed the opportunity to clean that up to always use CodecInfo.


Changing to a cleaner, more direct (sensible?) API would be a fairly big
step. If you want to pursue this, the steps I recommend you take are:

1) understanding the reason for the old API (search the Internet
and particularly the python-...@python.org archives);


See e.g. http://mail.python.org/pipermail/patches/2006-March/019122.html


2) have a plan for how to avoid breaking code that relies on the
existing API;

3) raise the issue on python-id...@python.org to gather feedback
and see how much opposition or support it is likely to get;
they'll suggest whether a bug report is sufficient or if you'll
need a PEP;

http://www.python.org/dev/peps/


If you can provide a patch and a test suite, you will have a much better
chance of pushing it through. If not, you are reliant on somebody else
being interested enough to do the work.

And one last thing: any new functionality will simply *not* be considered
for Python 2.x. Aim for Python 3.4, since the 2.x series is now in bug-
fix only maintenance mode and the 3.3 beta is no longer accepting new
functionality, only bug fixes.


Servus,
   Walter


Re: Why are some unicode error handlers "encode only"?

2012-03-11 Thread Walter Dörwald

On 11.03.12 15:37, Steven D'Aprano wrote:


At least two standard error handlers are documented as working for
encoding only:

xmlcharrefreplace
backslashreplace

See http://docs.python.org/library/codecs.html#codec-base-classes

and http://docs.python.org/py3k/library/codecs.html

Why is this? I don't see why they shouldn't work for decoding as well.


Because xmlcharrefreplace and backslashreplace are *error* handlers. 
However, the byte sequence b'&#12345;' does *not* contain any bytes that 
are not decodable for e.g. the ASCII codec. So there are no errors to 
handle.



Consider this example using Python 3.2:


b"aaa--\xe9z--\xe9!--bbb".decode("cp932")

Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
illegal multibyte sequence

The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also
known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
or can't be supported?


The byte sequence b'\xe9!' however is not something that would have been 
produced by the backslashreplace error handler. b'\\xe9!' (a sequence 
containing 5 bytes) would have been (and this probably would decode 
without any problems with the cp932 codec).
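
To make the byte counting concrete (Python 3 syntax): backslashreplace escapes the unencodable U+00E9 into four ASCII bytes, so together with the literal '!' the output is five bytes, none of which is an error for an ASCII-compatible codec:

```python
out = "\xe9!".encode("ascii", "backslashreplace")

assert out == b"\\xe9!"   # backslash, x, e, 9, bang
assert len(out) == 5
# Decodes without needing any error handler at all.
assert out.decode("ascii") == "\\xe9!"
```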



# This doesn't actually work.
b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
=>  r'aaa--騷--\xe9\x21--bbb'

and similarly for xmlcharrefreplace.


This would require a post-processing step *after* the bytes have been 
decoded. That is IMHO out of scope for Python's codec machinery.
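
For what it's worth, later Python versions did add this: since Python 3.5, backslashreplace also works as a *decode* error handler, escaping the offending bytes instead of raising:

```python
# Python 3.5+: backslashreplace is accepted on decoding too.
result = b"aaa--\xe9--bbb".decode("ascii", "backslashreplace")
assert result == "aaa--\\xe9--bbb"
```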


Servus,
   Walter



Re: replacing words in HTML file

2010-04-30 Thread Walter Dörwald
On 28.04.10 15:02, james_027 wrote:
> hi,
> 
> Any idea how I can replace words in an HTML file? Meaning only the
> content should get replaced, while the HTML tags, javascript, & css
> remain untouched.

You could try XIST (http://www.livinglogic.de/Python/xist/):

Example code:

from ll.xist import xsc, parsers

def p2p(node, converter):
    if isinstance(node, xsc.Text):
        node = node.replace("Python", "Parrot")
        node = node.replace("python", "parrot")
    return node

node = parsers.parseurl("http://www.python.org/", tidy=True)

node = node.mapped(p2p)
node.write(open("parrot_index.html", "wb"))


Hope that helps!

Servus,
   Walter


Re: how to write a unicode string to a file ?

2009-10-19 Thread Walter Dörwald
On 17.10.09 08:28, Mark Tolonen wrote:
> 
> "Kee Nethery"  wrote in message
> news:aaab63c6-6e44-4c07-b119-972d4f49e...@kagi.com...
>>
>> On Oct 16, 2009, at 5:49 PM, Stephen Hansen wrote:
>>
>>> On Fri, Oct 16, 2009 at 5:07 PM, Stef Mientki 
>>>  wrote:
>>
>> snip
>>
>>> The thing is, I'd be VERY surprised (nay, shocked!) if Excel can't
>>> open a file that is in UTF8 -- it just might need to be TOLD that it's
>>> utf8 when you go and open the file, as UTF8 looks just like ASCII -- 
>>> until it contains characters that can't be expressed in ASCII. But I
>>> don't know what type of file it is you're saving.
>>
>> We found that UTF-16 was required for Excel. It would not "do the 
>> right thing" when presented with UTF-8.
> 
> Excel seems to expect a UTF-8-encoded BOM (byte order mark) to correctly
> decide a file is written in UTF-8.  This worked for me:
> 
> f=codecs.open('test.csv','wb','utf-8')
> f.write(u'\ufeff') # write a BOM
> f.write(u'马克,testing,123\r\n')
> f.close()

That can also be done with the utf-8-sig codec (which adds a BOM at the
start on writing):

f = codecs.open('test.csv','wb','utf-8-sig')
f.write(u'马克,testing,123\r\n')
f.close()

See http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig
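
The BOM handling is symmetric: utf-8-sig prepends the three BOM bytes on encoding and strips them (if present) on decoding. A minimal check in Python 3 syntax:

```python
assert "x".encode("utf-8-sig") == b"\xef\xbb\xbfx"   # BOM added on encode
assert b"\xef\xbb\xbfx".decode("utf-8-sig") == "x"   # BOM stripped on decode
assert b"x".decode("utf-8-sig") == "x"               # a missing BOM is fine too
```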

Servus,
   Walter


Re: HTMLgen???

2009-10-16 Thread Walter Dörwald
On 16.10.09 05:44, alex23 wrote:
> On Oct 15, 6:58 pm, an...@vandervlies.xs4all.nl wrote:
>> Does HTMLgen (Robin Friedrich's) still exsist?? And, if so, where can it
>> be found?
> 
> If you're after an easy to use html generator, I highly recommend
> Richard Jones' html[1] lib. It's new, supported and makes very nice
> use of context managers.
> 
> [1]: http://pypi.python.org/pypi/html

Another alternative is XIST at http://www.livinglogic.de/Python/xist/
which supports more than simple HTML. Examples can be found here:
http://www.livinglogic.de/Python/xist/Examples.html

Servus,
   Walter


Re: unicode issue

2009-10-01 Thread Walter Dörwald
On 01.10.09 17:50, Rami Chowdhury wrote:
> On Thu, 01 Oct 2009 08:10:58 -0700, Walter Dörwald
>  wrote:
> 
>> On 01.10.09 16:09, Hyuga wrote:
>>> On Sep 30, 3:34 am, gentlestone  wrote:
>>>> Why doesn't this code work on Python 2.6? Or how can I do this job?
>>>>
>>>> [snip _MAP]
>>>>
>>>> def downcode(name):
>>>>     """
>>>>     >>> downcode(u"Žabovitá zmiešaná kaša")
>>>>     u'Zabovita zmiesana kasa'
>>>>     """
>>>>     for key, value in _MAP.iteritems():
>>>>         name = name.replace(key, value)
>>>>     return name
>>>
>>> Though C Python is pretty optimized under the hood for this sort of
>>> single-character replacement, this still seems pretty inefficient
>>> since you're calling replace for every character you want to map.  I
>>> think that a better approach might be something like:
>>>
>>> def downcode(name):
>>>     return ''.join(_MAP.get(c, c) for c in name)
>>>
>>> Or using string.translate:
>>>
>>> import string
>>> def downcode(name):
>>>     table = string.maketrans(
>>>         'ÀÁÂÃÄÅ...',
>>>         'AA...')
>>>     return name.translate(table)
>>
>> Or even simpler:
>>
>> import unicodedata
>>
>> def downcode(name):
>>     return unicodedata.normalize("NFD", name)\
>>         .encode("ascii", "ignore")\
>>         .decode("ascii")
>>
>> Servus,
>>Walter
> 
> As I understand it, the "ignore" argument to str.encode *removes* the
> undecodable characters, rather than replacing them with an ASCII
> approximation. Is that correct? If so, wouldn't that rather defeat the
> purpose?

Yes, but any accented characters have been split into the base character
and the combining accent via normalize() before, so only the accent gets
removed. Of course non-decomposable characters will be removed
completely, but it would be possible to replace

   .encode("ascii", "ignore").decode("ascii")

with something like this:

   u"".join(c for c in name if unicodedata.category(c) != "Mn")
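
A quick check of the intended filter (Python 3 spelling): after NFD normalization, dropping every character whose category is "Mn" (the combining marks) and keeping the rest yields the undecorated text:

```python
import unicodedata

name = unicodedata.normalize("NFD", "Žabovitá")
stripped = "".join(c for c in name if unicodedata.category(c) != "Mn")
assert stripped == "Zabovita"
```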

Servus,
   Walter


Re: unicode issue

2009-10-01 Thread Walter Dörwald
On 01.10.09 16:09, Hyuga wrote:
> On Sep 30, 3:34 am, gentlestone  wrote:
>> Why doesn't this code work on Python 2.6? Or how can I do this job?
>>
>> _MAP = {
>> # LATIN
>> u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A',
>> u'Æ': 'AE', u'Ç':'C',
>> u'È': 'E', u'É': 'E', u'Ê': 'E', u'Ë': 'E', u'Ì': 'I', u'Í': 'I',
>> u'Î': 'I',
>> u'Ï': 'I', u'Ð': 'D', u'Ñ': 'N', u'Ò': 'O', u'Ó': 'O', u'Ô': 'O',
>> u'Õ': 'O', u'Ö':'O',
>> u'Ő': 'O', u'Ø': 'O', u'Ù': 'U', u'Ú': 'U', u'Û': 'U', u'Ü': 'U',
>> u'Ű': 'U',
>> u'Ý': 'Y', u'Þ': 'TH', u'ß': 'ss', u'à':'a', u'á':'a', u'â': 'a',
>> u'ã': 'a', u'ä':'a',
>> u'å': 'a', u'æ': 'ae', u'ç': 'c', u'è': 'e', u'é': 'e', u'ê': 'e',
>> u'ë': 'e',
>> u'ì': 'i', u'í': 'i', u'î': 'i', u'ï': 'i', u'ð': 'd', u'ñ': 'n',
>> u'ò': 'o', u'ó':'o',
>> u'ô': 'o', u'õ': 'o', u'ö': 'o', u'ő': 'o', u'ø': 'o', u'ù': 'u',
>> u'ú': 'u',
>> u'û': 'u', u'ü': 'u', u'ű': 'u', u'ý': 'y', u'þ': 'th', u'ÿ': 'y',
>> # LATIN_SYMBOLS
>> u'©':'(c)',
>> # GREEK
>> u'α':'a', u'β':'b', u'γ':'g', u'δ':'d', u'ε':'e', u'ζ':'z',
>> u'η':'h', u'θ':'8',
>> u'ι':'i', u'κ':'k', u'λ':'l', u'μ':'m', u'ν':'n', u'ξ':'3',
>> u'ο':'o', u'π':'p',
>> u'ρ':'r', u'σ':'s', u'τ':'t', u'υ':'y', u'φ':'f', u'χ':'x',
>> u'ψ':'ps', u'ω':'w',
>> u'ά':'a', u'έ':'e', u'ί':'i', u'ό':'o', u'ύ':'y', u'ή':'h',
>> u'ώ':'w', u'ς':'s',
>> u'ϊ':'i', u'ΰ':'y', u'ϋ':'y', u'ΐ':'i',
>> u'Α':'A', u'Β':'B', u'Γ':'G', u'Δ':'D', u'Ε':'E', u'Ζ':'Z',
>> u'Η':'H', u'Θ':'8',
>> u'Ι':'I', u'Κ':'K', u'Λ':'L', u'Μ':'M', u'Ν':'N', u'Ξ':'3',
>> u'Ο':'O', u'Π':'P',
>> u'Ρ':'R', u'Σ':'S', u'Τ':'T', u'Υ':'Y', u'Φ':'F', u'Χ':'X',
>> u'Ψ':'PS', u'Ω':'W',
>> u'Ά':'A', u'Έ':'E', u'Ί':'I', u'Ό':'O', u'Ύ':'Y', u'Ή':'H',
>> u'Ώ':'W', u'Ϊ':'I', u'Ϋ':'Y',
>> # TURKISH
>> u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C',
>> u'ü':'u', u'Ü':'U',
>> u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G',
>> # RUSSIAN
>> u'а':'a', u'б':'b', u'в':'v', u'г':'g', u'д':'d', u'е':'e',
>> u'ё':'yo', u'ж':'zh',
>> u'з':'z', u'и':'i', u'й':'j', u'к':'k', u'л':'l', u'м':'m',
>> u'н':'n', u'о':'o',
>> u'п':'p', u'р':'r', u'с':'s', u'т':'t', u'у':'u', u'ф':'f',
>> u'х':'h', u'ц':'c',
>> u'ч':'ch', u'ш':'sh', u'щ':'sh', u'ъ':'', u'ы':'y', u'ь':'',
>> u'э':'e', u'ю':'yu', u'я':'ya',
>> u'А':'A', u'Б':'B', u'В':'V', u'Г':'G', u'Д':'D', u'Е':'E',
>> u'Ё':'Yo', u'Ж':'Zh',
>> u'З':'Z', u'И':'I', u'Й':'J', u'К':'K', u'Л':'L', u'М':'M',
>> u'Н':'N', u'О':'O',
>> u'П':'P', u'Р':'R', u'С':'S', u'Т':'T', u'У':'U', u'Ф':'F',
>> u'Х':'H', u'Ц':'C',
>> u'Ч':'Ch', u'Ш':'Sh', u'Щ':'Sh', u'Ъ':'', u'Ы':'Y', u'Ь':'',
>> u'Э':'E', u'Ю':'Yu', u'Я':'Ya',
>> # UKRAINIAN
>> u'Є':'Ye', u'І':'I', u'Ї':'Yi', u'Ґ':'G', u'є':'ye', u'і':'i',
>> u'ї':'yi', u'ґ':'g',
>> # CZECH
>> u'č':'c', u'ď':'d', u'ě':'e', u'ň':'n', u'ř':'r', u'š':'s',
>> u'ť':'t', u'ů':'u',
>> u'ž':'z', u'Č':'C', u'Ď':'D', u'Ě':'E', u'Ň':'N', u'Ř':'R',
>> u'Š':'S', u'Ť':'T', u'Ů':'U', u'Ž':'Z',
>> # POLISH
>> u'ą':'a', u'ć':'c', u'ę':'e', u'ł':'l', u'ń':'n', u'ó':'o',
>> u'ś':'s', u'ź':'z',
>> u'ż':'z', u'Ą':'A', u'Ć':'C', u'Ę':'e', u'Ł':'L', u'Ń':'N',
>> u'Ó':'o', u'Ś':'S',
>> u'Ź':'Z', u'Ż':'Z',
>> # LATVIAN
>> u'ā':'a', u'č':'c', u'ē':'e', u'ģ':'g', u'ī':'i', u'ķ':'k',
>> u'ļ':'l', u'ņ':'n',
>> u'š':'s', u'ū':'u', u'ž':'z', u'Ā':'A', u'Č':'C', u'Ē':'E',
>> u'Ģ':'G', u'Ī':'i',
>> u'Ķ':'k', u'Ļ':'L', u'Ņ':'N', u'Š':'S', u'Ū':'u', u'Ž':'Z'
>>
>> }
>>
>> def downcode(name):
>>     """
>>     >>> downcode(u"Žabovitá zmiešaná kaša")
>>     u'Zabovita zmiesana kasa'
>>     """
>>     for key, value in _MAP.iteritems():
>>         name = name.replace(key, value)
>>     return name
> 
> Though C Python is pretty optimized under the hood for this sort of
> single-character replacement, this still seems pretty inefficient
> since you're calling replace for every character you want to map.  I
> think that a better approach might be something like:
> 
> def downcode(name):
>     return ''.join(_MAP.get(c, c) for c in name)
> 
> Or using string.translate:
> 
> import string
> def downcode(name):
>     table = string.maketrans(
>         'ÀÁÂÃÄÅ...',
>         'AA...')
>     return name.translate(table)

Or even simpler:

import unicodedata

def downcode(name):
    return unicodedata.normalize("NFD", name)\
        .encode("ascii", "ignore")\
        .decode("ascii")
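
In Python 3 terms (where str is already unicode) the same approach looks like this, and reproduces the docstring example from the original post:

```python
import unicodedata

def downcode(name):
    # Decompose accented characters, then drop the non-ASCII combining
    # marks via the "ignore" error handler.
    return unicodedata.normalize("NFD", name).encode("ascii", "ignore").decode("ascii")

assert downcode("Žabovitá zmiešaná kaša") == "Zabovita zmiesana kasa"
```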

Servus,
   Walter


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote:
>> "correct" -> "corrected"
> 
> Thanks, fixed.
> 
>>> To convert non-decodable bytes, a new error handler "python-escape" is
>>> introduced, which decodes non-decodable bytes into a private-use
>>> character U+F01xx, which is believed to not conflict with private-use
>>> characters that currently exist in Python codecs.
>> Would this mean that real private use characters in the file name would
>> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
>> any error handler.
> 
> The python-escape codec is only used/meaningful if the env encoding
> is not UTF-8. For any other encoding, it is assumed that no character
> actually maps to the private-use characters.

Which should be true for any encoding from the pre-unicode era, but not
for UTF-16/32 and variants.

>>> The error handler interface is extended to allow the encode error
>>> handler to return byte strings immediately, in addition to returning
>>> Unicode strings which then get encoded again.
>> Then the error callback for encoding would become specific to the target
>> encoding.
> 
> Why would it become specific? It can work the same way for any encoding:
> take U+F01xx, and generate the byte xx.

If an error callback emits bytes, those byte sequences must be legal in
the target encoding, so the callback becomes specific to that encoding.

However for the normal use of this error handler this might be
irrelevant, because those filenames that get encoded were constructed in
such a way that reencoding them regenerates the original byte sequence.

>>> If the locale's encoding is UTF-8, the file system encoding is set to
>>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>> Is this done by the codec, or the error handler? If it's done by the
>> codec I don't see a reason for the "python-escape" error handler.
> 
> utf-8b is a new codec. However, the utf-8b codec is only used if the
> env encoding would otherwise be utf-8. For utf-8b, the error handler
> is indeed unnecessary.

Wouldn't it make more sense to be consistent about how non-decodable
bytes get decoded? I.e. should the utf-8b codec decode those bytes to PUA
characters too (and refuse to encode them, so that the error handler
outputs them)?

>>> While providing a uniform API to non-decodable bytes, this interface
>>> has the limitation that chosen representation only "works" if the data
>>> get converted back to bytes with the python-escape error handler
>>> also.
>> I thought the error handler would be used for decoding.
> 
> It's used in both directions: for decoding, it converts \xXX to
> U+F01XX. For encoding, U+F01XX will trigger an error, which is then
> handled by the handler to produce \xXX.

But only for non-UTF8 encodings?
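
For context: PEP 383 was accepted for Python 3.1, where the utf-8b behaviour described above shipped as the "surrogateescape" error handler. A minimal round-trip sketch:

```python
# The undecodable byte 0xFF becomes the half surrogate U+DCFF on decoding,
# and encoding with the same handler regenerates the original byte.
raw = b"a\xffb"
s = raw.decode("utf-8", "surrogateescape")
assert s == "a\udcffb"
assert s.encode("utf-8", "surrogateescape") == raw
```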

Servus,
   Walter


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote:

> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.
> 
> Regards,
> Martin
> 
> PEP: 383
> Title: Non-decodable Bytes in System Character Interfaces
> Version: $Revision: 71793 $
> Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
> Author: Martin v. Löwis 
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 22-Apr-2009
> Python-Version: 3.1
> Post-History:
> 
> Abstract
> 
> 
> File names, environment variables, and command line arguments are
> defined as being character data in POSIX; the C APIs however allow
> passing arbitrary bytes - whether these conform to a certain encoding
> or not. This PEP proposes a means of dealing with such irregularities
> by embedding the bytes in character strings in such a way that allows
> recreation of the original byte string.
> 
> Rationale
> =
> 
> The C char type is a data type that is commonly used to represent both
> character data and bytes. Certain POSIX interfaces are specified and
> widely understood as operating on character data, however, the system
> call interfaces make no assumption on the encoding of these data, and
> pass them on as-is. With Python 3, character strings use a
> Unicode-based internal representation, making it difficult to ignore
> the encoding of byte strings in the same way that the C interfaces can
> ignore the encoding.
> 
> On the other hand, Microsoft Windows NT has correct the original

"correct" -> "corrected"

> design limitation of Unix, and made it explicit in its system
> interfaces that these data (file names, environment variables, command
> line arguments) are indeed character data, by providing a
> Unicode-based API (keeping a C-char-based one for backwards
> compatibility).
> 
> [...]
> 
> Specification
> =
> 
> On Windows, Python uses the wide character APIs to access
> character-oriented APIs, allowing direct conversion of the
> environmental data to Python str objects.
> 
> On POSIX systems, Python currently applies the locale's encoding to
> convert the byte data to Unicode. If the locale's encoding is UTF-8,
> it can represent the full set of Unicode characters, otherwise, only a
> subset is representable. In the latter case, using private-use
> characters to represent these bytes would be an option. For UTF-8,
> doing so would create an ambiguity, as the private-use characters may
> regularly occur in the input also.
> 
> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.

Would this mean that real private use characters in the file name would
raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
any error handler.

> The error handler interface is extended to allow the encode error
> handler to return byte strings immediately, in addition to returning
> Unicode strings which then get encoded again.

Then the error callback for encoding would become specific to the target
encoding. Would this mean that the handler checks which encoding is used
and behaves like "strict" if it doesn't recognize the encoding?

> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Is this done by the codec, or the error handler? If it's done by the
codec I don't see a reason for the "python-escape" error handler.

> Discussion
> ==
> 
> While providing a uniform API to non-decodable bytes, this interface
> has the limitation that chosen representation only "works" if the data
> get converted back to bytes with the python-escape error handler
> also.

I thought the error handler would be used for decoding.

> Encoding the data with the locale's encoding and the (default)
> strict error handler will raise an exception, encoding them with UTF-8
> will produce non-sensical data.
> 
> For most applications, we assume that they eventually pass data
> received from a system interface back into the same system
> interfaces. For example, and application invoking os.listdir() will

"and" -> "an"

> likely pass the result strings back into APIs like os.stat() or
> open(), which then encodes them back into their original byte
> representation. Applications that need to process the original byte
> strings can obtain them by encoding the character strings with the
> file system encoding, passing "python-escape" as the error handler
> name.

Servus,
   Walter


Re: [2.5.1] ShiftJIS to Unicode?

2008-11-27 Thread Walter Dörwald
Gilles Ganault wrote:
> Hello
> 
>   I'm trying to read pages from Amazon JP, whose web pages are
> supposed to be encoded in ShiftJIS, and decode contents into Unicode
> to keep Python happy:
> 
> www.amazon.co.jp
> <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS" />
> 
> But this doesn't work:
> 
> ==
> m = try.search(the_page)
> if m:
>   #UnicodeEncodeError: 'charmap' codec can't encode characters in
> position 49-55: character maps to <undefined>
>   title = m.group(1).decode('shift_jis').strip()
> ==

There's something fishy going on: You're calling the decode method and
get a UnicodeEncodeError. This means that you're calling the decode
method on something that already *is* unicode. What does

   print type(m.group(1))

output?
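
The trap here is specific to Python 2: calling .decode() on a unicode object first encodes it with the default ASCII codec, which is why a *decode* call raises a UnicodeEncodeError. Python 3 removed the trap entirely, as this check (run under Python 3) shows:

```python
# In Python 3 only bytes can be decoded and only str can be encoded,
# so the Python 2 unicode.decode() confusion cannot happen.
assert hasattr(b"x", "decode") and not hasattr(b"x", "encode")
assert hasattr("x", "encode") and not hasattr("x", "decode")
```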

Servus,
   Walter



Re: ANN: XML builder for Python

2008-07-03 Thread Walter Dörwald

Jonas Galvez wrote:

Walter Dörwald wrote:

XIST has been using with blocks since version 3.0.
[...]
with xsc.Frag() as node:
    +xml.XML()
    +html.DocTypeXHTML10transitional()
    with html.html():
        [...]


Sweet! I don't like having to use the unary operator tho, I wanted
something as simple as possible, so I wouldn't even have to assign a
variable on the with block ("as something").


You only have to assign the node a name in the outermost with block so 
that you can use the node object afterwards. But of course you can 
always implement the outermost __enter__/__exit__ in such a way that 
the node gets written to an output stream immediately.



I plan to add some
validation and error checking, but for generating feeds for my Atom
store it's reasonably fast and lean (just over 50 lines of code).


Servus,
   Walter


Re: ANN: XML builder for Python

2008-07-03 Thread Walter Dörwald

Stefan Behnel wrote:

Hi,

Walter Dörwald wrote:

XIST has been using with blocks since version 3.0.

Take a look at:
http://www.livinglogic.de/Python/xist/Examples.html


from __future__ import with_statement

from ll.xist import xsc
from ll.xist.ns import html, xml, meta

with xsc.Frag() as node:
    +xml.XML()
    +html.DocTypeXHTML10transitional()
    with html.html():
        with html.head():
            +meta.contenttype()
            +html.title("Example page")
        with html.body():
            +html.h1("Welcome to the example page")
            with html.p():
                +xsc.Text("This example page has a link to the ")
                +html.a("Python home page", href="http://www.python.org/")
                +xsc.Text(".")

print node.conv().bytes(encoding="us-ascii")


Interesting. Is the "+" actually required? Are there other operators that make
sense here? I do not see what "~" or "-" could mean.


Of course the node constructor could append the node to the currently 
active element. However there might be cases where you want to do 
something else with the newly created node, so always appending the node 
is IMHO the wrong thing.


> Are there other operators that make
> sense here? I do not see what "~" or "-" could mean.
>
> Or is it just a technical constraint?

You need *one* operator/method that appends a node to the currently 
active block without opening another new block. This operator should be 
short to type and should have the right connotations. I find that unary 
+ is perfect for that.


> I'm asking because I consider adding such a syntax to lxml as a separate
> module. And I'd prefer copying an existing syntax over a (badly) home 
grown one.


"Existing syntax" might be a little exaggeration, I know of no other 
Python package that uses __pos__ for something similar. (But then again, 
I know of no other Python package that uses with block for generating 
XML ;)).


Servus,
   Walter



Re: ANN: XML builder for Python

2008-07-03 Thread Walter Dörwald

Stefan Behnel wrote:

Stefan Behnel wrote:

Jonas Galvez wrote:

Not sure if it's been done before, but still...

Obviously ;)

http://codespeak.net/lxml/tutorial.html#the-e-factory

... and tons of other tools that generate XML, check PyPI.


Although it might be the first time I see the with statement "misused" for
this. :)


XIST has been using with blocks since version 3.0.

Take a look at:
http://www.livinglogic.de/Python/xist/Examples.html


from __future__ import with_statement

from ll.xist import xsc
from ll.xist.ns import html, xml, meta

with xsc.Frag() as node:
    +xml.XML()
    +html.DocTypeXHTML10transitional()
    with html.html():
        with html.head():
            +meta.contenttype()
            +html.title("Example page")
        with html.body():
            +html.h1("Welcome to the example page")
            with html.p():
                +xsc.Text("This example page has a link to the ")
                +html.a("Python home page", href="http://www.python.org/")
                +xsc.Text(".")

print node.conv().bytes(encoding="us-ascii")

Servus,
   Walter


Re: convert xhtml back to html

2008-04-24 Thread Walter Dörwald

Arnaud Delobelle wrote:

"Tim Arnold" <[EMAIL PROTECTED]> writes:

hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to 
create CHM files. That application really hates xhtml, so I need to convert 
self-ending tags (e.g. <br />) to plain html (e.g. <br>).


Seems simple enough, but I'm having some trouble with it. regexps trip up 
because I also have to take into account 'img', 'meta', 'link' tags, not 
just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do 
that with regexps, but my simpleminded <[^>)]+/> doesn't work. I'm not 
enough of a regexp pro to figure out that lookahead stuff.


Hi, I'm not sure if this is very helpful but the following works on
the very simple example below.


import re
xhtml = 'hello <br/> spam <hr/> bye <img/>'
xtag = re.compile(r'<([^>]*?)/>')
xtag.sub(r'<\1>', xhtml)

'hello <br> spam <hr> bye <img>'


You might try XIST (http://www.livinglogic.de/Python/xist):

Code looks like this:

from ll.xist import parsers
from ll.xist.ns import html

xhtml = 'hello <br/> spam <hr/> bye <img/>'

doc = parsers.parsestring(xhtml)
print doc.bytes(xhtml=0)

This outputs:

hello <br> spam <hr> bye <img>

(and a warning that the alt attribute is missing in the img ;))

Servus,
   Walter



Re: Generating HTML

2007-09-12 Thread Walter Dörwald
Sebastian Bassi wrote:

> Hello,
> 
> What are people using these days to generate HTML? I still use
> HTMLgen, but I want to know if there are new options. I don't
> want/need a web-framework a la Zope, just want to produce valid HTML
> from Python.

If you want something that works similar to HTMLgen, you could use XIST:
http://www.livinglogic.de/Python/xist/

Servus,
   Walter


Re: Replacement for HTMLGen?

2007-05-04 Thread Walter Dörwald
Joshua J. Kugler wrote:
> I realize that in today's MVC-everything world, the mere mention of
> generating HTML in the script is near heresy, but for now, it's what I need
> to do. :)
> 
> That said, can someone recommend a good replacement for HTMLGen?  I've found
> good words about it (http://www.linuxjournal.com/article/2986), but every
> reference to it I find points to a non-existent page
> (http://starship.python.net/lib.html is 404,
> http://www.python2.net/lib.html is not responding,
> http://starship.python.net/crew/friedrich/HTMLgen/html/main.html is 404)
> Found http://www.python.org/ftp/python/contrib-09-Dec-1999/Network/, but
> that seems a bit old.
> 
> I found http://dustman.net/andy/python/HyperText, but it's not listed in
> Cheeseshop, and its latest release is over seven years ago.  Granted, I
> know HTML doesn't change (much) but it's at least nice to know something
> you're going to be using is maintained.
> 
> Any suggestions or pointers?

You might try XIST:
http://www.livinglogic.de/Python/xist/

Hope that helps!

Servus,
   Walter


Re: Unicode error handler

2007-01-31 Thread Walter Dörwald
[EMAIL PROTECTED] wrote:
> On Jan 30, 11:28 pm, Walter Dörwald <[EMAIL PROTECTED]> wrote:
> 
>> codecs.register_error("transliterate", transliterate)
>>
>>Walter
> 
> Really, really slick solution.
> Though, why was it [:1], not [0]? ;-)

No particular reason, unicodedata.normalize("NFD", ...) should never
return an empty string.

> And one more thing:
>> def transliterate(exc):
>> if not isinstance(exc, UnicodeEncodeError):
>> raise TypeError("don'ty know how to handle %r" % r)
> I don't understand what %r and r are and where they are from. The man
> 3 printf page doesn't have %r formatting.

%r means format the repr() result, and r was supposed to be exc. ;)

Servus,
   Walter


Re: Unicode error handler

2007-01-31 Thread Walter Dörwald
Martin v. Löwis wrote:

> Walter Dörwald schrieb:
>> You might try the following:
>>
>> # -*- coding: iso-8859-1 -*-
>>
>> import unicodedata, codecs
>>
>> def transliterate(exc):
>>  if not isinstance(exc, UnicodeEncodeError):
>>  raise TypeError("don'ty know how to handle %r" % r)
>>  return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
>> exc.start+1)
> 
> I think a number of special cases need to be studied here.
> I would expect that this is "semantically correct" if the characters
> being dropped are combining characters (at least in the languages I'm
> familiar with, it is common to drop them for transliteration).

True, it might make sense to limit the error handler to handling latin 
characters.

> However, if you do
> 
> py> for i in range(65536):
> ...   c = unicodedata.normalize("NFD", unichr(i))
> ...   for c2 in c[1:]:
> ... if not unicodedata.combining(c2): print hex(i),;break
> 
> you'll see that there are many characters which don't decompose
> into a base character + sequence of combining characters. In
> particular, this involves all hangul syllables (U+AC00..U+D7A3),
> for which it is just incorrect to drop the "jungseongs"
> (is that proper wording?).

Of course the above error handler only makes sense when the decomposed 
codepoints are encodable in the target encoding. For your hangul example 
neither u"\uac00" nor the decomposed version u"\u1100\u1161" is encodable.
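
This is easy to verify: the NFD decomposition of a hangul syllable consists of jamo that are neither combining marks nor ASCII-encodable, so the whole syllable silently disappears under the encode/decode trick:

```python
import unicodedata

decomposed = unicodedata.normalize("NFD", "\uac00")  # HANGUL SYLLABLE GA
assert decomposed == "\u1100\u1161"                  # choseong + jungseong
# The jungseong is not a combining mark (combining class 0)...
assert all(unicodedata.combining(c) == 0 for c in decomposed)
# ...and nothing here is ASCII, so "ignore" drops the syllable entirely.
assert decomposed.encode("ascii", "ignore") == b""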

> There are also some cases which I'm completely uncertain about,
> e.g. ORIYA VOWEL SIGN AI decomposes to ORIYA VOWEL SIGN E +
> ORIYA AI LENGTH MARK. Is it correct to drop the length mark?
> It's not listed as a combining character. Likewise,
> MYANMAR LETTER UU decomposes to MYANMAR LETTER U +
> MYANMAR VOWEL SIGN II; same question here.

Servus,
Walter

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode error handler

2007-01-30 Thread Walter Dörwald
Rares Vernica wrote:
> Hi,
> 
> Does anyone know of any Unicode encode/decode error handler that does a 
> better replace job than the default replace error handler?
> 
> For example I have an iso-8859-1 string that has an 'e' with an accent 
> (you know, the French 'e's). When I use s.encode('ascii', 'replace') the 
> 'e' will be replaced with '?'. I would prefer to be replaced with an 'e' 
> even if I know it is not 100% correct.
> 
> If only this letter would be the problem I would do it manually, but 
> there is an entire set of letters that need to be replaced with their 
> closest ascii letter.
> 
> Is there an encode/decode error handler that can replace all the 
> not-ascii letters from iso-8859-1 with their closest ascii letter?

You might try the following:

# -*- coding: iso-8859-1 -*-

import unicodedata, codecs

def transliterate(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("don'ty know how to handle %r" % r)
    return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
            exc.start+1)

codecs.register_error("transliterate", transliterate)

print u"Frédéric Chopin".encode("ascii", "transliterate")

Running this script gives you:
$ python transliterate.py
Frederic Chopin

Hope that helps.

Servus,
   Walter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: urllib.unquote and unicode

2006-12-21 Thread Walter Dörwald
Martin v. Löwis wrote:
> Duncan Booth schrieb:
>> The way that uri encoding is supposed to work is that first the input
>> string in unicode is encoded to UTF-8 and then each byte which is not in
>> the permitted range for characters is encoded as % followed by two hex
>> characters. 
> 
> Can you back up this claim ("is supposed to work") by reference to
> a specification (ideally, chapter and verse)?
> 
> In URIs, it is entirely unspecified what the encoding is of non-ASCII
> characters, and whether % escapes denote characters in the first place.

http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

Servus,
   Walter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is htmlGen still alive?

2006-12-19 Thread Walter Dörwald
[EMAIL PROTECTED] wrote:
> Does anybody know whether htmlGen, the Python-class library for
> generating HTML, is still being maintained? Or from where it can be
> downloaded? The Starship site where it used to be hosted is dead.

I don't know if HTMLgen is still alive, but if you're looking for
alternatives, you might give XIST a try
(http://www.livinglogic.de/Python/xist)

Servus,
   Walter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python tools for managing static websites?

2006-10-31 Thread Walter Dörwald
Chris Pearl wrote:

> Are there Python tools to help webmasters manage static websites?
> 
> [...]

You might give XIST a try: http://www.livinglogic.de/Python/xist/

Basically XIST is an HTML generator, that can be extended to generate
the HTML you need for your site. The website
http://www.livinglogic.de/Python/ itself was generated with XIST. You
can find the source for the website here:
http://www.livinglogic.de/viewcvs/index.cgi/LivingLogic/WWW-Python/site/

Hope that helps!

Bye,
   Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode, bytes redux

2006-09-25 Thread Walter Dörwald
Steven D'Aprano wrote:
> On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote:
> 
>> willie <[EMAIL PROTECTED]> writes:
>>> # U+270C
>>> # 11100010 10011100 10001100
>>> buf = "\xE2\x9C\x8C"
>>> u = buf.decode('UTF-8')
>>> # ... later ...
>>> u.bytes() -> 3
>>>
>>> (goes through each code point and calculates
>>> the number of bytes that make up the character
>>> according to the encoding)
>> Duncan Booth explains why that doesn't work.  But I don't see any big
>> problem with a byte count function that lets you specify an encoding:
>>
>>  u = buf.decode('UTF-8')
>>  # ... later ...
>>  u.bytes('UTF-8') -> 3
>>  u.bytes('UCS-4') -> 4
>>
>> That avoids creating a new encoded string in memory, and for some
>> encodings, avoids having to scan the unicode string to add up the
>> lengths.
> 
> Unless I'm misunderstanding something, your bytes code would have to
> perform exactly the same algorithmic calculations as converting the
> encoded string in the first place, except it doesn't need to store the
> newly encoded string, merely the number of bytes of each character.
> 
> Here is a bit of pseudo-code that might do what you want:
> 
> def bytes(unistring, encoding):
>     length = 0
>     for c in unistring:
>         length += len(c.encode(encoding))
>     return length

That wouldn't work for stateful encodings:

>>> len(u"abc".encode("utf-16"))
8
>>> bytes(u"abc", "utf-16")
12

Use a stateful encoder instead:

import codecs

def bytes(unistring, encoding):
    length = 0
    enc = codecs.getincrementalencoder(encoding)()
    for c in unistring:
        length += len(enc.encode(c))
    return length
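
A minimal sanity check of the stateful approach, rewritten for Python 3 (where plain str is what u'' was):

```python
import codecs

def nbytes(s, encoding):
    # Sum the per-character output of ONE stateful encoder instance, so
    # state such as the UTF-16 BOM is counted only once.
    enc = codecs.getincrementalencoder(encoding)()
    return sum(len(enc.encode(c)) for c in s)

print(len("abc".encode("utf-16")))  # 8
print(nbytes("abc", "utf-16"))      # 8 -- matches the one-shot encode
```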

Servus,
   Walter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how to get size of unicode string/string in bytes ?

2006-08-02 Thread Walter Dörwald
Diez B. Roggisch wrote:
>> So then the easiest thing to do is: take the maximum length of a unicode
>> string you could possibly want to store, multiply it by 4 and make that
>> the length of the DB field.
>  
>> However, I'm pretty convinced it is a bad idea to store Python unicode
>> strings directly in a DB, especially as they are not portable. I assume
>> that some DB connectors honour the local platform encoding already, but
>> I'd still say that UTF-8 is your best friend here.
> 
> It was your assumption that the OP wanted to store the "real"
> unicode-strings. A moot point anyway, at it is afaik not possible to get
> their contents in byte form (except from a C-extension).

It is possible:

>>> u"a\xff\uffff\U0010ffff".encode("unicode-internal")
'a\x00\xff\x00\xff\xff\xff\xdb\xff\xdf'

This encoding is useless though, as you can't use it for reencoding on
another platform. (And it's probably not what the OP intended.)

> And assuming 4 bytes per character is a bit wasteful I'd say - especially
> when you have some > 80% ascii-subset in your text as european and american
> languages have.

That would require UTF-32 as an encoding, which Python currently doesn't
have.

> The solution was given before: chose an encoding (utf-8 is certainly the
> most favorable one), and compute the byte-string length.

Exactly!
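
In Python 3 terms the suggested approach is a one-liner (a small sketch, not from the original thread):

```python
# Pick a concrete encoding and measure the encoded length in bytes;
# len() of the string itself counts characters, not bytes.
s = "h\xe9llo"                 # 'héllo'
print(len(s))                  # 5 characters
print(len(s.encode("utf-8")))  # 6 bytes -- é takes two bytes in UTF-8
```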

Servus,
   Walter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Having problems with strings in HTML

2006-06-27 Thread Walter Dörwald
Richard Brodie wrote:
> "Sion Arrowsmith" <[EMAIL PROTECTED]> wrote in message 
> news:[EMAIL PROTECTED]
> 
>>> By the way, you _do_ realize that your "&" characters should be escaped
>>> as "&amp;", don't you?
>> No they shouldn't. They part of the url, which is (IIRC) a CDATA
>> attribute of the A element, not PCDATA.
> 
> It is CDATA but ampersands still need to be escaped. 

Exactly. See
http://www.w3.org/TR/html4/appendix/notes.html#ampersands-in-uris

Bye,
   Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: a good programming text editor (not IDE)

2006-06-16 Thread Walter Dörwald
[EMAIL PROTECTED] wrote:

> John Salerno wrote:
> [snip]
>> Thanks for any suggestions, and again I'm sorry if this feels like the
>> same question as usual (it's just that in my case, I'm not looking for
>> something like SPE, Komodo, Eric3, etc. right now).
> 
> I was taking a peek at c.l.py to check for replies in another thread
> and couldn't help notice your asking about editors.  Please pardon the
> personal pimping, but have you looked at PyPE (pype.sf.net)?

I tried it out and the first problem I noticed is that on Windows
opening a file from a Samba drive doesn't seem to work, as PyPE converts
the filename to lowercase.

Servus,
   Walter

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: curses event handling

2006-06-07 Thread Walter Dörwald
John Hunter wrote:
> I have a curses app that is displaying real time data.  I would like
> to bind certain keys to certain functions, but do not want to block
> waiting for 
>  
>   c = screen.getch() 
> 
> Is it possible to register callbacks with curses, something like
> 
>   screen.register('keypress', myfunc)

You could use curses.halfdelay(), so that screen.getch() doesn't block
indefinitely. I'm not sure if this will be fast enough for your application.

Servus,
   Walter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTMLParser fragility

2006-04-06 Thread Walter Dörwald
Rene Pijlman wrote:
> Lawrence D'Oliveiro:
>> I've been using HTMLParser to scrape Web sites. The trouble with this 
>> is, there's a lot of malformed HTML out there. Real browsers have to be 
>> written to cope gracefully with this, but HTMLParser does not. 
> 
> There are two solutions to this:
> 
> 1. Tidy the source before parsing it.
> http://www.egenix.com/files/python/mxTidy.html
> 
> 2. Use something more foregiving, like BeautifulSoup.
> http://www.crummy.com/software/BeautifulSoup/

You can also use the HTML parser from libxml2 or any of the available
wrappers for it.

Bye,
   Walter Dörwald

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: [ANN] markup.py - 1.2 - an HTML/XML generator

2006-04-04 Thread Walter Dörwald
Peter Hansen wrote:
> Felipe Almeida Lessa wrote:
>> $ pwd
>> /usr/lib/python2.4/site-packages
>> $ grep -re klass . | wc -l
>> 274
>> $ grep -re class_ . | wc -l
>> 897
> 
> How many of those "class_" instances are really just substrings of 
> "__class__" and "class_name" and such?  On my machine, I see a handful 
> in the standard library, and _none_ in site-packages (which has only 
> 1709 .py files, mind you).
> 
>> For me that's enough. "class_" is used at least three times more than
>> "klass". Besides, as Scott pointed out, "class_" is prefered by the
>> guidelines too.
> 
> Actually what he posted explicitly states that "cls" is preferred. 
> Following that it says that one should considering appending _ if the 
> name conflicts with a keyword (and one can assume it means "for all 
> keywords other than class").

No, I think what it means is this: "Use cls as the name of the first
argument in a classmethod. For anything else (i.e. name that are not the
first argument in a classmethod) append an _, if it clashes with a
Python keyword.". So class_ is perfectly OK, if the Python argument maps
to the HTML attribute name.

Bye,
   Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problems (X and X)

2006-03-24 Thread Walter Dörwald
Duncan Booth wrote:

> [...]
> Unfortunately, just as I finished writing this I discovered that the 
> latscii module isn't as robust as I thought, it blows up on consecutive 
> accented characters. 
> 
>  :(

Replace the error handler with this (untested) and it should work with
consecutive accented characters:

def latscii_error(uerr):
    v = []
    for c in uerr.object[uerr.start:uerr.end]:
        key = ord(c)
        try:
            v.append(unichr(decoding_map[key]))
        except KeyError:
            v.append(u"?")
    return (u"".join(v), uerr.end)
codecs.register_error('replacelatscii', latscii_error)

Bye,
   Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode question

2006-03-01 Thread Walter Dörwald
Edward Loper wrote:

> Walter Dörwald wrote:
>> Edward Loper wrote:
>>
>>> [...]
>>> Surely there's a better way than converting back and forth 3 times?  Is
>>> there a reason that the 'backslashreplace' error mode can't be used 
>>> with codecs.decode?
>>>
>>>  >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
>>> Traceback (most recent call last):
File "<stdin>", line 1, in ?
>>> TypeError: don't know how to handle UnicodeDecodeError in error callback
>>
>> The backslashreplace error handler is an *error* *handler*, i.e. it 
>> gives you a replacement text if an input character can't be encoded. 
>> But a backslash character in an 8bit string is no error, so it won't 
>> get replaced on decoding.
> 
> I'm not sure I follow exactly -- the input string I gave as an example 
> did not contain any backslash characters.  Unless by "backslash 
> character" you mean a character c such that ord(c)>127.  I guess it 
> depends on which class of errors you think the error handler should be 
> handling. :)  The codec system's pretty complex, so I'm willing to
> accept on faith that there may be a good reason to have error handlers 
> only make replacements in the encode direction, and not in the decode 
> direction.

Both directions are completely non-symmetric. On encoding an error can 
only happen when the character is unencodable (e.g. for charmap codecs 
anything outside the set of 256 characters). On decoding an error means 
that the byte stream violates the internal format of the encoding. But a 
0x5c byte (i.e. a backslash) in e.g. a latin-1 byte sequence doesn't 
violate the internal format of the latin-1 encoding (nothing does), so 
the error handler never kicks in.
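
For what it's worth, later Python versions (3.5 and up) did extend the backslashreplace handler to the decode direction, which is exactly the behaviour asked about in this thread:

```python
# On Python 3.5+ backslashreplace also works as a *decode* error handler:
# undecodable bytes become literal \xNN escapes instead of raising.
s = b"abc \xff\xe8 def".decode("ascii", "backslashreplace")
print(s)  # abc \xff\xe8 def  (the backslashes are literal characters)
```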

>> What you want is a different codec (try e.g. "string-escape" or 
>> "unicode-escape").
> 
> This is very close, but unfortunately won't quite work for my purposes, 
> because it also puts backslashes before "'" and "\\" and maybe a few 
> other characters.  :-/

OK, seems you're stuck with your decode/encode/decode call.

>  >>> print "test: '\xff'".encode('string-escape').decode('ascii')
> test: \'\xff\'
> 
>  >>> print do_what_i_want("test:\xff'")
> test: '\xff'
> 
> I think I'll just have to stick with rolling my own.

Bye,
Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: unicode question

2006-02-27 Thread Walter Dörwald
Edward Loper wrote:

> [...]
> Surely there's a better way than converting back and forth 3 times?  Is
> there a reason that the 'backslashreplace' error mode can't be used with 
> codecs.decode?
> 
>  >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace')
> Traceback (most recent call last):
File "<stdin>", line 1, in ?
> TypeError: don't know how to handle UnicodeDecodeError in error callback

The backslashreplace error handler is an *error* *handler*, i.e. it 
gives you a replacement text if an input character can't be encoded. But 
a backslash character in an 8bit string is no error, so it won't get 
replaced on decoding.

What you want is a different codec (try e.g. "string-escape" or 
"unicode-escape").

Bye,
Walter Dörwald

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: print UTF-8 file with BOM

2005-12-23 Thread Walter Dörwald
John Bauman wrote:

> UTF-8 shouldn't need a BOM, as it is designed for character streams, and 
> there is only one logical ordering of the bytes. Only UTF-16 and greater 
> should output a BOM, AFAIK. 

However there's a pending patch (http://bugs.python.org/1177307) for a 
new encoding named utf-8-sig, that would output a leading BOM on writing 
and skip it on reading.
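
The codec from that patch did land (Python 2.5 and later, including Python 3); a short demonstration of the round-trip:

```python
# utf-8-sig writes the BOM bytes EF BB BF when encoding and strips a
# leading BOM -- if one is present -- when decoding.
data = "hello".encode("utf-8-sig")
print(data)                          # b'\xef\xbb\xbfhello'
print(data.decode("utf-8-sig"))      # hello
print(b"hello".decode("utf-8-sig"))  # hello -- the BOM is optional on input
```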

Bye,
Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML DOM: XML/XHTML inside a text node

2005-11-04 Thread Walter Dörwald
[EMAIL PROTECTED] wrote:
> In my program, I get input from the user and insert it into an XHTML
> document.  Sometimes, this input will contain XHTML, but since I'm
> inserting it as a text node, xml.dom.minidom escapes the angle brackets
('<' becomes '&lt;', '>' becomes '&gt;'). I want to be able to
> override this behavior cleanly.  I know I could pipe the input through
> a SAX parser and create nodes to insert into the tree, but that seems
> kind of messy.  Is there a better way?

You could try version 2.13 of XIST (http://www.livinglogic.de/Python/xist)

Code looks like this:

from ll.xist.ns import html, specials

text = "Number 1 ... the larch"

e = html.div(
    html.h1("And now for something completely different"),
    html.p(specials.literal(text))
)
print e.asBytes()


This prints:
<div><h1>And now for something completely different</h1><p>Number 1 ...
the larch</p></div>

I hope this is what you need.

Bye,
Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Need a spider library

2005-10-12 Thread Walter Dörwald
Laszlo Zsolt Nagy wrote:

> [...]
> For example this malformed link:
> 
> http://samplesite.current_location/page.html','Samle link']

Your options AFAIK are:
* Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/)
* Various implementations of tidy (uTidyLib, mxTidy)
* XIST (http://www.livinglogic.de/Python/xist)

For XIST code that extracts the above info from a HTML page looks like this:

import sys
from ll import url
from ll.xist import parsers
from ll.xist.ns import html

def links(u):
    node = parsers.parseURL(u, tidy=True, base=None)
    for x in node//html.a:
        yield str(x["href"]), str(u/str(x["href"])), unicode(x)

for data in links(url.URL(sys.argv[1])):
print data

This outputs something like:

('http://www.python.org/', 'http://www.python.org/', u'\r\n')
('http://www.python.org/search/', 'http://www.python.org/search/', 
u'Search')
('http://www.python.org/download/', 'http://www.python.org/download/', 
u'Download')
('http://www.python.org/doc/', 'http://www.python.org/doc/', 
u'Documentation')
...

Hope that helps,
Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: cgi, reusing html. common problem?

2005-09-01 Thread Walter Dörwald
John M. Gabriele wrote:

> I'm putting together a small site using Python and cgi.
> 
> (I'm pretty new to this, but I've worked a little with
> JSP/servlets/Java before.)
> 
> Almost all pages on the site will share some common (and
> static) html, however, they'll also have dynamic aspects.
> I'm guessing that the common way to build sites like this
> is to have every page (which contains active content) be
> generated by a cgi script, but also have some text files
> hanging around containing incomplete html fragments which
> you read and paste-in as-needed (I'm thinking:
> header.html.txt, footer.html.txt, and so on).
> 
> Is that how it's usually done? If not, what *is* the
> usual way of handling this?

I don't know if it's the *usual* way, but you could give XIST a try 
(http://www.livinglogic.de/Python/xist). It was developed for exactly 
this purpose: You implement reusable HTML fragments in Python and you 
can use any kind of embedded dynamic language (PHP and JSP are supported 
out of the box).

Bye,
Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: python html

2005-08-19 Thread Walter Dörwald
Steve Young wrote:

> Hi, I am looking for something where I can go through
> a html page and make change the url's for all the
> links, images, href's, etc... easily. If anyone knows
> of something, please let me know. Thanks.

You might try XIST (http://www.livinglogic.de/Python/xist)

Code might look like this:

from ll.xist import xsc, parsers

node = parsers.parseURL("http://www.python.org/", tidy=True)

for link in node//xsc.URLAttr:
    link[:] = unicode(link).replace(
        "http://www.python.org/",
        "http://www.perl.org/"
    )
print node.asBytes()

Bye,
Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Syntax error after upgrading to Python 2.4

2005-08-10 Thread Walter Dörwald
[EMAIL PROTECTED] wrote:

> Hi,
> 
> After upgrading to 2.4 (from 2.3), I'm getting a weird syntax error:
> 
> 
>>>>import themes
> 
> Traceback (most recent call last):
>   File "", line 1, in ?
>   File "themes.py", line 564
> font = self.font.makeBlackAndWhite(),
>   additive = self.additive,
>  ^
> SyntaxError: invalid syntax
> 
> The relevant code is:
> 
> def makeBlackAndWhite( self ):
> 
> return CharStyle( names = self.names,
>   basedOn = self.basedOn.makeBlackAndWhite(),
>   font = self.font.makeBlackAndWhite(),
>   additive = self.additive,
>   prefixText = self.prefixText )
> 
> This is a method in the CharStyle class which returns a new modified
> instance of CharStyle.
> 
> I'm using Windows XP and Python 2.4.1
> 
> Any ideas? O:-)

This is probably related to http://www.python.org/sf/1163244. Do you 
have a PEP 263 encoding declaration in your file? Can you try 
Lib/codecs.py from current CVS?

Bye,
Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Trimming X/HTML files

2005-07-28 Thread Walter Dörwald
Thomas SMETS wrote:

> 
> 
> Dear,
> 
> I need to parse XHTML/HTML files in all ways:
> - Removing comments and javascripts is a first issue
> - Retrieving the list of fields to submit is my following item (todo)
> 
> Any idea where I could find this already made ... ?

You could try XIST (http://www.livinglogic.de/Python/xist).

Removing comments and javascripts works like this:

---
from ll.xist import xsc, parsers
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

def removestuff(node, converter):
    if isinstance(node, xsc.Comment):
        node = xsc.Null
    elif isinstance(node, html.script) and \
         (unicode(node["type"]) == u"text/javascript" or \
          unicode(node["language"]) == u"Javascript" \
         ):
        node = xsc.Null
    return node

e = e.mapped(removestuff)

print e.asBytes()
---

Retrieving the list of fields from all forms on a page might look like this:

---
from ll.xist import xsc, parsers, xfind
from ll.xist.ns import html

e = parsers.parseURL("http://www.python.org/", tidy=True)

for form in e//html.form:
    print "Fields for %s" % form["action"]
    for field in form//xfind.is_(html.input, html.textarea):
        if "id" in field.attrs:
            print "\t%s" % field["id"]
        else:
            print "\t%s" % field["name"]
---

This prints:

Fields for http://www.google.com/search
q
domains
sitesearch
sourceid
submit

Hope that helps!

Bye,
Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: what is __init__.py used for?

2005-07-05 Thread Walter Dörwald
[EMAIL PROTECTED] wrote:

> I am a new learner of Python Programming Language.
> Now. I am reading a book.
> In the  section relating to module, I see an example.
> the directory tree looks like below:
root\
    system1\
        __init__.py
        utilities.py
        main.py
        other.py
    system2\
        __init__.py
        utilities.py
        main.py
        other.py
    system3\          # Here or elsewhere
        __init__.py   # Your new code here
        myfile.py
> 
> question
> ==
>I was wondering ... what is the __init__.py used for?
>This question may seem to be stupid for an expert.
>But, if you can give the answer, it will be helpful for me.

If the root directory is on the Python search path, you can do "import 
system2.other" or "from system2 import other", to import the other.py 
module. But you can also do "import system2". This means that the source 
code for the system2 module has to live somewhere. __init__.py inside the 
directory with the same name is this "somewhere". Without this 
__init__.py inside the system2 directory you couldn't import other.py 
because Python doesn't know where the source code for system2 lives and 
refuses to treat system2 as a package.
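
A self-contained sketch of the rule (Python 3 syntax; the names mirror the book's layout, and note that Python 3.3+ namespace packages relax the requirement, but the classic rule described here is the __init__.py one):

```python
import os
import sys
import tempfile

# Recreate a slice of the book's layout on disk: a regular package is a
# directory that contains an __init__.py file.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "system2")
os.mkdir(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()  # an empty file is enough
with open(os.path.join(pkg, "other.py"), "w") as f:
    f.write("VALUE = 42\n")

sys.path.insert(0, root)   # put the root on the Python search path
from system2 import other  # system2/__init__.py marks system2 as a package
print(other.VALUE)         # 42
```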

Hope that helps,
Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: MySQL: 'latin-1' codec can't encode character

2005-05-13 Thread Walter Dörwald
Fredrik Lundh wrote:

> [...]
> if you want more control of the replacement, you can skip the translate
> step and use your own error handler, e.g.
> 
> charmap = ... see above ...
> 
> def fixunicode(info):
>     s = info.object[info.start:info.end]
>     try:
>         return charmap[ord(s)], info.end

This will fail if there's more than one consecutive unencodable 
character, better use
return charmap[ord(s[0])], info.start+1
or
return "".join(charmap.get(ord(c), u"&#%d;" % ord(c)) for c in s), info.end
(without the try:) instead.
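
The corrected pattern as a runnable sketch (Python 3 syntax; the charmap contents here are made-up placeholders, not from the original mail):

```python
import codecs

# Hypothetical replacement table: code point -> ASCII substitute.
charmap = {0xe9: "e", 0xe8: "e"}

def fixunicode(err):
    if not isinstance(err, UnicodeEncodeError):
        raise err
    # Map every character in the failing run individually, so consecutive
    # unencodable characters are all handled in one call.
    s = err.object[err.start:err.end]
    return "".join(charmap.get(ord(c), "&#%d;" % ord(c)) for c in s), err.end

codecs.register_error("fixunicode", fixunicode)
print("caf\xe9\xe8!".encode("ascii", "fixunicode"))  # b'cafee!'
```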

Bye,
Walter Dörwald
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML cleaner?

2005-04-25 Thread Walter Dörwald
Ivan Voras wrote:
M.-A. Lemburg wrote:
Not true: mxTidy integrates tidy as C lib. It's not an interface
to the command line tool.
Thanks, I'll look at it again!
Another option might be the HTML parser (libxml2.htmlReadMemory()) from 
libxml2 (http://www.xmlsoft.org)

Bye,
   Walter Dörwald
--
http://mail.python.org/mailman/listinfo/python-list


Re: xmlproc maintainer?

2005-03-18 Thread Walter Dörwald
Alban Hertroys wrote:
We recently (about a week ago) sent a patch to the maintainer of 
xmlproc, but we didn't receive a reply yet. A look at the site reveals 
that the last update was somewhere in 2000.

Does anybody know who the current maintainer is (if that changed), or 
what the status of xmlproc is? We kind of depend on it...

The patch fixes a buffering problem if the XML contains utf-8 codes, 
which gets especially problematic if one such character pair starts as 
the last byte in the buffer... Patch attached, in case someone can use it.
This should no longer be an issue with Python 2.4, because the stateful 
UTF-8 and UTF-16 decoders have been fixed to support incomplete input.

Unfortunately xmlproc doesn't seem to use the stateful decoder but the 
stateless decoder (and even handcrafted decoders when the codecs module 
doesn't exist). Adding support for this might be a little tricky, because 
the parser must determine which encoding to use before instantiating the 
decoder.
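
The fix referred to above is what makes a stateful decoder safe for buffered input; in Python 3 syntax (a small illustration, not part of the original mail):

```python
import codecs

# A stateful decoder buffers an incomplete multi-byte sequence instead of
# raising; here U+20AC (three UTF-8 bytes) arrives split across two reads.
dec = codecs.getincrementaldecoder("utf-8")()
print(ascii(dec.decode(b"\xe2\x82")))  # ''        -- incomplete, buffered
print(ascii(dec.decode(b"\xac")))      # '\u20ac'  -- completed on next read
```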

Bye,
   Walter Dörwald
--
http://mail.python.org/mailman/listinfo/python-list


Re: unicode encoding usablilty problem

2005-02-18 Thread Walter Dörwald
aurora wrote:
> [...]
In Java they are distinct data types and the compiler would catch all 
incorrect usage. In Python, the interpreter seems to 'help' us by 
promoting binary strings to unicode. Things work fine, unit tests pass, 
all until the first non-ASCII characters come in and then the program 
breaks.

Is there a scheme for Python developers to use so that they are safe 
from incorrect mixing?
Put the following:
import sys
sys.setdefaultencoding("undefined")
in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.
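
(Python 3 later made this strictness the default: there is no implicit conversion between the two types at all, so mixing fails immediately; a quick illustration under that assumption:)

```python
# In Python 3, str and bytes never mix implicitly -- the equivalent of the
# "undefined" default-encoding trick is built into the language.
try:
    b"binary" + "text"
except TypeError as e:
    print("rejected:", e)
```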
HTH,
   Walter Dörwald
--
http://mail.python.org/mailman/listinfo/python-list


Re: Trouble with the encoding of os.getcwd() in Korean Windows

2005-02-09 Thread Walter Dörwald
Erik Bethke wrote:
Hello All,
sorry for all the posts... I am *almost* there now...
okay I have this code:
import sys, os, locale
encoding = locale.getpreferredencoding()
htmlpath = os.getcwd()
htmlpath = htmlpath.decode(encoding)
You might want to try os.getcwdu() instead of this. According to
http://www.python.org/doc/2.4/lib/os-file-dir.html
this has been added in Python 2.3 and should work on Windows.
Bye,
   Walter Dörwald
--
http://mail.python.org/mailman/listinfo/python-list


Re: Unicode universe (was Re: Dr. Dobb's Python-URL! - weekly Python news and links (Dec 30))

2005-01-04 Thread Walter Dörwald
Skip Montanaro wrote:
aahz> Here's the stark simple recipe: when you use Unicode, you *MUST*
aahz> switch to a Unicode-centric view of the universe.  Therefore you
aahz> encode *FROM* Unicode and you decode *TO* Unicode.  Period.  It's
aahz> similar to the way floating point contaminates ints.
That's what I do in my code.  Why do Unicode objects have a decode method
then?
Because MAL implemented it! >;->
It first encodes in the default encoding and then decodes the result
with the specified encoding, so if u is a unicode object
   u.decode("utf-16")
is an abbreviation of
   u.encode().decode("utf-16")
In the same way str has an encode method, so
   s.encode("utf-16")
is an abbreviation of
   s.decode().encode("utf-16")
Bye,
   Walter Dörwald
--
http://mail.python.org/mailman/listinfo/python-list


Re: Small Problem P 2.4 (line>2048 Bytes)

2004-12-15 Thread Walter Dörwald
>> [...]
>> After searching, I found that the problem comes from a "long line" (more
>> than 2048 characters), which begins:
>> mappingcharmaj = { chr(97):'A', chr(98):'B', chr(99):'C', ...
>> 
>> And, if I "break" in multiples lines, the problem is solved.

This sounds like bug http://www.python.org/sf/1076985
"Incorrect behaviour of StreamReader.readline leads to crash".

Are you using a PEP 263 coding header for your script?

Bye,
   Walter Dörwald
   

--
http://mail.python.org/mailman/listinfo/python-list