Re: Python 2.6 StreamReader.readline()
On 25.07.12 08:09, Ulrich Eckhardt wrote:
> On 24.07.2012 17:01, cpppw...@gmail.com wrote:
>> reader = codecs.getreader(encoding)
>> lines = []
>> with open(filename, 'rb') as f:
>>     lines = reader(f, 'strict').readlines(keepends=False)
>>
>> where encoding == 'utf-16-be'. Everything works fine, except that
>> lines[0] is equal to codecs.BOM_UTF16_BE. Is this behaviour correct,
>> that the BOM is still present?
>
> Yes, assuming the first line only contains that BOM. Technically it's
> a space character, and why should those be removed?

If the first "character" in the file is a BOM, the file encoding is
probably not utf-16-be but utf-16.

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
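The difference between the two codecs is easy to demonstrate (shown here in Python 3 syntax; the original thread used Python 2):

```python
import codecs

data = codecs.BOM_UTF16_BE + "hello".encode("utf-16-be")

# "utf-16-be" has a fixed byte order, so the BOM decodes as an
# ordinary U+FEFF character at the start of the text ...
decoded_be = data.decode("utf-16-be")

# ... while "utf-16" uses the BOM to detect the byte order and
# strips it from the result.
decoded = data.decode("utf-16")

assert decoded_be == "\ufeffhello"
assert decoded == "hello"
```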
Re: Issues with `codecs.register` and `codecs.CodecInfo` objects
On 07.07.12 04:56, Steven D'Aprano wrote:
> On Fri, 06 Jul 2012 12:55:31 -0400, Karl Knechtel wrote:
>> Hello all,
>>
>> While attempting to make a wrapper for opening multiple types of
>> UTF-encoded files (more on that later, in a separate post, I guess), I
>> ran into some oddities with the `codecs` module, specifically to do
>> with `.register`ing `CodecInfo` objects. I'd like to report a bug or
>> something, but there are several intertangled issues here and I'm not
>> really sure how to report it so I thought I'd open the discussion.
>> Apologies in advance if I get a bit rant-y, and a warning that this is
>> fairly long.
> [...]
> Yes, it's a strangely indirect API, and yes it looks like you have
> identified a whole bucket full of problems with it. And no, I don't
> know why that API was chosen.

This API was chosen for backwards compatibility reasons when incremental
encoders/decoders were introduced (in 2006). And yes: We missed the
opportunity to clean that up to always use CodecInfo. Changing to a
cleaner, more direct (sensible?) API would be a fairly big step.

> If you want to pursue this, the steps I recommend you take are:
>
> 1) understanding the reason for the old API (search the Internet and
> particularly the python-...@python.org archives);

See e.g. http://mail.python.org/pipermail/patches/2006-March/019122.html

> 2) have a plan for how to avoid breaking code that relies on the
> existing API;
>
> 3) raise the issue on python-id...@python.org to gather feedback and
> see how much opposition or support it is likely to get; they'll suggest
> whether a bug report is sufficient or if you'll need a PEP;

http://www.python.org/dev/peps/

> If you can provide a patch and a test suite, you will have a much
> better chance of pushing it through. If not, you are reliant on
> somebody else who can being interested enough to do the work.
>
> And one last thing: any new functionality will simply *not* be
> considered for Python 2.x. Aim for Python 3.4, since the 2.x series is
> now in bug-fix only maintenance mode and the 3.3 beta is no longer
> accepting new functionality, only bug fixes.

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
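For reference, the indirection being discussed looks like this in current Python: codecs.register() takes a *search function* that maps an encoding name to a CodecInfo object. A minimal sketch (the "myascii" name and the delegation to the built-in ascii codec are made up for illustration):

```python
import codecs

def search(name):
    # codecs.register() expects a function that returns a CodecInfo
    # for names it knows, and None for everything else.
    if name == "myascii":
        base = codecs.lookup("ascii")  # delegate to the real codec
        return codecs.CodecInfo(
            encode=base.encode,
            decode=base.decode,
            incrementalencoder=base.incrementalencoder,
            incrementaldecoder=base.incrementaldecoder,
            streamreader=base.streamreader,
            streamwriter=base.streamwriter,
            name="myascii",
        )
    return None

codecs.register(search)
assert "abc".encode("myascii") == b"abc"
```

The backwards-compatibility wrinkle mentioned above is that search functions historically returned plain 4-tuples of functions, which is why CodecInfo is a tuple subclass that stays duck-compatible with them.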
Re: Why are some unicode error handlers "encode only"?
On 11.03.12 15:37, Steven D'Aprano wrote:
> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this? I don't see why they shouldn't work for decoding as well.

Because xmlcharrefreplace and backslashreplace are *error* handlers.
However the byte sequence b'&#12345;' does *not* contain any bytes that
are not decodable for e.g. the ASCII codec. So there are no errors to
handle.

> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' is an illegal multibyte sequence for CP-932
> (also known as MS-KANJI or SHIFT-JIS).
>
> Is there some reason why this shouldn't or can't be supported?

The byte sequence b'\xe9!' however is not something that would have been
produced by the backslashreplace error handler. b'\\xe9!' (a sequence
containing 5 bytes) would have been (and this probably would decode
without any problems with the cp932 codec).

> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> => r'aaa--騷--\xe9\x21--bbb'
>
> and similarly for xmlcharrefreplace.

This would require a postprocessing step *after* the bytes have been
decoded. This is IMHO out of scope for Python's codec machinery.

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
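The encode-side behaviour that *is* documented can be checked directly (Python 3):

```python
# Characters the target codec cannot represent are replaced by
# pure-ASCII escape sequences -- which is why the handlers' output
# never shows up as "errors" on the decode side.
backslash = "aé".encode("ascii", "backslashreplace")
charref = "aé".encode("ascii", "xmlcharrefreplace")

assert backslash == b"a\\xe9"   # U+00E9 written as \xe9
assert charref == b"a&#233;"    # U+00E9 written as a decimal char ref
```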
Re: replacing words in HTML file
On 28.04.10 15:02, james_027 wrote:
> hi,
>
> Any idea how I can replace words in a html file? Meaning only the
> content will get replaced while the html tags, javascript, & css
> remain untouched.

You could try XIST (http://www.livinglogic.de/Python/xist/):

Example code:

from ll.xist import xsc, parsers

def p2p(node, converter):
    if isinstance(node, xsc.Text):
        node = node.replace("Python", "Parrot")
        node = node.replace("python", "parrot")
    return node

node = parsers.parseurl("http://www.python.org/", tidy=True)
node = node.mapped(p2p)
node.write(open("parrot_index.html", "wb"))

Hope that helps!

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
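If a third-party dependency like XIST is not an option, the same only-touch-the-text idea can be sketched with the standard library's html.parser. TextReplacer and its handling of script/style content are my own illustration, not a drop-in tool:

```python
from html.parser import HTMLParser

class TextReplacer(HTMLParser):
    """Replace words in text content only; tags, attributes,
    scripts and styles pass through untouched."""

    def __init__(self, replacements):
        super().__init__(convert_charrefs=False)
        self.replacements = replacements
        self.out = []
        self.skip = 0  # depth of <script>/<style> nesting

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Only plain text outside <script>/<style> gets rewritten.
        if not self.skip:
            for old, new in self.replacements.items():
                data = data.replace(old, new)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

r = TextReplacer({"Python": "Parrot", "python": "parrot"})
r.feed('<p title="Python">I like Python</p><script>python = 1</script>')
r.close()
result = "".join(r.out)
assert result == '<p title="Python">I like Parrot</p><script>python = 1</script>'
```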
Re: how to write a unicode string to a file ?
On 17.10.09 08:28, Mark Tolonen wrote:
> "Kee Nethery" wrote in message
> news:aaab63c6-6e44-4c07-b119-972d4f49e...@kagi.com...
>>
>> On Oct 16, 2009, at 5:49 PM, Stephen Hansen wrote:
>>
>>> On Fri, Oct 16, 2009 at 5:07 PM, Stef Mientki wrote:
>>
>> snip
>>
>>> The thing is, I'd be VERY surprised (neigh, shocked!) if Excel can't
>>> open a file that is in UTF8 -- it just might need to be TOLD that its
>>> utf8 when you go and open the file, as UTF8 looks just like ASCII --
>>> until it contains characters that can't be expressed in ASCII. But I
>>> don't know what type of file it is you're saving.
>>
>> We found that UTF-16 was required for Excel. It would not "do the
>> right thing" when presented with UTF-8.
>
> Excel seems to expect a UTF-8-encoded BOM (byte order mark) to correctly
> decide a file is written in UTF-8. This worked for me:
>
> f = codecs.open('test.csv', 'wb', 'utf-8')
> f.write(u'\ufeff')  # write a BOM
> f.write(u'马克,testing,123\r\n')
> f.close()

That can also be done with the utf-8-sig codec (which adds a BOM at the
start on writing):

f = codecs.open('test.csv', 'wb', 'utf-8-sig')
f.write(u'马克,testing,123\r\n')
f.close()

See http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
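The BOM handling of utf-8-sig is symmetric, which is easy to verify (Python 3 syntax):

```python
import codecs

# utf-8-sig prepends the BOM when encoding ...
encoded = "马克,testing,123".encode("utf-8-sig")
assert encoded == codecs.BOM_UTF8 + "马克,testing,123".encode("utf-8")

# ... and strips it again when decoding.
assert encoded.decode("utf-8-sig") == "马克,testing,123"
```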
Re: HTMLgen???
On 16.10.09 05:44, alex23 wrote: > On Oct 15, 6:58 pm, an...@vandervlies.xs4all.nl wrote: >> Does HTMLgen (Robin Friedrich's) still exsist?? And, if so, where can it >> be found? > > If you're after an easy to use html generator, I highly recommend > Richard Jones' html[1] lib. It's new, supported and makes very nice > use of context managers. > > [1]: http://pypi.python.org/pypi/html Another alternative is XIST at http://www.livinglogic.de/Python/xist/ which supports more than simple HTML. Examples can be found here: http://www.livinglogic.de/Python/xist/Examples.html Servus, Walter -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode issue
On 01.10.09 17:50, Rami Chowdhury wrote:
> On Thu, 01 Oct 2009 08:10:58 -0700, Walter Dörwald wrote:
>
>> On 01.10.09 16:09, Hyuga wrote:
>>> On Sep 30, 3:34 am, gentlestone wrote:
>>>> Why don't work this code on Python 2.6? Or how can I do this job?
>>>>
>>>> [snip _MAP]
>>>>
>>>> def downcode(name):
>>>>     """
>>>>     >>> downcode(u"Žabovitá zmiešaná kaša")
>>>>     u'Zabovita zmiesana kasa'
>>>>     """
>>>>     for key, value in _MAP.iteritems():
>>>>         name = name.replace(key, value)
>>>>     return name
>>>
>>> Though C Python is pretty optimized under the hood for this sort of
>>> single-character replacement, this still seems pretty inefficient
>>> since you're calling replace for every character you want to map. I
>>> think that a better approach might be something like:
>>>
>>> def downcode(name):
>>>     return ''.join(_MAP.get(c, c) for c in name)
>>>
>>> Or using string.translate:
>>>
>>> import string
>>> def downcode(name):
>>>     table = string.maketrans(
>>>         'ÀÁÂÃÄÅ...',
>>>         'AA...')
>>>     return name.translate(table)
>>
>> Or even simpler:
>>
>> import unicodedata
>>
>> def downcode(name):
>>     return unicodedata.normalize("NFD", name)\
>>         .encode("ascii", "ignore")\
>>         .decode("ascii")
>>
>> Servus,
>>    Walter
>
> As I understand it, the "ignore" argument to str.encode *removes* the
> undecodable characters, rather than replacing them with an ASCII
> approximation. Is that correct? If so, wouldn't that rather defeat the
> purpose?

Yes, but any accented characters have been split into the base character
and the combining accent via normalize() before, so only the accent gets
removed. Of course non-decomposable characters will be removed
completely, but it would be possible to replace

   .encode("ascii", "ignore").decode("ascii")

with something like this:

   u"".join(c for c in name if unicodedata.category(c) != "Mn")

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
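Putting the thread together, a runnable Python 3 version of the decompose-and-drop-accents approach, using the category test instead of the encode/decode round trip:

```python
import unicodedata

def downcode(name):
    # Decompose accented characters, then drop the combining marks
    # (Unicode category "Mn") while keeping every base character.
    return "".join(
        c for c in unicodedata.normalize("NFD", name)
        if unicodedata.category(c) != "Mn"
    )

result = downcode("Žabovitá zmiešaná kaša")
assert result == "Zabovita zmiesana kasa"
```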
Re: unicode issue
On 01.10.09 16:09, Hyuga wrote:
> On Sep 30, 3:34 am, gentlestone wrote:
>> Why don't work this code on Python 2.6? Or how can I do this job?
>>
>> _MAP = {
>>     # LATIN
>>     u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A',
>>     u'Æ': 'AE', u'Ç':'C',
>>     u'È': 'E', u'É': 'E', u'Ê': 'E', u'Ë': 'E', u'Ì': 'I', u'Í': 'I',
>>     u'Î': 'I',
>>     u'Ï': 'I', u'Ð': 'D', u'Ñ': 'N', u'Ò': 'O', u'Ó': 'O', u'Ô': 'O',
>>     u'Õ': 'O', u'Ö':'O',
>>     u'Ő': 'O', u'Ø': 'O', u'Ù': 'U', u'Ú': 'U', u'Û': 'U', u'Ü': 'U',
>>     u'Ű': 'U',
>>     u'Ý': 'Y', u'Þ': 'TH', u'ß': 'ss', u'à':'a', u'á':'a', u'â': 'a',
>>     u'ã': 'a', u'ä':'a',
>>     u'å': 'a', u'æ': 'ae', u'ç': 'c', u'è': 'e', u'é': 'e', u'ê': 'e',
>>     u'ë': 'e',
>>     u'ì': 'i', u'í': 'i', u'î': 'i', u'ï': 'i', u'ð': 'd', u'ñ': 'n',
>>     u'ò': 'o', u'ó':'o',
>>     u'ô': 'o', u'õ': 'o', u'ö': 'o', u'ő': 'o', u'ø': 'o', u'ù': 'u',
>>     u'ú': 'u',
>>     u'û': 'u', u'ü': 'u', u'ű': 'u', u'ý': 'y', u'þ': 'th', u'ÿ': 'y',
>>     # LATIN_SYMBOLS
>>     u'©':'(c)',
>>     # GREEK
>>     u'α':'a', u'β':'b', u'γ':'g', u'δ':'d', u'ε':'e', u'ζ':'z',
>>     u'η':'h', u'θ':'8',
>>     u'ι':'i', u'κ':'k', u'λ':'l', u'μ':'m', u'ν':'n', u'ξ':'3',
>>     u'ο':'o', u'π':'p',
>>     u'ρ':'r', u'σ':'s', u'τ':'t', u'υ':'y', u'φ':'f', u'χ':'x',
>>     u'ψ':'ps', u'ω':'w',
>>     u'ά':'a', u'έ':'e', u'ί':'i', u'ό':'o', u'ύ':'y', u'ή':'h',
>>     u'ώ':'w', u'ς':'s',
>>     u'ϊ':'i', u'ΰ':'y', u'ϋ':'y', u'ΐ':'i',
>>     u'Α':'A', u'Β':'B', u'Γ':'G', u'Δ':'D', u'Ε':'E', u'Ζ':'Z',
>>     u'Η':'H', u'Θ':'8',
>>     u'Ι':'I', u'Κ':'K', u'Λ':'L', u'Μ':'M', u'Ν':'N', u'Ξ':'3',
>>     u'Ο':'O', u'Π':'P',
>>     u'Ρ':'R', u'Σ':'S', u'Τ':'T', u'Υ':'Y', u'Φ':'F', u'Χ':'X',
>>     u'Ψ':'PS', u'Ω':'W',
>>     u'Ά':'A', u'Έ':'E', u'Ί':'I', u'Ό':'O', u'Ύ':'Y', u'Ή':'H',
>>     u'Ώ':'W', u'Ϊ':'I', u'Ϋ':'Y',
>>     # TURKISH
>>     u'ş':'s', u'Ş':'S', u'ı':'i', u'İ':'I', u'ç':'c', u'Ç':'C',
>>     u'ü':'u', u'Ü':'U',
>>     u'ö':'o', u'Ö':'O', u'ğ':'g', u'Ğ':'G',
>>     # RUSSIAN
>>     u'а':'a', u'б':'b', u'в':'v', u'г':'g', u'д':'d', u'е':'e',
>>     u'ё':'yo', u'ж':'zh',
>>     u'з':'z', u'и':'i', u'й':'j', u'к':'k', u'л':'l', u'м':'m',
>>     u'н':'n', u'о':'o',
>>     u'п':'p', u'р':'r', u'с':'s', u'т':'t', u'у':'u', u'ф':'f',
>>     u'х':'h', u'ц':'c',
>>     u'ч':'ch', u'ш':'sh', u'щ':'sh', u'ъ':'', u'ы':'y', u'ь':'',
>>     u'э':'e', u'ю':'yu', u'я':'ya',
>>     u'А':'A', u'Б':'B', u'В':'V', u'Г':'G', u'Д':'D', u'Е':'E',
>>     u'Ё':'Yo', u'Ж':'Zh',
>>     u'З':'Z', u'И':'I', u'Й':'J', u'К':'K', u'Л':'L', u'М':'M',
>>     u'Н':'N', u'О':'O',
>>     u'П':'P', u'Р':'R', u'С':'S', u'Т':'T', u'У':'U', u'Ф':'F',
>>     u'Х':'H', u'Ц':'C',
>>     u'Ч':'Ch', u'Ш':'Sh', u'Щ':'Sh', u'Ъ':'', u'Ы':'Y', u'Ь':'',
>>     u'Э':'E', u'Ю':'Yu', u'Я':'Ya',
>>     # UKRAINIAN
>>     u'Є':'Ye', u'І':'I', u'Ї':'Yi', u'Ґ':'G', u'є':'ye', u'і':'i',
>>     u'ї':'yi', u'ґ':'g',
>>     # CZECH
>>     u'č':'c', u'ď':'d', u'ě':'e', u'ň':'n', u'ř':'r', u'š':'s',
>>     u'ť':'t', u'ů':'u',
>>     u'ž':'z', u'Č':'C', u'Ď':'D', u'Ě':'E', u'Ň':'N', u'Ř':'R',
>>     u'Š':'S', u'Ť':'T', u'Ů':'U', u'Ž':'Z',
>>     # POLISH
>>     u'ą':'a', u'ć':'c', u'ę':'e', u'ł':'l', u'ń':'n', u'ó':'o',
>>     u'ś':'s', u'ź':'z',
>>     u'ż':'z', u'Ą':'A', u'Ć':'C', u'Ę':'e', u'Ł':'L', u'Ń':'N',
>>     u'Ó':'o', u'Ś':'S',
>>     u'Ź':'Z', u'Ż':'Z',
>>     # LATVIAN
>>     u'ā':'a', u'č':'c', u'ē':'e', u'ģ':'g', u'ī':'i', u'ķ':'k',
>>     u'ļ':'l', u'ņ':'n',
>>     u'š':'s', u'ū':'u', u'ž':'z', u'Ā':'A', u'Č':'C', u'Ē':'E',
>>     u'Ģ':'G', u'Ī':'i',
>>     u'Ķ':'k', u'Ļ':'L', u'Ņ':'N', u'Š':'S', u'Ū':'u', u'Ž':'Z'
>> }
>>
>> def downcode(name):
>>     """
>>     >>> downcode(u"Žabovitá zmiešaná kaša")
>>     u'Zabovita zmiesana kasa'
>>     """
>>     for key, value in _MAP.iteritems():
>>         name = name.replace(key, value)
>>     return name
>
> Though C Python is pretty optimized under the hood for this sort of
> single-character replacement, this still seems pretty inefficient
> since you're calling replace for every character you want to map. I
> think that a better approach might be something like:
>
> def downcode(name):
>     return ''.join(_MAP.get(c, c) for c in name)
>
> Or using string.translate:
>
> import string
> def downcode(name):
>     table = string.maketrans(
>         'ÀÁÂÃÄÅ...',
>         'AA...')
>     return name.translate(table)

Or even simpler:

import unicodedata

def downcode(name):
    return unicodedata.normalize("NFD", name)\
        .encode("ascii", "ignore")\
        .decode("ascii")

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
>> "correct" -> "corrected"
>
> Thanks, fixed.
>
>>> To convert non-decodable bytes, a new error handler "python-escape" is
>>> introduced, which decodes non-decodable bytes using into a private-use
>>> character U+F01xx, which is believed to not conflict with private-use
>>> characters that currently exist in Python codecs.
>>
>> Would this mean that real private use characters in the file name would
>> raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
>> any error handler.
>
> The python-escape codec is only used/meaningful if the env encoding
> is not UTF-8. For any other encoding, it is assumed that no character
> actually maps to the private-use characters.

Which should be true for any encoding from the pre-unicode era, but not
for UTF-16/32 and variants.

>>> The error handler interface is extended to allow the encode error
>>> handler to return byte strings immediately, in addition to returning
>>> Unicode strings which then get encoded again.
>>
>> Then the error callback for encoding would become specific to the
>> target encoding.
>
> Why would it become specific? It can work the same way for any encoding:
> take U+F01xx, and generate the byte xx.

If any error callback emits bytes, these byte sequences must be legal in
the target encoding, which depends on the target encoding itself.
However for the normal use of this error handler this might be
irrelevant, because those filenames that get encoded were constructed in
such a way that reencoding them regenerates the original byte sequence.

>>> If the locale's encoding is UTF-8, the file system encoding is set to
>>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>>
>> Is this done by the codec, or the error handler? If it's done by the
>> codec I don't see a reason for the "python-escape" error handler.
>
> utf-8b is a new codec. However, the utf-8b codec is only used if the
> env encoding would otherwise be utf-8. For utf-8b, the error handler
> is indeed unnecessary.

Wouldn't it make more sense to be consistent in how non-decodable bytes
get decoded? I.e. should the utf-8b codec decode those bytes to PUA
characters too (and refuse to encode them, so the error handler outputs
them)?

>>> While providing a uniform API to non-decodable bytes, this interface
>>> has the limitation that chosen representation only "works" if the data
>>> get converted back to bytes with the python-escape error handler
>>> also.
>>
>> I thought the error handler would be used for decoding.
>
> It's used in both directions: for decoding, it converts \xXX to
> U+F01XX. For encoding, U+F01XX will trigger an error, which is then
> handled by the handler to produce \xXX.

But only for non-UTF8 encodings?

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
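PEP 383 was eventually accepted, and in released Python 3 the utf-8b behaviour discussed here surfaced as the "surrogateescape" error handler (usable with any codec) rather than as a separate codec:

```python
# Undecodable bytes >= 0x80 decode to lone surrogates U+DC80..U+DCFF ...
raw = b"caf\xe9"  # latin-1 bytes, not valid UTF-8
s = raw.decode("utf-8", "surrogateescape")
assert s == "caf\udce9"

# ... and encoding with the same handler regenerates the original bytes.
assert s.encode("utf-8", "surrogateescape") == raw
```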
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.
>
> Regards,
> Martin
>
> PEP: 383
> Title: Non-decodable Bytes in System Character Interfaces
> Version: $Revision: 71793 $
> Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
> Author: Martin v. Löwis
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 22-Apr-2009
> Python-Version: 3.1
> Post-History:
>
> Abstract
> ========
>
> File names, environment variables, and command line arguments are
> defined as being character data in POSIX; the C APIs however allow
> passing arbitrary bytes - whether these conform to a certain encoding
> or not. This PEP proposes a means of dealing with such irregularities
> by embedding the bytes in character strings in such a way that allows
> recreation of the original byte string.
>
> Rationale
> =========
>
> The C char type is a data type that is commonly used to represent both
> character data and bytes. Certain POSIX interfaces are specified and
> widely understood as operating on character data, however, the system
> call interfaces make no assumption on the encoding of these data, and
> pass them on as-is. With Python 3, character strings use a
> Unicode-based internal representation, making it difficult to ignore
> the encoding of byte strings in the same way that the C interfaces can
> ignore the encoding.
>
> On the other hand, Microsoft Windows NT has correct the original

"correct" -> "corrected"

> design limitation of Unix, and made it explicit in its system
> interfaces that these data (file names, environment variables, command
> line arguments) are indeed character data, by providing a
> Unicode-based API (keeping a C-char-based one for backwards
> compatibility).
>
> [...]
>
> Specification
> =============
>
> On Windows, Python uses the wide character APIs to access
> character-oriented APIs, allowing direct conversion of the
> environmental data to Python str objects.
>
> On POSIX systems, Python currently applies the locale's encoding to
> convert the byte data to Unicode. If the locale's encoding is UTF-8,
> it can represent the full set of Unicode characters, otherwise, only a
> subset is representable. In the latter case, using private-use
> characters to represent these bytes would be an option. For UTF-8,
> doing so would create an ambiguity, as the private-use characters may
> regularly occur in the input also.
>
> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes using into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.

Would this mean that real private use characters in the file name would
raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
any error handler.

> The error handler interface is extended to allow the encode error
> handler to return byte strings immediately, in addition to returning
> Unicode strings which then get encoded again.

Then the error callback for encoding would become specific to the target
encoding. Would this mean that the handler checks which encoding is used
and behaves like "strict" if it doesn't recognize the encoding?

> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Is this done by the codec, or the error handler? If it's done by the
codec I don't see a reason for the "python-escape" error handler.

> Discussion
> ==========
>
> While providing a uniform API to non-decodable bytes, this interface
> has the limitation that chosen representation only "works" if the data
> get converted back to bytes with the python-escape error handler
> also.

I thought the error handler would be used for decoding.

> Encoding the data with the locale's encoding and the (default)
> strict error handler will raise an exception, encoding them with UTF-8
> will produce non-sensical data.
>
> For most applications, we assume that they eventually pass data
> received from a system interface back into the same system
> interfaces. For example, and application invoking os.listdir() will

"and" -> "an"

> likely pass the result strings back into APIs like os.stat() or
> open(), which then encodes them back into their original byte
> representation. Applications that need to process the original byte
> strings can obtain them by encoding the character strings with the
> file system encoding, passing "python-escape" as the error handler
> name.

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
Re: [2.5.1] ShiftJIS to Unicode?
Gilles Ganault wrote:
> Hello
>
> I'm trying to read pages from Amazon JP, whose web pages are
> supposed to be encoded in ShiftJIS, and decode contents into Unicode
> to keep Python happy:
>
> www.amazon.co.jp
>
> But this doesn't work:
>
> ==
> m = try.search(the_page)
> if m:
>     # UnicodeEncodeError: 'charmap' codec can't encode characters in
>     # position 49-55: character maps to <undefined>
>     title = m.group(1).decode('shift_jis').strip()
> ==

There's something fishy going on: You're calling the decode method and
get a UnicodeEncodeError. This means that you're calling the decode
method on something that already *is* unicode.

What does

print type(m.group(1))

output?

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
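The underlying trap: in Python 2, calling .decode() on a unicode object first *encodes* it with the default ASCII codec, which is how a decode call can surface a Unicode*Encode*Error. Python 3 removed the trap by splitting bytes and str entirely; a sketch of the intended direction (the sample text is made up):

```python
# Only bytes objects have .decode(); str objects only have .encode().
page_bytes = "アマゾン".encode("shift_jis")  # bytes as fetched over HTTP
title = page_bytes.decode("shift_jis")       # decode exactly once

assert title == "アマゾン"
assert not hasattr(title, "decode")  # the Python 2 mistake can't compile here
```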
Re: ANN: XML builder for Python
Jonas Galvez wrote:
> Walter Dörwald wrote:
>> XIST has been using with blocks since version 3.0.
>> [...]
>> with xsc.Frag() as node:
>>     +xml.XML()
>>     +html.DocTypeXHTML10transitional()
>>     with html.html():
>> [...]
>
> Sweet! I don't like having to use the unary operator tho, I wanted
> something as simple as possible, so I wouldn't even have to assign a
> variable on the with block ("as something").

You only have to assign the node a name in the outermost with block so
that you can use the node object afterwards. But of course you can
always implement the outermost __enter__/__exit__ in such a way that
the node gets written to an output stream immediately.

> I plan to add some validation and error checking, but for generating
> feeds for my Atom store it's reasonably fast and lean (just over 50
> lines of code).

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
Re: ANN: XML builder for Python
Stefan Behnel wrote:
> Hi,
>
> Walter Dörwald wrote:
>> XIST has been using with blocks since version 3.0. Take a look at:
>> http://www.livinglogic.de/Python/xist/Examples.html
>>
>> from __future__ import with_statement
>> from ll.xist import xsc
>> from ll.xist.ns import html, xml, meta
>>
>> with xsc.Frag() as node:
>>     +xml.XML()
>>     +html.DocTypeXHTML10transitional()
>>     with html.html():
>>         with html.head():
>>             +meta.contenttype()
>>             +html.title("Example page")
>>         with html.body():
>>             +html.h1("Welcome to the example page")
>>             with html.p():
>>                 +xsc.Text("This example page has a link to the ")
>>                 +html.a("Python home page", href="http://www.python.org/")
>>                 +xsc.Text(".")
>>
>> print node.conv().bytes(encoding="us-ascii")
>
> Interesting. Is the "+" actually required? Are there other operators
> that make sense here? I do not see what "~" or "-" could mean.

Of course the node constructor could append the node to the currently
active element. However there might be cases where you want to do
something else with the newly created node, so always appending the
node is IMHO the wrong thing.

> Are there other operators that make sense here? I do not see what "~"
> or "-" could mean.
>
> Or is it just a technical constraint?

You need *one* operator/method that appends a node to the currently
active block without opening another new block. This operator should be
short to type and should have the right connotations. I find that unary
+ is perfect for that.

> I'm asking because I consider adding such a syntax to lxml as a
> separate module. And I'd prefer copying an existing syntax over a
> (badly) home grown one.

"Existing syntax" might be a little exaggeration, I know of no other
Python package that uses __pos__ for something similar. (But then
again, I know of no other Python package that uses with blocks for
generating XML ;)).

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
Re: ANN: XML builder for Python
Stefan Behnel wrote:
> Stefan Behnel wrote:
>> Jonas Galvez wrote:
>>> Not sure if it's been done before, but still...
>>
>> Obviously ;)
>> http://codespeak.net/lxml/tutorial.html#the-e-factory
>> ... and tons of other tools that generate XML, check PyPI.
>
> Although it might be the first time I see the with statement "misused"
> for this. :)

XIST has been using with blocks since version 3.0. Take a look at:
http://www.livinglogic.de/Python/xist/Examples.html

from __future__ import with_statement
from ll.xist import xsc
from ll.xist.ns import html, xml, meta

with xsc.Frag() as node:
    +xml.XML()
    +html.DocTypeXHTML10transitional()
    with html.html():
        with html.head():
            +meta.contenttype()
            +html.title("Example page")
        with html.body():
            +html.h1("Welcome to the example page")
            with html.p():
                +xsc.Text("This example page has a link to the ")
                +html.a("Python home page", href="http://www.python.org/")
                +xsc.Text(".")

print node.conv().bytes(encoding="us-ascii")

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
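The same with-block shape can be sketched with nothing but the standard library, using a stack of open elements. Builder, tag and text are hypothetical names for illustration, not XIST's API (and there is no unary + here, so children are always appended):

```python
import contextlib
import xml.etree.ElementTree as ET

class Builder:
    """Minimal with-block XML builder on top of xml.etree."""

    def __init__(self, root):
        self.root = ET.Element(root)
        self.stack = [self.root]  # innermost open element is last

    @contextlib.contextmanager
    def tag(self, name, **attrs):
        # Open a child element; it stays "current" for the with body.
        el = ET.SubElement(self.stack[-1], name, attrs)
        self.stack.append(el)
        yield el
        self.stack.pop()

    def text(self, s):
        # Append text at the current position (text or tail as needed).
        el = self.stack[-1]
        if len(el):
            el[-1].tail = (el[-1].tail or "") + s
        else:
            el.text = (el.text or "") + s

b = Builder("html")
with b.tag("body"):
    with b.tag("p"):
        b.text("Hello, ")
        with b.tag("a", href="http://www.python.org/"):
            b.text("Python")
        b.text("!")

out = ET.tostring(b.root, encoding="unicode")
assert out == ('<html><body><p>Hello, '
               '<a href="http://www.python.org/">Python</a>!</p></body></html>')
```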
Re: convert xhtml back to html
Arnaud Delobelle wrote:
> "Tim Arnold" <[EMAIL PROTECTED]> writes:
>> hi, I've got lots of xhtml pages that need to be fed to MS HTML
>> Workshop to create CHM files. That application really hates xhtml,
>> so I need to convert self-ending tags (e.g. <br/>) to plain html
>> (e.g. <br>). Seems simple enough, but I'm having some trouble with
>> it. regexps trip up because I also have to take into account 'img',
>> 'meta', 'link' tags, not just the simple 'br' and 'hr' tags. Well,
>> maybe there's a simple way to do that with regexps, but my
>> simpleminded <[^(<>)]+/> doesn't work. I'm not enough of a regexp
>> pro to figure out that lookahead stuff.
>
> Hi, I'm not sure if this is very helpful but the following works on
> the very simple example below.
>
> import re
> xhtml = 'hello spam bye '
> xtag = re.compile(r'<([^>]*?)/>')
> xtag.sub(r'<\1>', xhtml)
> 'hello spam bye '

You might try XIST (http://www.livinglogic.de/Python/xist):

Code looks like this:

from ll.xist import parsers
from ll.xist.ns import html

xhtml = 'hello spam bye '
doc = parsers.parsestring(xhtml)
print doc.bytes(xhtml=0)

This outputs:

hello spam bye

(and a warning that the alt attribute is missing in the img ;))

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
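For what it's worth, no lookahead is needed: a character class that excludes angle brackets keeps the match inside a single tag, and an optional \s* absorbs the space before the slash. A sketch (the sample markup is invented; attribute values containing '<' or '>' would simply be left unconverted):

```python
import re

xhtml = 'hello<br/><img src="x.png" alt="spam" />bye<hr />'

# [^<>] can't cross a tag boundary; \s* eats the space before "/>".
plain = re.sub(r'<([^<>]*?)\s*/>', r'<\1>', xhtml)

assert plain == 'hello<br><img src="x.png" alt="spam">bye<hr>'
```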
Re: Generating HTML
Sebastian Bassi wrote: > Hello, > > What are people using these days to generate HTML? I still use > HTMLgen, but I want to know if there are new options. I don't > want/need a web-framework a la Zope, just want to produce valid HTML > from Python. If you want something that works similar to HTMLgen, you could use XIST: http://www.livinglogic.de/Python/xist/ Servus, Walter -- http://mail.python.org/mailman/listinfo/python-list
Re: Replacement for HTMLGen?
Joshua J. Kugler wrote: > I realize that in today's MVC-everything world, the mere mention of > generating HTML in the script is near heresy, but for now, it's what I ened > to do. :) > > That said, can someone recommend a good replacement for HTMLGen? I've found > good words about it (http://www.linuxjournal.com/article/2986), but every > reference to it I find points to a non-existant page > (http://starship.python.net/lib.html is 404, > http://www.python2.net/lib.html is not responding, > http://starship.python.net/crew/friedrich/HTMLgen/html/main.html is 404) > Found http://www.python.org/ftp/python/contrib-09-Dec-1999/Network/, but > that seems a bit old. > > I found http://dustman.net/andy/python/HyperText, but it's not listed in > Cheeseshop, and its latest release is over seven years ago. Granted, I > know HTML doesn't change (much) but it's at least nice to know something > you're going to be using is maintained. > > Any suggestions or pointers? You might try XIST: http://www.livinglogic.de/Python/xist/ Hope that helps! Servus, Walter -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error handler
[EMAIL PROTECTED] wrote:
> On Jan 30, 11:28 pm, Walter Dörwald <[EMAIL PROTECTED]> wrote:
>
>> codecs.register_error("transliterate", transliterate)
>>
>>    Walter
>
> Really, really slick solution.
> Though, why was it [:1], not [0]? ;-)

No particular reason, unicodedata.normalize("NFD", ...) should never
return an empty string.

> And one more thing:
>> def transliterate(exc):
>>     if not isinstance(exc, UnicodeEncodeError):
>>         raise TypeError("don'ty know how to handle %r" % r)
> I don't understand what %r and r are and where they are from. The man
> 3 printf page doesn't have %r formatting.

%r means format the repr() result, and r was supposed to be exc. ;)

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error handler
Martin v. Löwis wrote:
> Walter Dörwald wrote:
>> You might try the following:
>>
>> # -*- coding: iso-8859-1 -*-
>>
>> import unicodedata, codecs
>>
>> def transliterate(exc):
>>     if not isinstance(exc, UnicodeEncodeError):
>>         raise TypeError("don'ty know how to handle %r" % r)
>>     return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
>>         exc.start+1)
>
> I think a number of special cases need to be studied here.
> I would expect that this is "semantically correct" if the characters
> being dropped are combining characters (at least in the languages I'm
> familiar with, it is common to drop them for transliteration).

True, it might make sense to limit the error handler to handling latin
characters.

> However, if you do
>
> py> for i in range(65536):
> ...     c = unicodedata.normalize("NFD", unichr(i))
> ...     for c2 in c[1:]:
> ...         if not unicodedata.combining(c2): print hex(i),; break
>
> you'll see that there are many characters which don't decompose
> into a base character + sequence of combining characters. In
> particular, this involves all hangul syllables (U+AC00..U+D7A3),
> for which it is just incorrect to drop the "jungseongs"
> (is that proper wording?).

Of course the above error handler only makes sense when the decomposed
codepoints are encodable in the target encoding. For your hangul
example neither u"\uac00" nor the decomposed version u"\u1100\u1161"
is encodable.

> There are also some cases which I'm completely uncertain about,
> e.g. ORIYA VOWEL SIGN AI decomposes to ORIYA VOWEL SIGN E +
> ORIYA AI LENGTH MARK. Is it correct to drop the length mark?
> It's not listed as a combining character. Likewise,
> MYANMAR LETTER UU decomposes to MYANMAR LETTER U +
> MYANMAR VOWEL SIGN II; same question here.

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error handler
Rares Vernica wrote:
> Hi,
>
> Does anyone know of any Unicode encode/decode error handler that does
> a better replace job than the default replace error handler?
>
> For example I have an iso-8859-1 string that has an 'e' with an accent
> (you know, the French 'e's). When I use s.encode('ascii', 'replace')
> the 'e' will be replaced with '?'. I would prefer to be replaced with
> an 'e' even if I know it is not 100% correct.
>
> If only this letter would be the problem I would do it manually, but
> there is an entire set of letters that need to be replaced with their
> closest ascii letter.
>
> Is there an encode/decode error handler that can replace all the
> not-ascii letters from iso-8859-1 with their closest ascii letter?

You might try the following:

# -*- coding: iso-8859-1 -*-

import unicodedata, codecs

def transliterate(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("don'ty know how to handle %r" % r)
    return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
        exc.start+1)

codecs.register_error("transliterate", transliterate)

print u"Frédéric Chopin".encode("ascii", "transliterate")

Running this script gives you:

$ python transliterate.py
Frederic Chopin

Hope that helps.

Servus,
   Walter
--
http://mail.python.org/mailman/listinfo/python-list
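The same handler, with the stray `r` replaced by `exc` (a slip acknowledged in a follow-up post), runs unchanged on Python 3, where all strings are unicode:

```python
import codecs
import unicodedata

def transliterate(exc):
    # Replace an unencodable character with the first character of its
    # canonical decomposition, i.e. the base character without accents.
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("don't know how to handle %r" % exc)
    return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
            exc.start + 1)

codecs.register_error("transliterate", transliterate)

assert "Frédéric Chopin".encode("ascii", "transliterate") == b"Frederic Chopin"
```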
Re: urllib.unquote and unicode
Martin v. Löwis wrote: > Duncan Booth schrieb: >> The way that uri encoding is supposed to work is that first the input >> string in unicode is encoded to UTF-8 and then each byte which is not in >> the permitted range for characters is encoded as % followed by two hex >> characters. > > Can you back up this claim ("is supposed to work") by reference to > a specification (ideally, chapter and verse)? > > In URIs, it is entirely unspecified what the encoding is of non-ASCII > characters, and whether % escapes denote characters in the first place. http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1 Servus, Walter -- http://mail.python.org/mailman/listinfo/python-list
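For what it's worth, the scheme Duncan describes is exactly what later Python versions implement in urllib.parse — a sketch in Python 3 syntax (not the 2.x urllib under discussion here):

```python
from urllib.parse import quote, unquote

# 'é' (U+00E9) first becomes the UTF-8 bytes C3 A9, and each byte outside
# the unreserved set is then escaped as %XX.
escaped = quote("é")
assert escaped == "%C3%A9"

# unquote() reverses both steps: unescape to bytes, then decode as UTF-8.
assert unquote(escaped) == "é"
```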
Re: Is htmlGen still alive?
[EMAIL PROTECTED] wrote: > Does anybody know whether htmlGen, the Python-class library for > generating HTML, is still being maintained? Or from where it can be > downloaded? The Starship site where it used to be hosted is dead. I don't know if HTMLgen is still alive, but if you're looking for alternatives, you might give XIST a try (http://www.livinglogic.de/Python/xist) Servus, Walter -- http://mail.python.org/mailman/listinfo/python-list
Re: Python tools for managing static websites?
Chris Pearl wrote: > Are there Python tools to help webmasters manage static websites? > > [...] You might give XIST a try: http://www.livinglogic.de/Python/xist/ Basically XIST is an HTML generator, that can be extended to generate the HTML you need for your site. The website http://www.livinglogic.de/Python/ itself was generated with XIST. You can find the source for the website here: http://www.livinglogic.de/viewcvs/index.cgi/LivingLogic/WWW-Python/site/ Hope that helps! Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode, bytes redux
Steven D'Aprano wrote: > On Mon, 25 Sep 2006 00:45:29 -0700, Paul Rubin wrote: > >> willie <[EMAIL PROTECTED]> writes: >>> # U+270C >>> # 11100010 10011100 10001100 >>> buf = "\xE2\x9C\x8C" >>> u = buf.decode('UTF-8') >>> # ... later ... >>> u.bytes() -> 3 >>> >>> (goes through each code point and calculates >>> the number of bytes that make up the character >>> according to the encoding) >> Duncan Booth explains why that doesn't work. But I don't see any big >> problem with a byte count function that lets you specify an encoding: >> >> u = buf.decode('UTF-8') >> # ... later ... >> u.bytes('UTF-8') -> 3 >> u.bytes('UCS-4') -> 4 >> >> That avoids creating a new encoded string in memory, and for some >> encodings, avoids having to scan the unicode string to add up the >> lengths. > > Unless I'm misunderstanding something, your bytes code would have to > perform exactly the same algorithmic calculations as converting the > encoded string in the first place, except it doesn't need to store the > newly encoded string, merely the number of bytes of each character. > > Here is a bit of pseudo-code that might do what you want: > > def bytes(unistring, encoding): > length = 0 > for c in unistring: > length += len(c.encode(encoding)) > return length That wouldn't work for stateful encodings: >>> len(u"abc".encode("utf-16")) 8 >>> bytes(u"abc", "utf-16") 12 Use a stateful encoder instead: import codecs def bytes(unistring, encoding): length = 0 enc = codecs.getincrementalencoder(encoding)() for c in unistring: length += len(enc.encode(c)) return length Servus, Walter -- http://mail.python.org/mailman/listinfo/python-list
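The difference between the naive per-character count and the stateful one can be demonstrated directly — a sketch in Python 3 syntax, where str plays the role of unicode:

```python
import codecs

def bytes_len(s, encoding):
    # A stateful encoder emits the UTF-16 BOM only once, on its first
    # call, so the per-character sum matches a one-shot encode.
    enc = codecs.getincrementalencoder(encoding)()
    return sum(len(enc.encode(c)) for c in s)

assert len("abc".encode("utf-16")) == 8                   # BOM + three code units
assert bytes_len("abc", "utf-16") == 8                    # stateful: correct
assert sum(len(c.encode("utf-16")) for c in "abc") == 12  # naive: BOM repeated per char
```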
Re: how to get size of unicode string/string in bytes ?
Diez B. Roggisch wrote: >> So then the easiest thing to do is: take the maximum length of a unicode >> string you could possibly want to store, multiply it by 4 and make that >> the length of the DB field. > >> However, I'm pretty convinced it is a bad idea to store Python unicode >> strings directly in a DB, especially as they are not portable. I assume >> that some DB connectors honour the local platform encoding already, but >> I'd still say that UTF-8 is your best friend here. > > It was your assumption that the OP wanted to store the "real" > unicode-strings. A moot point anyway, at it is afaik not possible to get > their contents in byte form (except from a C-extension). It is possible: >>> u"a\xff\uffff\U0010ffff".encode("unicode-internal") 'a\x00\xff\x00\xff\xff\xff\xdb\xff\xdf' This encoding is useless though, as you can't use it for reencoding on another platform. (And it's probably not what the OP intended.) > And assuming 4 bytes per character is a bit dissipative I'd say - especially > when you have some > 80% ascii-subset in your text as european and american > languages have. That would require UTF-32 as an encoding, which Python currently doesn't have. > The solution was given before: chose an encoding (utf-8 is certainly the > most favorable one), and compute the byte-string length. Exactly! Servus, Walter -- http://mail.python.org/mailman/listinfo/python-list
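The suggested approach — pick an encoding and measure the encoded bytes — is a one-liner; a sketch in Python 3 syntax:

```python
def utf8_len(text):
    # Byte length of the text under UTF-8: encode first, then count.
    return len(text.encode("utf-8"))

assert utf8_len("abc") == 3           # plain ASCII: one byte each
assert utf8_len("äöü") == 6           # Latin-1 range: two bytes each
assert utf8_len("\U0001F40D") == 4    # a non-BMP codepoint takes four bytes
```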
Re: Having problems with strings in HTML
Richard Brodie wrote: > "Sion Arrowsmith" <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] > >>> By the way, you _do_ realize that your "&" characters should be escaped >>> as "&", don't you? >> No they shouldn't. They part of the url, which is (IIRC) a CDATA >> attribute of the A element, not PCDATA. > > It is CDATA but ampersands still need to be escaped. Exactly. See http://www.w3.org/TR/html4/appendix/notes.html#ampersands-in-uris Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: a good programming text editor (not IDE)
[EMAIL PROTECTED] wrote: > John Salerno wrote: > [snip] >> Thanks for any suggestions, and again I'm sorry if this feels like the >> same question as usual (it's just that in my case, I'm not looking for >> something like SPE, Komodo, Eric3, etc. right now). > > I was taking a peek at c.l.py to check for replies in another thread > and couldn't help notice your asking about editors. Please pardon the > personal pimping, but have you looked at PyPE (pype.sf.net)? I tried it out and the first problem I noticed is that on Windows opening a file from a Samba drive doesn't seem to work, as PyPE converts the filename to lowercase. Servus, Walter -- http://mail.python.org/mailman/listinfo/python-list
Re: curses event handling
John Hunter wrote: > I have a curses app that is displaying real time data. I would like > to bind certain keys to certain functions, but do not want to block > waiting for > > c = screen.getch() > > Is it possible to register callbacks with curses, something like > > screen.register('keypress', myfunc) You could use curses.halfdelay(), so that screen.getch() doesn't block indefinitely. I'm not sure if this will be fast enough for your application. Servus, Walter -- http://mail.python.org/mailman/listinfo/python-list
Re: HTMLParser fragility
Rene Pijlman wrote: > Lawrence D'Oliveiro: >> I've been using HTMLParser to scrape Web sites. The trouble with this >> is, there's a lot of malformed HTML out there. Real browsers have to be >> written to cope gracefully with this, but HTMLParser does not. > > There are two solutions to this: > > 1. Tidy the source before parsing it. > http://www.egenix.com/files/python/mxTidy.html > > 2. Use something more foregiving, like BeautifulSoup. > http://www.crummy.com/software/BeautifulSoup/ You can also use the HTML parser from libxml2 or any of the available wrappers for it. Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: [ANN] markup.py - 1.2 - an HTML/XML generator
Peter Hansen wrote: > Felipe Almeida Lessa wrote: >> $ pwd >> /usr/lib/python2.4/site-packages >> $ grep -re klass . | wc -l >> 274 >> $ grep -re class_ . | wc -l >> 897 > > How many of those "class_" instances are really just substrings of > "__class__" and "class_name" and such? On my machine, I see a handful > in the standard library, and _none_ in site-packages (which has only > 1709 .py files, mind you). > >> For me that's enough. "class_" is used at least three times more than >> "klass". Besides, as Scott pointed out, "class_" is prefered by the >> guidelines too. > > Actually what he posted explicitly states that "cls" is preferred. > Following that it says that one should considering appending _ if the > name conflicts with a keyword (and one can assume it means "for all > keywords other than class"). No, I think what it means is this: "Use cls as the name of the first argument in a classmethod. For anything else (i.e. name that are not the first argument in a classmethod) append an _, if it clashes with a Python keyword.". So class_ is perfectly OK, if the Python argument maps to the HTML attribute name. Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: encoding problems (X and X)
Duncan Booth wrote: > [...] > Unfortunately, just as I finished writing this I discovered that the > latscii module isn't as robust as I thought, it blows up on consecutive > accented characters. > > :( Replace the error handler with this (untested) and it should work with consecutive accented characters: def latscii_error( uerr ): v = [] for c in uerr.object[uerr.start:uerr.end]: key = ord(c) try: v.append(unichr(decoding_map[key])) except KeyError: v.append(u"?") return (u"".join(v), uerr.end) codecs.register_error('replacelatscii', latscii_error) Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode question
Edward Loper wrote: > Walter Dörwald wrote: >> Edward Loper wrote: >> >>> [...] >>> Surely there's a better way than converting back and forth 3 times? Is >>> there a reason that the 'backslashreplace' error mode can't be used >>> with codecs.decode? >>> >>> >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace') >>> Traceback (most recent call last): >>>File "", line 1, in ? >>> TypeError: don't know how to handle UnicodeDecodeError in error callback >> >> The backslashreplace error handler is an *error* *handler*, i.e. it >> gives you a replacement text if an input character can't be encoded. >> But a backslash character in an 8bit string is no error, so it won't >> get replaced on decoding. > > I'm not sure I follow exactly -- the input string I gave as an example > did not contain any backslash characters. Unless by "backslash > character" you mean a character c such that ord(c)>127. I guess it > depends on which class of errors you think the error handler should be > handling. :) The codec system's pretty complex, so I'm willing to > accept on faith that there may be a good reason to have error handlers > only make replacements in the encode direction, and not in the decode > direction. Both directions are completely non-symmetric. On encoding an error can only happen when the character is unencodable (e.g. for charmap codecs anything outside the set of 256 characters). On decoding an error means that the byte stream violates the internal format of the encoding. But a 0x5c byte (i.e. a backslash) in e.g. a latin-1 byte sequence doesn't violate the internal format of the latin-1 encoding (nothing does), so the error handler never kicks in. >> What you want is a different codec (try e.g. "string-escape" or >> "unicode-escape"). > > This is very close, but unfortunately won't quite work for my purposes, > because it also puts backslashes before "'" and "\\" and maybe a few > other characters. :-/ OK, seems you're stuck with your decode/encode/decode call. 
> >>> print "test: '\xff'".encode('string-escape').decode('ascii') > test: \'\xff\' > > >>> print do_what_i_want("test:\xff'") > test: '\xff' > > I think I'll just have to stick with rolling my own. Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
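As a postscript: the restriction discussed in this thread was later lifted — since Python 3.5 the backslashreplace handler also works on *decoding*, so the decode/encode/decode round-trip is no longer needed (Python 3 sketch):

```python
# Undecodable bytes are turned into \xNN escapes instead of raising.
text = b"abc \xff\xe8 def".decode("ascii", "backslashreplace")
assert text == "abc \\xff\\xe8 def"
```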
Re: unicode question
Edward Loper wrote: > [...] > Surely there's a better way than converting back and forth 3 times? Is > there a reason that the 'backslashreplace' error mode can't be used with > codecs.decode? > > >>> 'abc \xff\xe8 def'.decode('ascii', 'backslashreplace') > Traceback (most recent call last): >File "", line 1, in ? > TypeError: don't know how to handle UnicodeDecodeError in error callback The backslashreplace error handler is an *error* *handler*, i.e. it gives you a replacement text if an input character can't be encoded. But a backslash character in an 8bit string is no error, so it won't get replaced on decoding. What you want is a different codec (try e.g. "string-escape" or "unicode-escape"). Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: print UTF-8 file with BOM
John Bauman wrote: > UTF-8 shouldn't need a BOM, as it is designed for character streams, and > there is only one logical ordering of the bytes. Only UTF-16 and greater > should output a BOM, AFAIK. However there's a pending patch (http://bugs.python.org/1177307) for a new encoding named utf-8-sig, that would output a leading BOM on writing and skip it on reading. Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
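The patch was indeed accepted: the utf-8-sig codec has been part of Python since 2.5. A sketch of its behaviour in Python 3 syntax:

```python
import codecs

# Encoding prepends the UTF-8-encoded BOM (the "signature") ...
data = "abc".encode("utf-8-sig")
assert data == codecs.BOM_UTF8 + b"abc"

# ... and decoding skips it, whether or not it is present.
assert data.decode("utf-8-sig") == "abc"
assert b"abc".decode("utf-8-sig") == "abc"
```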
Re: XML DOM: XML/XHTML inside a text node
[EMAIL PROTECTED] wrote: > In my program, I get input from the user and insert it into an XHTML > document. Sometimes, this input will contain XHTML, but since I'm > inserting it as a text node, xml.dom.minidom escapes the angle brackets > ('<' becomes '&lt;', '>' becomes '&gt;'). I want to be able to > override this behavior cleanly. I know I could pipe the input through > a SAX parser and create nodes to insert into the tree, but that seems > kind of messy. Is there a better way? You could try version 2.13 of XIST (http://www.livinglogic.de/Python/xist) Code looks like this: from ll.xist.ns import html, specials text = "Number 1 ... the larch" e = html.div( html.h1("And now for something completely different"), html.p(specials.literal(text)) ) print e.asBytes() This prints: <div><h1>And now for something completely different</h1><p>Number 1 ... the larch</p></div> I hope this is what you need. Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: Need a spider library
Laszlo Zsolt Nagy wrote: > [...] > For example this malformed link: > > http://samplesite.current_location/page.html','Samle link'] Your options AFAIK are: * Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) * Various implementations of tidy (uTidyLib, mxTidy) * XIST (http://www.livinglogic.de/Python/xist) For XIST code that extracts the above info from a HTML page looks like this: import sys from ll import url from ll.xist import parsers from ll.xist.ns import html def links(u): node = parsers.parseURL(u, tidy=True, base=None) for x in node//html.a: yield str(x["href"]), str(u/str(x["href"])), unicode(x) for data in links(url.URL(sys.argv[1])): print data This outputs something like: ('http://www.python.org/', 'http://www.python.org/', u'\r\n') ('http://www.python.org/search/', 'http://www.python.org/search/', u'Search') ('http://www.python.org/download/', 'http://www.python.org/download/', u'Download') ('http://www.python.org/doc/', 'http://www.python.org/doc/', u'Documentation') ... Hope that helps, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: cgi, reusing html. common problem?
John M. Gabriele wrote: > I'm putting together a small site using Python and cgi. > > (I'm pretty new to this, but I've worked a little with > JSP/servlets/Java before.) > > Almost all pages on the site will share some common (and > static) html, however, they'll also have dynamic aspects. > I'm guessing that the common way to build sites like this > is to have every page (which contains active content) be > generated by a cgi script, but also have some text files > hanging around containing incomplete html fragments which > you read and paste-in as-needed (I'm thinking: > header.html.txt, footer.html.txt, and so on). > > Is that how it's usually done? If not, what *is* the > usual way of handling this? I don't know if it's the *usual* way, but you could give XIST a try (http://www.livinglogic.de/Python/xist). It was developed for exactly this purpose: You implement reusable HTML fragments in Python and you can use any kind of embedded dynamic language (PHP and JSP are supported out of the box). Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: python html
Steve Young wrote: > Hi, I am looking for something where I can go through > a html page and make change the url's for all the > links, images, href's, etc... easily. If anyone knows > of something, please let me know. Thanks. You might try XIST (http://www.livinglogic.de/Python/xist) Code might look like this: from ll.xist import xsc, parsers node = parsers.parseURL("http://www.python.org/", tidy=True) for link in node//xsc.URLAttr: link[:] = unicode(link).replace( "http://www.python.org/", "http://www.perl.org/" ) print node.asBytes() Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: Syntax error after upgrading to Python 2.4
[EMAIL PROTECTED] wrote: > Hi, > > After upgrading to 2.4 (from 2.3), I'm getting a weird syntax error: > > >>>>import themes > > Traceback (most recent call last): > File "", line 1, in ? > File "themes.py", line 564 > font = self.font.makeBlackAndWhite(), > additive = self.additive, > ^ > SyntaxError: invalid syntax > > The relevant code is: > > def makeBlackAndWhite( self ): > > return CharStyle( names = self.names, > basedOn = self.basedOn.makeBlackAndWhite(), > font = self.font.makeBlackAndWhite(), > additive = self.additive, > prefixText = self.prefixText ) > > This is a method in the CharStyle class which returns a new modified > instance of CharStyle. > > I'm using Windows XP and Python 2.4.1 > > Any ideas? O:-) This is probably related to http://www.python.org/sf/1163244. Do you have a PEP 263 encoding declaration in your file? Can you try Lib/codecs.py from current CVS? Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: Trimming X/HTML files
Thomas SMETS wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > > Dear, > > I need to parse XHTML/HTML files in all ways : > ~ _ Removing comments and javascripts is a first issue > ~ _ Retrieving the list of fields to submit is my following item (todo) > > Any idea where I could find this already made ... ? You could try XIST (http://www.livinglogic.de/Python/xist). Removing comments and javascripts works like this: --- from ll.xist import xsc, parsers from ll.xist.ns import html e = parsers.parseURL("http://www.python.org/", tidy=True) def removestuff(node, converter): if isinstance(node, xsc.Comment): node = xsc.Null elif isinstance(node, html.script) and \ (unicode(node["type"]) == u"text/javascript" or \ unicode(node["language"]) == u"Javascript" \ ): node = xsc.Null return node e = e.mapped(removestuff) print e.asBytes() --- Retrieving the list of fields from all forms on a page might look like this: --- from ll.xist import xsc, parsers, xfind from ll.xist.ns import html e = parsers.parseURL("http://www.python.org/", tidy=True) for form in e//html.form: print "Fields for %s" % form["action"] for field in form//xfind.is_(html.input, html.textarea): if "id" in field.attrs: print "\t%s" % field["id"] else: print "\t%s" % field["name"] --- This prints: Fields for http://www.google.com/search q domains sitesearch sourceid submit Hope that helps! Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: what is __init__.py used for?
[EMAIL PROTECTED] wrote: > I am a new learner of Python Programming Language. > Now. I am reading a book. > In the section relating to module, I see an example. > the directory tree looks like below: > root\ > system1\ > __init__.py > utilities.py > main.py > other.py > system2\ > __init__.py > utilities.py > main.py > other.py > system3\ # Here or elsewhere > __init__.py # Your new code here > myfile.py > > question > == >I was wonderring ... what is the __init__.py used for ? >This question may seems to be stupid for an expert. >But, if you can give the answer, it will be helpful for me. If the root directory is on the Python search path, you can do "import system2.other" or "from system2 import other", to import the other.py module. But you can also do "import system2". This means that the source code for the system2 module has to live somewhere. __init__.py inside the directory with the same name is this "somewhere". Without this __init__.py inside the system2 directory you couldn't import other.py because Python doesn't know where the source code for system2 lives and refuses to treat system2 as a package. Hope that helps, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
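The layout from the book can be reproduced at runtime to see the marker file in action — a sketch in Python 3 syntax (the scratch-directory scaffolding and the VALUE attribute are mine, only the system2/other names come from the example; note that Python 3.3+ can also import marker-less "namespace packages", but a regular package still uses __init__.py):

```python
import os, sys, tempfile

# Build root/system2/{__init__.py, other.py} in a scratch directory.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "system2")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()  # marks system2 as a package
with open(os.path.join(pkg, "other.py"), "w") as f:
    f.write("VALUE = 42\n")

sys.path.insert(0, root)
from system2 import other  # resolved via root/system2/__init__.py

assert other.VALUE == 42
```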
Re: MySQL: 'latin-1' codec can't encode character
Fredrik Lundh wrote: > [...] > if you want more control of the replacement, you can skip the translate > step and use your own error handler, e.g. > > charmap = ... see above ... > > def fixunicode(info): > s = info.object[info.start:info.end] > try: > return charmap[ord(s)], info.end This will fail if there's more than one consecutive unencodable character, better use return charmap[ord(s[0])], info.start+1 or return "".join(charmap.get(ord(c), u"" % ord(c)) for c in s), info.end (without the try:) instead. Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
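For completeness, a runnable version of the corrected handler — a Python 3 sketch in which the two-entry charmap and the "?" fallback are made up for illustration, not Fredrik's original table:

```python
import codecs

charmap = {0xE9: "e", 0xE8: "e"}  # hypothetical table: é -> e, è -> e

def fixunicode(err):
    # Replace the whole failing run, so several consecutive unencodable
    # characters are handled correctly in a single call.
    s = err.object[err.start:err.end]
    return "".join(charmap.get(ord(c), "?") for c in s), err.end

codecs.register_error("fixunicode", fixunicode)

assert "répète".encode("ascii", "fixunicode") == b"repete"
assert "réé!".encode("ascii", "fixunicode") == b"ree!"  # consecutive accents
```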
Re: HTML cleaner?
Ivan Voras wrote: > M.-A. Lemburg wrote: >> Not true: mxTidy integrates tidy as C lib. It's not an interface to the command line tool. > Thanks, I'll look at it again! Another option might be the HTML parser (libxml2.htmlReadMemory()) from libxml2 (http://www.xmlsoft.org) Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: xmlproc maintainer?
Alban Hertroys wrote: We recently (about a week ago) sent a patch to the maintainer of xmlproc, but we didn't receive a reply yet. A look at the site reveals that the last update was somewhere in 2000. Does anybody know who the current maintainer is (if that changed), or what the status of xmlproc is? We kind of depend on it... The patch fixes a buffering problem if the XML contains utf-8 codes, which gets especially problematic if one such character pair starts as the last byte in the buffer... Patch attached, in case someone can use it. This should no longer be an issue with Python 2.4, because the stateful UTF-8 and UTF-16 decoders have been fixed to support incomplete input. Unfortunately xmlproc doesn't seem to use the stateful decoder but the stateless decoder (and even handcrafted decoders when the codecs module doesn't exist). Adding support for this might be a little tricky, because the parser must determine which encoding to use before instantiating the decoder. Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode encoding usablilty problem
aurora wrote: > [...] In Java they are distinct data type and the compiler would catch all incorrect usage. In Python, the interpreter seems to 'help' us to promote binary string to unicode. Things works fine, unit tests pass, all until the first non-ASCII characters come in and then the program breaks. Is there a scheme for Python developer to use so that they are safe from incorrect mixing? Put the following: import sys sys.setdefaultencoding("undefined") in a file named sitecustomize.py somewhere in your Python path and Python will complain whenever there's an implicit conversion between str and unicode. HTH, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: Trouble with the encoding of os.getcwd() in Korean Windows
Erik Bethke wrote: Hello All, sorry for all the posts... I am *almost* there now... okay I have this code: import sys, os encoding = locale.getpreferredencoding() htmlpath = os.getcwd() htmlpath = htmlpath.decode( encoding ) You might want to try os.getcwdu() instead of this. According to http://www.python.org/doc/2.4/lib/os-file-dir.html this has been added in Python 2.3 and should work on Windows. Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode universe (was Re: Dr. Dobb's Python-URL! - weekly Python news and links (Dec 30))
Skip Montanaro wrote: aahz> Here's the stark simple recipe: when you use Unicode, you *MUST* aahz> switch to a Unicode-centric view of the universe. Therefore you aahz> encode *FROM* Unicode and you decode *TO* Unicode. Period. It's aahz> similar to the way floating point contaminates ints. That's what I do in my code. Why do Unicode objects have a decode method then? Because MAL implemented it! >;-> It first encodes in the default encoding and then decodes the result with the specified encoding, so if u is a unicode object u.decode("utf-16") is an abbreviation of u.encode().decode("utf-16") In the same way str has an encode method, so s.encode("utf-16") is an abbreviation of s.decode().encode("utf-16") Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list
Re: Small Problem P 2.4 (line>2048 Bytes)
>> [...] >> After search, I had found that the problem come from a "long line" (more >> than 2048 caracters), with begin : >> mappingcharmaj = { chr(97):'A', chr(98):'B', chr(99):'C', ... >> >> And, if I "break" in multiples lines, the problem is solved. This sounds like bug http://www.python.org/sf/1076985 "Incorrect behaviour of StreamReader.readline leads to crash". Are you using a PEP 263 coding header for your script? Bye, Walter Dörwald -- http://mail.python.org/mailman/listinfo/python-list