Hello all, some of the questions I'll ask below have almost certainly been discussed already; I just hope someone is kind enough to answer them again to help me out.
I started a Python 2.3 script that grabs some pages from the web, parses the data with regexes and stores it locally in an XML file for further use. At first I had no problem using Python's minidom, and everything concerning my regex/XML processing worked fine, until I tested my tool on a French page with non-ASCII characters and my script started to throw errors all over the place. I've looked into the matter and discovered the unicode / string encoding processes involved when dealing with non-ASCII text, and I must say I almost lost my mind. I'm losing it, actually.

So here are the few questions I'd like answers to:

1. When fetching a web page from the net, how am I supposed to know how it's encoded? And can I decode it to unicode and encode it back to a byte string so I can use it in my code with the charset I want, like utf-8?

2. In the same vein, could anyone post the few lines that would actually parse an XML file containing non-ASCII chars with minidom (parseString, I guess), then convert a string grabbed from the net so parts of it can be inserted into that DOM object, in new or existing nodes, and finally write that DOM object back to a file in a way that it can be used again later by the same script? I've been trying to do that for a few days with no luck. I can do each separate part of the job (not that I'm quite sure how I decode/encode stuff in there), but as soon as I try to do everything at the same time I get encoding errors thrown all the time.

3. To help me understand what's going on when doing encodes/decodes, could you please tell me whether, in the following example, s and backToBytes are actually the same thing?

       s = "hello normal string"
       u = unicode(s, "utf-8")
       backToBytes = u.encode("utf-8")

   I know they are both byte strings, but I doubt they actually have the same content.

4. I've also tried to set Python's default encoding for my script using sys.setdefaultencoding('utf-8'), but it keeps telling me that this module does not have that method. I'm left with no choice but to edit the site.py file manually to change "ascii" to "utf-8", but I won't be able to do that on the client computers. Anyway, I don't know if it would help my script at all.

Any help will be greatly appreciated.

thx
Marc

-- 
http://mail.python.org/mailman/listinfo/python-list
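PS: to make questions 2 and 3 concrete, here is a minimal sketch of the round trip I'm after. It is not a claim about the right way to do it. The sample XML, the fetched bytes, the names add_item and read_items, and the output file name are all made up for illustration, and the page is simply ASSUMED to be latin-1 (in practice the charset would have to come from the HTTP Content-Type header or the page's own declaration). It only relies on decode/encode, minidom.parseString and toxml with an encoding argument.

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical inputs: an existing UTF-8 XML document, and bytes fetched
# from a page that is ASSUMED (not detected) to be latin-1 encoded.
XML_BYTES = (u'<?xml version="1.0" encoding="utf-8"?>'
             u'<items><item>\xe9t\xe9</item></items>').encode("utf-8")
FETCHED_BYTES = u"fran\xe7ais".encode("latin-1")


def add_item(xml_bytes, fetched_bytes, source_charset):
    """Parse xml_bytes, append the fetched text as a new <item>, return UTF-8 bytes."""
    doc = minidom.parseString(xml_bytes)
    text = fetched_bytes.decode(source_charset)  # bytes -> unicode, once, on input
    node = doc.createElement("item")
    node.appendChild(doc.createTextNode(text))   # the DOM only ever sees unicode
    doc.documentElement.appendChild(node)
    return doc.toxml("utf-8")                    # unicode -> bytes, once, on output


def read_items(xml_bytes):
    """Return the text of every <item> element as unicode strings."""
    doc = minidom.parseString(xml_bytes)
    return [n.firstChild.data for n in doc.getElementsByTagName("item")]


# Question 3: for this ASCII-only string, decoding and re-encoding with the
# same charset compares equal to the original bytes.
s = u"hello normal string".encode("utf-8")
back_to_bytes = s.decode("utf-8").encode("utf-8")
same = (s == back_to_bytes)

# Question 2: write the updated document to a file, then parse it again.
out_path = os.path.join(tempfile.mkdtemp(), "items.xml")
f = open(out_path, "wb")  # binary mode: toxml("utf-8") already returned bytes
f.write(add_item(XML_BYTES, FETCHED_BYTES, "latin-1"))
f.close()

f = open(out_path, "rb")
items = read_items(f.read())
f.close()
```

If I understand it correctly, the idea would be to decode exactly once when the data comes in, keep only unicode inside the DOM, and encode exactly once when writing the file, so that nothing ever depends on the default "ascii" encoding.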