Brilliant! It worked. Thanks! Here is the final code for those who are struggling with similar problems:
## open and decode file
# In this case, the encoding comes from the charset argument in a meta tag
# e.g. <meta charset="iso-8859-2">
fileObj = open(filePath, "r")
fileContent = fileObj.read().decode("iso-8859-2")
fileObj.close()
fileSoup = BeautifulSoup(fileContent)

## Do some BeautifulSoup magic and preserve unicode, presume the result is
## saved in 'text'

## write extracted text to file
f = open(outFilePath, 'w')
f.write(text.encode('utf-8'))
f.close()

On Oct 5, 11:40 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> On Wed, 05 Oct 2011 16:35:59 -0700, Greg wrote:
> > Hi, I am having some encoding problems when I first parse stuff from a
> > non-English website using BeautifulSoup and then write the results to a
> > txt file.
>
> If you haven't already read this, you should do so:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> > I have the text both as a normal (text) and as a unicode string (utext):
>
> > print repr(text)
> > 'Branie zak\xc2\xb3adnik\xc3\xb3w'
>
> This is pretty much meaningless, because we don't know how you got the
> text and what it actually is. You're showing us a bunch of bytes, with no
> clue as to whether they are the right bytes or not. Considering that your
> Unicode text is also incorrect, I would say it is *not* right and your
> description of the problem is 100% backwards: the problem is not
> *writing* the text, but *reading* the bytes and decoding them.
>
> You should do something like this:
>
> (1) Inspect the web page to find out what encoding is actually used.
>
> (2) If the web page doesn't know what encoding it uses, or if it uses
> bits and pieces of different encodings, then the source is broken and you
> shouldn't expect much better results. You could try guessing, but you
> should expect mojibake in your results.
>
> http://en.wikipedia.org/wiki/Mojibake
>
> (3) Decode the web page into Unicode text, using the correct encoding.
>
> (4) Do all your processing in Unicode, not bytes.
> (5) Encode the text into bytes using UTF-8 encoding.
>
> (6) Write the bytes to a file.
>
> [...]
>
> > Now I am trying to save this to a file but I never get the encoding
> > right. Here is what I tried (+ lots of different things with encode,
> > decode...):
>
> > outFile = codecs.open(filePath, "w", "UTF8")
> > outFile.write(utext)
> > outFile.close()
>
> That's the correct approach, but it won't help you if utext contains the
> wrong characters in the first place. The critical step is taking the
> bytes in the web page and turning them into text.
>
> How are you generating utext?
>
> --
> Steven

--
http://mail.python.org/mailman/listinfo/python-list
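For anyone reading this thread today: the decode-process-encode recipe in
steps (1)-(6) can be sketched in Python 3, where str is already unicode, so
no codecs.open is needed. The byte string below is a stand-in for the
fetched page (an assumption, not data from the thread), and the .title()
call is just an illustrative "processing" step:

```python
import tempfile, os

# (1)-(3): the page declared <meta charset="iso-8859-2">, so decode with
# that codec. In iso-8859-2, byte 0xb3 is 'ł' and 0xf3 is 'ó'.
raw_bytes = b"Branie zak\xb3adnik\xf3w"        # stand-in for the page bytes
text = raw_bytes.decode("iso-8859-2")          # -> "Branie zakładników"

# (4): do all processing on the unicode string (illustrative step only)
text = text.title()

# (5)+(6): encode the text to UTF-8 bytes and write them to a file
with tempfile.NamedTemporaryFile(delete=False) as out:
    out.write(text.encode("utf-8"))

# Reading the file back with the same encoding round-trips cleanly,
# which is the check the original poster was missing.
with open(out.name, encoding="utf-8") as f:
    restored = f.read()
os.unlink(out.name)
```

If the decode step uses the wrong codec (say latin-1 instead of
iso-8859-2), everything downstream still "works" but silently produces the
mojibake Steven describes, which is why step (1) comes first.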