Informative, thanks Jerry. However, I'm not out of the woods yet.

> Here's a couple of questions that you'll need to answer 'Yes' to
> before you're going to get this to work reliably:
>
> Are you familiar with the differences between byte strings and unicode
> strings?

I think so, although I'm probably missing key bits of information.

> Do you understand how to convert from one to the other,
> using a particular encoding?

No, not really. This is something that's still very new to me.

> Do you know what encoding your source
> file is saved in?

The name of the file I'm trying to open comes from a UTF-16 encoded text
file; I then use a regex to extract the string (filename) I need to open.
However, all the examples I've been using here are just typed into the
Python console, so the source of the string is largely irrelevant at this
stage.

> If your string is not coming from a source file,
> but some other source of bytes, do you know what encoding those bytes
> are using?
>
> Try the following. Before trying to convert filename to unicode, do a
> "print repr(filename)". That will show you the byte string, along
> with the numeric codes for the non-ascii parts. Then convert those
> bytes to a unicode object using the appropriate encoding. If the
> bytes are utf-8, then you'd do something like this:
> unicode_filename = unicode(filename, 'utf-8')

>>> print(repr(filename))
"This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js"
>>> fileName = unicode(filename, 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 35: invalid start byte
>>> fileName = unicode(filename, 'utf-16')
>>> fileName
u'\u6854\u7369\u6920\u5f73\u2d61\u6574\u7473\u4627\u4c49\u2045\u6f74\u4520\u736e\u7275\u2465\u7420\u6168\u9c74\u7320\u7574\u6666\u205e\u6f77\u6b72\u2e73\u7874\u2e74\u736a'

So I now have a UTF-16 encoded string, but I still can't open it.

>>> codecs.open(fileName, 'r', 'utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\codecs.py", line 881, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: u'\u6854\u7369\u6920\u5f73\u2d61\u6574\u7473\u4627\u4c49\u2045\u6f74\u4520\u736e\u7275\u2465\u7420\u6168\u9c74\u7320\u7574\u6666\u205e\u6f77\u6b72\u2e73\u7874\u2e74\u736a'

I presume I need to perform some kind of decode operation on it to open
the file, but then am I not basically going back to my starting point?

Apologies if I'm missing the obvious.

--
James
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
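
A rough sketch of what the console session above appears to show, offered as a guess rather than a diagnosis. Decoding the 56-byte string as UTF-16 pairs the bytes up into 28 unrelated code points, which would explain why no file with that name exists. The sketch assumes the bytes were typed at a Windows console using code page 850, where 0x9c is the pound sign; that code page is an assumption (sys.stdin.encoding would report the real one), and 'utf-16-le' stands in for the platform-dependent 'utf-16'. It should run on both Python 2.7 and Python 3.

```python
# The byte string exactly as repr(filename) printed it in the session above.
raw = b"This is_a-test'FILE to Ensure$ that\x9c stuff^ works.txt.js"

# Wrong codec, case 1: UTF-8 rejects the lone 0x9c byte outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Wrong codec, case 2: UTF-16 does not fail, it silently glues each pair
# of bytes into one code point, so the result is a different name entirely.
garbage = raw.decode("utf-16-le")

# Assumption: the console used code page 850, so 0x9c is the pound sign.
name = raw.decode("cp850")

print(utf8_ok)
print(len(raw), len(garbage))
print(repr(name))
```

If the assumption holds, `name` is the unicode string to pass to open(). The 'utf-16' argument to codecs.open() describes the encoding of the file's contents, not the encoding of its name.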