Paulo da Silva wrote:

> On 12-01-2014 16:23, Peter Otten wrote:
>> Paulo da Silva wrote:
>>
>>> I am using a python3 script to produce a bash script from lots of
>>> filenames got using os.walk.
>>>
>>> I have a template string for each bash command in which I replace a
>>> special string with the filename and then write the command to the bash
>>> script file.
>>>
>>> Something like this:
>>>
>>> shf=open(bashfilename,'w')
>>> filenames=getfilenames() # uses os.walk
>>> for fn in filenames:
>>>     ...
>>>     cmd=templ.replace("<fn>",fn)
>>>     shf.write(cmd)
>>>
>>> For certain filenames I got a UnicodeEncodeError exception at
>>> shf.write(cmd)!
>>> I use utf-8 and have # -*- coding: utf-8 -*- in the source .py.
>>>
>>> How can I fix this?
>>>
>>> Thanks for any help/comments.
>>
>> You make it harder to debug your problem by not giving the complete
>> traceback. If the error message contains 'surrogates not allowed' like in
>> the demo below
>>
>> >>> with open("tmp.txt", "w") as f:
>> ...     f.write("\udcef")
>> ...
>> Traceback (most recent call last):
>>   File "<stdin>", line 2, in <module>
>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in
>> position 0: surrogates not allowed
>
> That is the situation. I just lost it and it would take a few hours to
> repeat the situation. Sorry.
>
>
>>
>> you have filenames that are not valid UTF-8 on your harddisk.
>>
>> A possible fix would be to use bytes instead of str. For that you need to
>> open `bashfilename` in binary mode ("wb") and pass bytes to the os.walk()
>> call.
> This is my 1st time with python3, so I am confused!
>
> As far as I could understand it seems that os.walk is returning the
> filenames exactly as they are on disk. Just bytes like in C.

No, they are decoded with the preferred encoding. With UTF-8 that can fail,
and if it does the surrogateescape error handler replaces the offending
bytes with special codepoints:

>>> import os
>>> with open(b"\xe4\xf6\xfc", "w") as f: f.write("whatever")
...
8
>>> os.listdir()
['\udce4\udcf6\udcfc']

You can bypass the decoding process by providing a bytes argument to
os.listdir() (or os.walk() which uses os.listdir() internally):

>>> os.listdir(b".")
[b'\xe4\xf6\xfc']

To write these raw bytes into a file the file has of course to be binary,
too.

> My template is a string. What is the result of the replace command? Is
> there any change in the filename from os.walk contents?
>
> Now, if the result of the replace has the replaced filename unchanged
> how do I "convert" it to bytes type, without changing its contents, so
> that I can write to the bashfile opened with "wb"?
>
>
>>
>> Or you just go and fix the offending names.
> This is impossible in my case.
> I need a bash script with the names as they are on disk.

I think instead of the hard way sketched out above it will be sufficient to
specify the error handler when opening the destination file

shf = open(bashfilename, 'w', errors="surrogateescape")

but I have not tried it myself. Also, some bytes may need to be escaped,
either to be understood by the shell, or to address security concerns:

>>> import os
>>> template = "ls <fn>"
>>> for filename in os.listdir():
...     print(template.replace("<fn>", filename))
...
ls foo; rm bar

-- 
https://mail.python.org/mailman/listinfo/python-list
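
For what it's worth, here is a minimal sketch that combines the two suggestions above: opening the script file with errors="surrogateescape" and quoting each name for the shell. The function name write_script and its parameters are made up for illustration, shlex.quote() is swapped in as one possible way to do the escaping (the post does not name one), and the "<fn>" placeholder is the one from the original question. It assumes a POSIX system where the filenames came from os.walk() or os.listdir() called with a str argument.

```python
import shlex


def write_script(script_path, template, filenames):
    """Write one shell command per filename (hypothetical helper).

    `template` contains the placeholder "<fn>", as in the original question.
    """
    # errors="surrogateescape" turns lone surrogates such as '\udce4' back
    # into the original undecodable bytes (b'\xe4') when the text is encoded,
    # so the names land in the script exactly as they are on disk.
    # newline="\n" keeps plain LF line endings, which bash expects.
    with open(script_path, "w", encoding="utf-8",
              errors="surrogateescape", newline="\n") as shf:
        for fn in filenames:
            # shlex.quote() wraps the name in single quotes when it contains
            # anything the shell could interpret, e.g. "foo; rm bar".
            cmd = template.replace("<fn>", shlex.quote(fn))
            shf.write(cmd + "\n")
```

With the template "ls <fn>" the name foo; rm bar from the demo above ends up in the script as ls 'foo; rm bar'. Quoting does not help with names that start with a dash; for those the template itself would need something like "ls -- <fn>" or a leading "./".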