On Thursday, 6 June 2013 at 1:24:16 PM UTC+3, Cameron Simpson wrote:
> On 05Jun2013 11:43, Νίκος Γκρ33κ <nikos.gr...@gmail.com> wrote:
> | On Wednesday, 5 June 2013 at 9:32:15 PM UTC+3, MRAB wrote:
> | > Using Python, I think you could get the filenames using os.listdir,
> | > passing the directory name as a bytestring so that it'll return the
> | > names as bytestrings.
> | >
> | > Then, for each name, you could decode from its current encoding and
> | > encode to UTF-8 and rename the file, passing the old and new paths to
> | > os.rename as bytestrings.
> |
> | I am not sure I follow:
> |
> | Change this:
> |
> | # Compute a set of current fullpaths
> | fullpaths = set()
> | path = "/home/nikos/public_html/data/apps/"
> |
> | for root, dirs, files in os.walk(path):
> [...]
>
> Have a read of this:
>
>   http://docs.python.org/3/library/os.html#os.listdir
>
> The UNIX API accepts bytes for filenames and paths.
>
> Python 3 strs are sequences of Unicode code points. If you try to
> open a file or directory on a UNIX system using a Python str, that
> string must be converted to a sequence of bytes before being handed
> to the OS.
>
> This is done implicitly using your locale settings if you just use a str.
>
> However, if you pass a bytes to open or listdir, this conversion
> does not take place. You put bytes in and in the case of listdir
> you get bytes out.
>
> You can work on pathnames in bytes and never concern yourself with
> encode/decode at all.
>
> In this way you can write code that does not care about the translation
> between Unicode and some arbitrary byte encoding.
>
> Of course, the issue will still arise when accepting user input;
> your shell has done exactly this kind of thing when you renamed
> your MP3 file. But it is possible to write pure utility code that
> doesn't care about filenames as Unicode or str if you work purely
> in bytes.
>
> Regarding user filenames, the common policy these days is to use
> utf-8 throughout. Of course you need to get everything into that
> regime to start with.
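
For reference, the approach MRAB and Cameron describe could be sketched roughly as follows. This is only a sketch under assumptions that are not in the thread: the function name rename_to_utf8 is made up, and the default source encoding iso-8859-7 is a guess that would have to be replaced with whatever encoding the badly named files actually use. Everything stays in bytes, so Python never tries to decode the names with the locale settings.

```python
import os


def rename_to_utf8(directory: bytes, source_encoding: str = "iso-8859-7") -> list:
    """Rename entries of `directory` whose names are not valid UTF-8.

    `directory` must be bytes, so that os.listdir returns bytes names.
    `source_encoding` is an assumption about how the old names were encoded.
    Returns a list of (old_name, new_name) pairs, both as bytes.
    """
    renamed = []
    for name in os.listdir(directory):  # bytes in, bytes out
        try:
            name.decode("utf-8")
            continue  # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass

        # Reinterpret the raw bytes in the assumed legacy encoding,
        # then re-encode the resulting text as UTF-8.
        new_name = name.decode(source_encoding).encode("utf-8")

        old_path = os.path.join(directory, name)
        new_path = os.path.join(directory, new_name)
        if os.path.lexists(new_path):
            continue  # do not overwrite an existing file

        os.rename(old_path, new_path)  # both paths are bytes
        renamed.append((name, new_name))
    return renamed
```

It would be called with a bytes path, for example rename_to_utf8(b"/home/nikos/public_html/data/apps/"). It handles a single directory level only; os.walk also accepts a bytes top directory and then yields bytes names, so the same idea extends to a whole tree.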

So I need to use os.listdir() to grab those filenames as bytes. Okay.

So, by changing this:

    fullpaths = set()
    path = "/home/nikos/public_html/data/apps/"

    for root, dirs, files in os.walk(path):
        for fullpath in files:
            fullpaths.add( os.path.join(root, fullpath) )

to this:

    # Compute a set of current fullpaths
    fullpaths = os.listdir( '/home/nikos/public_html/data/apps/' )

    # Load'em
    for fullpath in fullpaths:
        try:
            # Check the presence of a file against the database and insert if it doesn't exist
            cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
            data = cur.fetchone()        # URL is unique, so there should only be one

-----------------------------

[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] Original exception was:
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] Traceback (most recent call last):
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173]   File "files.py", line 67, in <module>
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173]     cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,) )
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173]   File "/usr/local/lib/python3.3/site-packages/PyMySQL3-0.5-py3.3.egg/pymysql/cursors.py", line 108, in execute
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173]     query = query.encode(charset)
[Thu Jun 06 14:15:38 2013] [error] [client 79.103.41.173] UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc5' in position 35: surrogates not allowed

--
http://mail.python.org/mailman/listinfo/python-list
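
A note on the traceback above, as far as I understand it: the call shown passes a str to os.listdir, so the names come back as str as well. Python 3 decodes them with the filesystem encoding and the surrogateescape error handler (PEP 383), which turns each undecodable byte 0xNN into the lone surrogate U+DCNN. So '\udcc5' stands for a raw byte 0xC5 in a filename, and a strict UTF-8 encode, such as the query.encode(charset) call in the traceback, rejects it. The sketch below reproduces this without touching the filesystem; the helper name has_surrogate_escapes and the sample name are made up for illustration.

```python
def has_surrogate_escapes(name: str) -> bool:
    """Return True if `name` cannot be encoded as strict UTF-8.

    For a str returned by os.listdir, that means the on-disk name held
    bytes that were not valid in the filesystem encoding.
    """
    try:
        name.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True


# A lone byte 0xC5 is not valid UTF-8 (it needs a continuation byte),
# so surrogateescape maps it to the lone surrogate U+DCC5.
raw = b"\xc5.mp3"
as_str = raw.decode("utf-8", "surrogateescape")   # '\udcc5.mp3'

# Strict encoding fails with "surrogates not allowed"; encoding with
# surrogateescape gives back the original bytes unchanged.
round_trip = as_str.encode("utf-8", "surrogateescape")   # b'\xc5.mp3'
```

On a real system, os.fsencode performs that reverse step with the filesystem encoding, which is one way to get the original bytes back from such a str. Passing the directory as bytes to os.listdir avoids the intermediate str altogether.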