Re: os.walk the apostrophe and unicode
On Sun, 25 Jun 2017 08:18:45 -0600
Michael Torrie wrote:

> On 06/25/2017 06:19 AM, Rod Person wrote:
> > But doing a simple ls of that directory show it is unicode but the
> > replacement of the offending character.
> >
> > http://rodperson.com/graphics/uc/ls.png
>
> Now that is really strange. Your OS seems to not recognize that the
> filename is in UTF-8. I suspect this has something to do with the NAS
> file sharing protocol (smb). Though I'm pretty sure that Samba can
> handle UTF-8 filenames correctly.
>
> > I am in fact using Python 3.5. I may be lacking in unicode skills
> > but I do have the sense enough to know the version of Python I am
> > invoking. So I included this screenshot of that so the version of
> > Python and the files list returned by os.walk
> >
> > http://rodperson.com/graphics/uc/files.png
>
> If I create a file that has the U+2019 character in it on my Linux
> machine (BtrFS), and do os.walk on it, I see the character in then
> string properly. So it looks like Python does the right thing,
> automatically decoding from UTF-8.
>
> In your situation I think the problem is the file sharing protocol
> that your NAS is using. Somehow some information is being lost and
> your OS does not know that the filenames are in UTF-8, and just
> thinks they are bytes. And therefore Python doesn't know to decode
> the string, so you just end up with each byte being converted to a
> unicode code point and being shoved into the unicode string.
>
> How to get around this issue I don't know. Maybe there's a way to
> convert the unicode string to bytes using the value of each character,
> and then decode that back to unicode.

I think your theory is on the right track. I'm actually attached to the
NAS via NFS, not Samba. From a quick look, it seems the NFS server needs
an option set to pass Unicode correctly... but my NAS software doesn't
give me access to the settings; I can only turn NFS on or off.

Looks like my option is the original one: correct the file name.

-- 
Rod

http://www.rodperson.com

Who at Clitorius fountain thirst remove
Loath Wine and, abstinent, meer Water love.
- Ovid

-- 
https://mail.python.org/mailman/listinfo/python-list
Re: os.walk the apostrophe and unicode
Rod Person wrote:

> Ok...so after reading all the replies in the thread, I thought I would
> be easier to send a general reply and include some links to screenshots.
>
> As Peter mention, the logic thing to do would be to fix the file name
> to what I actually thought it was and if this was for work that
> probably what I would have done, but since I want to understand what's
> going on I decided to waste time on that.
>
> I have to admit, I didn't think the file system was utf-8 as seeing what
> looked to be an apostrophe sent me down the road of why is this
> apostrophe screwed up instead of "ah this must be unicode".
>
> But doing a simple ls of that directory show it is unicode but the
> replacement of the offending character.
>
> http://rodperson.com/graphics/uc/ls.png

Have you set LANG to something that implies ASCII?

$ touch Todd’s ähnlich üblich löblich
$ ls
ähnlich  löblich  Todd’s  üblich
$ LANG=C ls
Todd???s  l??blich  ??hnlich  ??blich
$ python3 -c 'import os; print(os.listdir())'
['Todd’s', 'üblich', 'ähnlich', 'löblich']
$ LANG=C python3 -c 'import os; print(os.listdir())'
['Todd\udce2\udc80\udc99s', '\udcc3\udcbcblich', '\udcc3\udca4hnlich', 'l\udcc3\udcb6blich']
$ LANG=en_US.utf-8 python3 -c 'import os; print(os.listdir())'
['Todd’s', 'üblich', 'ähnlich', 'löblich']

For file names Python resorts to surrogates whenever a byte does not
translate into a character in the advertised encoding.

> I am in fact using Python 3.5. I may be lacking in unicode skills but I
> do have the sense enough to know the version of Python I am invoking.

I've made so many "stupid errors" myself that I always consider them first ;) > So I included this screenshot of that so the version of Python and the > files list returned by os.walk > > http://rodperson.com/graphics/uc/files.png > > So the fact that it shows as a string and not bytes in the debugger was > throwing me for a loop, in my log section I was trying to determine if > it was unicode decode it...if not don't do anything which wasn't working > > http://rodperson.com/graphics/uc/log_section.png > > > > > On Sun, 25 Jun 2017 10:47:18 +0200 > Peter Otten <__pete...@web.de> wrote: > >> Steve D'Aprano wrote: >> >> > On Sun, 25 Jun 2017 04:57 pm, Peter Otten wrote: >> >> >> if everything worked correctly? Though I don't understand why the >> >> OP doesn't see >> >> >> >> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac' >> >> >> >> which is the repr() that I get. >> > >> > That's mojibake and is always wrong :-) >> >> Yes, that's my very point. >> >> > I'm not sure how you got that. >> >> I took the OP's string at face value and pasted it into the >> interpreter: >> >> # python 3.4 >> >>> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in >> >>> Progress).flac' >> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac' >> >> > Something to do with an accidental decode to Latin-1? >> >> If the above filename is the only one or one of a few that seem >> broken, and other non-ascii filenames look OK the OP's >> toolchain/filesystem may work correctly and the odd name might have >> been produced elsewhere, e. g. by copying an already messed-up >> freedb.org entry. >> >> [Heureka] >> >> However, the most likely explanation is that the filename is correct >> and that the OP is not using Python 3 as he claims but Python 2. >> >> Yes, it took that long for me to realise ;) Python 2 is slowly >> sinking into oblivion... >> > > > -- https://mail.python.org/mailman/listinfo/python-list
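
[Editor's note] The surrogate behaviour described in the message above can be reproduced without touching a file system. This is only a minimal sketch, assuming the mechanism is the `surrogateescape` error handler that Python 3 applies to file names; the sample bytes and the variable names are illustrative and not taken from the OP's machine.

```python
# UTF-8 bytes of 'Todd’s', as they would be stored in the directory entry.
raw = "Todd\u2019s".encode("utf-8")

# Decoding with an ASCII codec plus 'surrogateescape' mirrors what
# os.listdir() showed under LANG=C: each undecodable byte 0xNN becomes
# the lone surrogate U+DCNN.
escaped = raw.decode("ascii", errors="surrogateescape")
print(ascii(escaped))   # 'Todd\udce2\udc80\udc99s'

# The mapping is reversible, so the original bytes are not lost.
restored = escaped.encode("ascii", errors="surrogateescape")
print(restored == raw)  # True
```

The escaped form matches the `LANG=C` listing shown above, which suggests the name on disk is fine and only the advertised encoding is wrong.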
Re: os.walk the apostrophe and unicode
On 06/25/2017 06:19 AM, Rod Person wrote:
> But doing a simple ls of that directory show it is unicode but the
> replacement of the offending character.
>
> http://rodperson.com/graphics/uc/ls.png

Now that is really strange. Your OS seems to not recognize that the
filename is in UTF-8. I suspect this has something to do with the NAS
file sharing protocol (smb). Though I'm pretty sure that Samba can
handle UTF-8 filenames correctly.

> I am in fact using Python 3.5. I may be lacking in unicode skills but I
> do have the sense enough to know the version of Python I am invoking.
> So I included this screenshot of that so the version of Python and the
> files list returned by os.walk
>
> http://rodperson.com/graphics/uc/files.png

If I create a file that has the U+2019 character in it on my Linux
machine (BtrFS), and do os.walk on it, I see the character in the
string properly. So it looks like Python does the right thing,
automatically decoding from UTF-8.

In your situation I think the problem is the file sharing protocol
that your NAS is using. Somehow some information is being lost and
your OS does not know that the filenames are in UTF-8, and just
thinks they are bytes. And therefore Python doesn't know to decode
the string, so you just end up with each byte being converted to a
unicode code point and being shoved into the unicode string.

How to get around this issue I don't know. Maybe there's a way to
convert the unicode string to bytes using the value of each character,
and then decode that back to unicode.
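
[Editor's note] The round trip suggested in the last paragraph above can be sketched as follows. It assumes the name really was mis-decoded one byte per code point (Latin-1 style), which the thread does not confirm; `undo_bytewise_decode` is a hypothetical helper name, not a standard-library function.

```python
def undo_bytewise_decode(name):
    """Re-interpret a str whose code points are really UTF-8 byte values."""
    try:
        # latin-1 maps code points 0-255 straight back to bytes 0-255.
        raw = name.encode("latin-1")
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind: leave the name untouched.
        return name


broken = "Todd\xe2\x80\x99s"
print(ascii(undo_bytewise_decode(broken)))  # 'Todd\u2019s'
```

Names that are already correct fall through the `except` branch unchanged, so the helper can be applied to every name in a listing.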
Re: os.walk the apostrophe and unicode
Ok...so after reading all the replies in the thread, I thought it would
be easier to send a general reply and include some links to screenshots.

As Peter mentioned, the logical thing to do would be to fix the file name
to what I actually thought it was, and if this was for work that's
probably what I would have done. But since I want to understand what's
going on, I decided to waste time on that.

I have to admit, I didn't think the file system was utf-8, as seeing what
looked to be an apostrophe sent me down the road of "why is this
apostrophe screwed up" instead of "ah, this must be unicode".

But a simple ls of that directory shows it is unicode, with the
offending character replaced.

http://rodperson.com/graphics/uc/ls.png

I am in fact using Python 3.5. I may be lacking in unicode skills but I
do have sense enough to know the version of Python I am invoking.
So I've included a screenshot showing the version of Python and the
file list returned by os.walk.

http://rodperson.com/graphics/uc/files.png

So the fact that it shows as a string and not bytes in the debugger was
throwing me for a loop. In my log section I was trying to determine
whether it was unicode and, if so, decode it; if not, do nothing. That
wasn't working.

http://rodperson.com/graphics/uc/log_section.png

On Sun, 25 Jun 2017 10:47:18 +0200
Peter Otten <__pete...@web.de> wrote:

> Steve D'Aprano wrote:
>
> > On Sun, 25 Jun 2017 04:57 pm, Peter Otten wrote:
>
> >> if everything worked correctly? Though I don't understand why the
> >> OP doesn't see
> >>
> >> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac'
> >>
> >> which is the repr() that I get.
> >
> > That's mojibake and is always wrong :-)
>
> Yes, that's my very point.
>
> > I'm not sure how you got that.
>
> I took the OP's string at face value and pasted it into the
> interpreter:
>
> # python 3.4
> >>> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>
> > Something to do with an accidental decode to Latin-1?
>
> If the above filename is the only one or one of a few that seem
> broken, and other non-ascii filenames look OK the OP's
> toolchain/filesystem may work correctly and the odd name might have
> been produced elsewhere, e. g. by copying an already messed-up
> freedb.org entry.
>
> [Heureka]
>
> However, the most likely explanation is that the filename is correct
> and that the OP is not using Python 3 as he claims but Python 2.
>
> Yes, it took that long for me to realise ;) Python 2 is slowly
> sinking into oblivion...

-- 
Rod

http://www.rodperson.com
Re: os.walk the apostrophe and unicode
On Sun, 25 Jun 2017 02:23:15 -0700, wxjmfauth wrote:

> On Saturday, 24 June 2017 21:10:47 UTC+2, alister wrote:
>> On Sat, 24 Jun 2017 14:57:21 -0400, Rod Person wrote:
>>
>> > \xe2\x80\x99,
>>
>> because the file name has been created using "Right single quote"
>> instead of apostrophe, the glyphs look identical in many fonts.
>>
>
> Trust me. Fonts are clearly making distinction between \u0027 and
> \u2019.

Not all, and even when they do, it has absolutely nothing to do with the
point of the post.

The character in the file name is \u2019 (right single quotation mark),
not the apostrophe the OP was assuming.

He needs to decode the file name correctly.

-- 
You will be held hostage by a radical group.
Re: os.walk the apostrophe and unicode
Steve D'Aprano wrote: > On Sun, 25 Jun 2017 04:57 pm, Peter Otten wrote: >> if everything worked correctly? Though I don't understand why the OP >> doesn't see >> >> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac' >> >> which is the repr() that I get. > > That's mojibake and is always wrong :-) Yes, that's my very point. > I'm not sure how you got that. I took the OP's string at face value and pasted it into the interpreter: # python 3.4 >>> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac' '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac' > Something to do with an accidental decode to Latin-1? If the above filename is the only one or one of a few that seem broken, and other non-ascii filenames look OK the OP's toolchain/filesystem may work correctly and the odd name might have been produced elsewhere, e. g. by copying an already messed-up freedb.org entry. [Heureka] However, the most likely explanation is that the filename is correct and that the OP is not using Python 3 as he claims but Python 2. Yes, it took that long for me to realise ;) Python 2 is slowly sinking into oblivion... -- https://mail.python.org/mailman/listinfo/python-list
Re: os.walk the apostrophe and unicode
On Sun, 25 Jun 2017 04:57 pm, Peter Otten wrote:

> Steve D'Aprano wrote:
>
>> On Sun, 25 Jun 2017 07:17 am, Peter Otten wrote:
>>
>>> Then I'd fix the name manually...
>>
>> The file name isn't broken.
>>
>>
>> What's broken is parts of the OP's code which assumes that non-ASCII file
>> names are broken...
>
> Hm, the OP says
>
> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>
> Shouldn't it be
>
> '06 - Todd’s Song (Post-Spiderland Song in Progress).flac'

It should, if the OP did everything right. He has a file name containing
the word "Todd’s":

# Python 3.5
py> fname = 'Todd’s'
py> repr(fname)
"'Todd’s'"

On disk, that is represented in UTF-8:

py> repr(fname.encode('utf-8'))
"b'Todd\\xe2\\x80\\x99s'"

The OP appears to be using Python 2, so when he calls os.listdir() he gets
the file names as bytes, not Unicode. That means he'll see:

- the file name will be Python 2 str, which is *byte string* not text string;

- so not Unicode

- rather the individual bytes in the UTF-8 encoding of the file name.

So in Python 2.7 instead of 3.5 above:

py> fname = u'Todd’s'
py> repr(fname)
"u'Todd\\u2019s'"
py> repr(fname.encode('utf-8'))
"'Todd\\xe2\\x80\\x99s'"

> if everything worked correctly? Though I don't understand why the OP doesn't
> see
>
> '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>
> which is the repr() that I get.

That's mojibake and is always wrong :-)

I'm not sure how you got that. Something to do with an accidental decode
to Latin-1?

# Python 2.7
py> repr(fname.encode('utf-8').decode('latin-1'))
"u'Todd\\xe2\\x80\\x99s'"

# Python 3.5
py> repr(fname.encode('utf-8').decode('latin-1'))
"'Toddâ\\x80\\x99s'"

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
Re: os.walk the apostrophe and unicode
Steve D'Aprano wrote: > On Sun, 25 Jun 2017 07:17 am, Peter Otten wrote: > >> Then I'd fix the name manually... > > The file name isn't broken. > > > What's broken is parts of the OP's code which assumes that non-ASCII file > names are broken... Hm, the OP says '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac' Shouldn't it be '06 - Todd’s Song (Post-Spiderland Song in Progress).flac' if everything worked correctly? Though I don't understand why the OP doesn't see '06 - Toddâ\x80\x99s Song (Post-Spiderland Song in Progress).flac' which is the repr() that I get. -- https://mail.python.org/mailman/listinfo/python-list
Re: os.walk the apostrophe and unicode
On Sun, 25 Jun 2017 07:17 am, Peter Otten wrote: > Then I'd fix the name manually... The file name isn't broken. What's broken is parts of the OP's code which assumes that non-ASCII file names are broken... -- Steve “Cheer up,” they said, “things could be worse.” So I cheered up, and sure enough, things got worse. -- https://mail.python.org/mailman/listinfo/python-list
Re: os.walk the apostrophe and unicode
Rod Person wrote: > On Sat, 24 Jun 2017 21:28:45 +0200 > Peter Otten <__pete...@web.de> wrote: > >> Rod Person wrote: >> >> > Hi, >> > >> > I'm working on a program that will walk a file system and clean the >> > id3 tags of mp3 and flac files, everything is working great until >> > the follow file is found >> > >> > '06 - Todd's Song (Post-Spiderland Song in Progress).flac' >> > >> > for some reason that I can't understand os.walk() returns this file >> > name as >> > >> > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in >> > Progress).flac' >> > >> > which then causes more hell than a little bit for me. I'm not >> > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do >> > about it. >> >> >>> b"\xe2\x80\x99".decode("utf-8") >> '’' >> >>> unicodedata.name(_) >> 'RIGHT SINGLE QUOTATION MARK' >> >> So it's '’' rather than "'". >> >> > The script is Python 3, the file system it is running on is a hammer >> > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS >> > which runs some kind of Linux so it probably ext3/4. The files came >> > from various system (Mac, Windows, FreeBSD). >> >> There seems to be a mismatch between the assumed and the actual file >> system encoding somewhere in this mix. Is this the only glitch or are >> there similar problems with other non-ascii characters? >> > > This is the only glitch as in file names so far. > Then I'd fix the name manually... -- https://mail.python.org/mailman/listinfo/python-list
Re: os.walk the apostrophe and unicode
On 2017-06-24 20:47, Rod Person wrote:
> On Sat, 24 Jun 2017 13:28:55 -0600
> Michael Torrie wrote:
>
> > On 06/24/2017 12:57 PM, Rod Person wrote:
> > > Hi,
> > >
> > > I'm working on a program that will walk a file system and clean the
> > > id3 tags of mp3 and flac files, everything is working great until
> > > the follow file is found
> > >
> > > '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
> > >
> > > for some reason that I can't understand os.walk() returns this file
> > > name as
> > >
> > > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in
> > > Progress).flac'
> >
> > That's basically a UTF-8 string there:
> >
> > $ python3
> > >>> a= b'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
> > >>> print (a.decode('utf-8'))
> > 06 - Todd’s Song (Post-Spiderland Song in Progress).flac
> > >>>
> >
> > The NAS is just happily reading the UTF-8 bytes and passing them on
> > the wire.
> >
> > > which then causes more hell than a little bit for me. I'm not
> > > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> > > about it.
> >
> > It's clearly not an apostrophe in the original filename, but probably
> > U+2019 (’)
> >
> > > The script is Python 3, the file system it is running on is a hammer
> > > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS
> > > which runs some kind of Linux so it probably ext3/4. The files came
> > > from various system (Mac, Windows, FreeBSD).
> >
> > It's the file serving protocol that dictates how filenames are
> > transmitted. In your case it's probably smb. smb (samba) is just
> > passing the native bytes along from the file system. Since you know
> > the native file system is just UTF-8, you can just decode every
> > filename from utf-8 bytes into unicode.
>
> This is the impression that I was under, my unicode is that strong, so
> maybe my understand is off...but I tried.
>
>     file_name = file_name.decode('utf-8', 'ignore')
>
> but when I get to my logging code:
>
>     logfile.write(file_name)
>
> that throws the error:
>
>     UnicodeEncodeError: 'ascii' codec can't encode characters in
>     position 39-41: ordinal not in range(128)

Your logfile was opened with the 'ascii' encoding, so you can't write
anything outside the ASCII range.

Open it with the 'utf-8' encoding instead.
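
[Editor's note] A minimal sketch of that suggestion, assuming the log is a plain text file opened with the built-in `open()`. The OP's real log path and file name are not shown in the thread, so `log_path` and `file_name` below are made up for illustration.

```python
import os
import tempfile

file_name = "06 - Todd\u2019s Song.flac"  # illustrative name containing U+2019

# Hypothetical log location inside a throwaway directory.
log_path = os.path.join(tempfile.mkdtemp(), "cleanup.log")

# Naming the encoding explicitly stops open() from falling back to the
# locale's preferred encoding, which can be ASCII (for example under LANG=C).
with open(log_path, "w", encoding="utf-8") as logfile:
    logfile.write(file_name + "\n")

# Read it back with the same encoding to confirm the round trip.
with open(log_path, encoding="utf-8") as logfile:
    print(ascii(logfile.read()))
```

If the log must stay ASCII for some other tool, passing `errors="backslashreplace"` to `open()` is another option, at the cost of escaped names in the log.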
Re: os.walk the apostrophe and unicode
Can os.fsencode and os.fsdecode help? I've seen them somewhere, but I've
never used them.

To fix encodings, sometimes I use the module ftfy.

Greetings
Andre
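
[Editor's note] One way `os.fsencode` might help, sketched under two assumptions: the names come back from `os.walk()` surrogate-escaped (as in the `LANG=C` listing elsewhere in the thread), and the bytes on disk are UTF-8. `recover_name` is a hypothetical helper, and the behaviour described is for POSIX systems.

```python
import os


def recover_name(name, encoding="utf-8"):
    """Turn a surrogate-escaped file name back into readable text."""
    # os.fsencode() reverses the decoding Python applied to the directory
    # entry, giving back the bytes stored on disk.
    raw = os.fsencode(name)
    # Decode with the encoding the names are believed to use (an assumption);
    # 'replace' keeps the call from raising on names that are not UTF-8.
    return raw.decode(encoding, errors="replace")


print(ascii(recover_name("Todd\udce2\udc80\udc99s")))
```

The recovered text is for display and logging; when opening the file, it is safer to keep passing the original name that `os.walk()` returned.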
Re: os.walk the apostrophe and unicode
On Sat, 24 Jun 2017 13:28:55 -0600
Michael Torrie wrote:

> On 06/24/2017 12:57 PM, Rod Person wrote:
> > Hi,
> >
> > I'm working on a program that will walk a file system and clean the
> > id3 tags of mp3 and flac files, everything is working great until
> > the follow file is found
> >
> > '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
> >
> > for some reason that I can't understand os.walk() returns this file
> > name as
> >
> > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in
> > Progress).flac'
>
> That's basically a UTF-8 string there:
>
> $ python3
> >>> a= b'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
> >>> print (a.decode('utf-8'))
> 06 - Todd’s Song (Post-Spiderland Song in Progress).flac
> >>>
>
> The NAS is just happily reading the UTF-8 bytes and passing them on
> the wire.
>
> > which then causes more hell than a little bit for me. I'm not
> > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> > about it.
>
> It's clearly not an apostrophe in the original filename, but probably
> U+2019 (’)
>
> > The script is Python 3, the file system it is running on is a hammer
> > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS
> > which runs some kind of Linux so it probably ext3/4. The files came
> > from various system (Mac, Windows, FreeBSD).
>
> It's the file serving protocol that dictates how filenames are
> transmitted. In your case it's probably smb. smb (samba) is just
> passing the native bytes along from the file system. Since you know
> the native file system is just UTF-8, you can just decode every
> filename from utf-8 bytes into unicode.

This is the impression that I was under. My unicode isn't that strong, so
maybe my understanding is off...but I tried:

    file_name = file_name.decode('utf-8', 'ignore')

but when I get to my logging code:

    logfile.write(file_name)

that throws the error:

    UnicodeEncodeError: 'ascii' codec can't encode characters in
    position 39-41: ordinal not in range(128)

-- 
Rod

http://www.rodperson.com

Who at Clitorius fountain thirst remove
Loath Wine and, abstinent, meer Water love.
- Ovid
Re: os.walk the apostrophe and unicode
On Sat, 24 Jun 2017 21:28:45 +0200 Peter Otten <__pete...@web.de> wrote: > Rod Person wrote: > > > Hi, > > > > I'm working on a program that will walk a file system and clean the > > id3 tags of mp3 and flac files, everything is working great until > > the follow file is found > > > > '06 - Todd's Song (Post-Spiderland Song in Progress).flac' > > > > for some reason that I can't understand os.walk() returns this file > > name as > > > > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in > > Progress).flac' > > > > which then causes more hell than a little bit for me. I'm not > > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do > > about it. > > >>> b"\xe2\x80\x99".decode("utf-8") > '’' > >>> unicodedata.name(_) > 'RIGHT SINGLE QUOTATION MARK' > > So it's '’' rather than "'". > > > The script is Python 3, the file system it is running on is a hammer > > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS > > which runs some kind of Linux so it probably ext3/4. The files came > > from various system (Mac, Windows, FreeBSD). > > There seems to be a mismatch between the assumed and the actual file > system encoding somewhere in this mix. Is this the only glitch or are > there similar problems with other non-ascii characters? > This is the only glitch as in file names so far. -- Rod http://www.rodperson.com Who at Clitorius fountain thirst remove Loath Wine and, abstinent, meer Water love. - Ovid -- https://mail.python.org/mailman/listinfo/python-list
Re: os.walk the apostrophe and unicode
Rod Person wrote: > Hi, > > I'm working on a program that will walk a file system and clean the id3 > tags of mp3 and flac files, everything is working great until the > follow file is found > > '06 - Todd's Song (Post-Spiderland Song in Progress).flac' > > for some reason that I can't understand os.walk() returns this file > name as > > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac' > > which then causes more hell than a little bit for me. I'm not > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do > about it. >>> b"\xe2\x80\x99".decode("utf-8") '’' >>> unicodedata.name(_) 'RIGHT SINGLE QUOTATION MARK' So it's '’' rather than "'". > The script is Python 3, the file system it is running on is a hammer > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which > runs some kind of Linux so it probably ext3/4. The files came from > various system (Mac, Windows, FreeBSD). There seems to be a mismatch between the assumed and the actual file system encoding somewhere in this mix. Is this the only glitch or are there similar problems with other non-ascii characters? -- https://mail.python.org/mailman/listinfo/python-list
Re: os.walk the apostrophe and unicode
On 06/24/2017 12:57 PM, Rod Person wrote: > Hi, > > I'm working on a program that will walk a file system and clean the id3 > tags of mp3 and flac files, everything is working great until the > follow file is found > > '06 - Todd's Song (Post-Spiderland Song in Progress).flac' > > for some reason that I can't understand os.walk() returns this file > name as > > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac' That's basically a UTF-8 string there: $ python3 >>> a= b'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac' >>> print (a.decode('utf-8')) 06 - Todd’s Song (Post-Spiderland Song in Progress).flac >>> The NAS is just happily reading the UTF-8 bytes and passing them on the wire. > which then causes more hell than a little bit for me. I'm not > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do > about it. It's clearly not an apostrophe in the original filename, but probably U+2019 (’) > The script is Python 3, the file system it is running on is a hammer > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which > runs some kind of Linux so it probably ext3/4. The files came from > various system (Mac, Windows, FreeBSD). It's the file serving protocol that dictates how filenames are transmitted. In your case it's probably smb. smb (samba) is just passing the native bytes along from the file system. Since you know the native file system is just UTF-8, you can just decode every filename from utf-8 bytes into unicode. -- https://mail.python.org/mailman/listinfo/python-list
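
[Editor's note] The "decode every filename from utf-8 bytes" idea above can be sketched by asking `os.walk()` for bytes in the first place. This is a hedged illustration, not the OP's code: `walk_decoded` is a hypothetical helper, the UTF-8 default is an assumption about the NAS, and the demo directory is a throwaway created only for the example (POSIX assumed, where file names are raw bytes).

```python
import os
import tempfile


def walk_decoded(top, encoding="utf-8"):
    """Yield file paths under `top` as str, decoding the raw bytes explicitly."""
    # A bytes argument makes os.walk() return bytes, bypassing the
    # locale-dependent decoding Python would otherwise apply.
    for dirpath, _dirnames, filenames in os.walk(os.fsencode(top)):
        for name in filenames:
            yield os.path.join(dirpath, name).decode(encoding, errors="replace")


# Demo: create one file whose name contains the UTF-8 bytes of U+2019.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(os.fsencode(demo_dir), b"Todd\xe2\x80\x99s.flac"), "wb"):
    pass

found = [os.path.basename(p) for p in walk_decoded(demo_dir)]
print(ascii(found))
```

Decoding with `errors="replace"` trades exactness for robustness: a name that is not valid UTF-8 still prints, but the decoded string can no longer be used to open the file.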
Re: os.walk the apostrophe and unicode
On 2017-06-24 19:57, Rod Person wrote:
> Hi,
>
> I'm working on a program that will walk a file system and clean the id3
> tags of mp3 and flac files, everything is working great until the
> follow file is found
>
> '06 - Todd's Song (Post-Spiderland Song in Progress).flac'
>
> for some reason that I can't understand os.walk() returns this file
> name as
>
> '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'
>
> which then causes more hell than a little bit for me. I'm not
> understand why apostrophe(') becomes \xe2\x80\x99, or what I can do
> about it.
>
> The script is Python 3, the file system it is running on is a hammer
> filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which
> runs some kind of Linux so it probably ext3/4. The files came from
> various system (Mac, Windows, FreeBSD).

If you treat it as a bytestring b'\xe2\x80\x99' and decode it:

>>> c = b'\xe2\x80\x99'.decode('utf-8')
>>> ascii(c)
"'\\u2019'"
>>> import unicodedata
>>> unicodedata.name(c)
'RIGHT SINGLE QUOTATION MARK'

It's not an apostrophe, it's '\u2019' ('\N{RIGHT SINGLE QUOTATION MARK}').

It looks like the filename is encoded as UTF-8, but Python thinks that
the filesystem encoding is something like Latin-1.
Re: os.walk the apostrophe and unicode
On Saturday, June 24, 2017 at 12:07:05 PM UTC-7, Rod Person wrote: > Hi, > > I'm working on a program that will walk a file system and clean the id3 > tags of mp3 and flac files, everything is working great until the > follow file is found > > '06 - Todd's Song (Post-Spiderland Song in Progress).flac' > > for some reason that I can't understand os.walk() returns this file > name as > > '06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac' > > which then causes more hell than a little bit for me. I'm not > understand why apostrophe(') becomes \xe2\x80\x99, or what I can do > about it. That's a "right single quotation mark" character in Unicode. http://unicode.scarfboy.com/?s=E28099 Something in your code is choosing to interpret the text variable as an old-fashioned byte array of characters, where every character is represented by a single byte. That works as long as the file name only uses characters from the old ASCII set, but there are only 128 of those. > The script is Python 3, the file system it is running on is a hammer > filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which > runs some kind of Linux so it probably ext3/4. The files came from > various system (Mac, Windows, FreeBSD). Since you are working in Python3, you have the ability to call the .encode() and .decode() methods to translate between Unicode and byte character arrays (which you still need on occasion). > > -- > Rod > > http://www.rodperson.com -- https://mail.python.org/mailman/listinfo/python-list
Re: os.walk the apostrophe and unicode
On Sat, 24 Jun 2017 14:57:21 -0400, Rod Person wrote:

> \xe2\x80\x99,

Because the file name was created using a "right single quote" instead
of an apostrophe; the glyphs look identical in many fonts.

-- 
"If you understand what you're doing, you're not learning anything."
-- A. L.
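
[Editor's note] Since look-alike glyphs are easy to miss by eye, a small check can name every non-ASCII character in a file name. This is a sketch only; `describe_non_ascii` is a hypothetical helper and the sample name is a shortened version of the one in the thread.

```python
import unicodedata


def describe_non_ascii(name):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(name)
        if ord(ch) > 127
    ]


print(describe_non_ascii("06 - Todd\u2019s Song"))
# [(9, 'U+2019', 'RIGHT SINGLE QUOTATION MARK')]
```

An empty list means the name is pure ASCII, so a plain apostrophe (U+0027) would not be reported.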
os.walk the apostrophe and unicode
Hi,

I'm working on a program that will walk a file system and clean the id3
tags of mp3 and flac files. Everything is working great until the
following file is found:

'06 - Todd's Song (Post-Spiderland Song in Progress).flac'

For some reason that I can't understand, os.walk() returns this file
name as

'06 - Todd\xe2\x80\x99s Song (Post-Spiderland Song in Progress).flac'

which then causes more hell than a little bit for me. I don't
understand why the apostrophe (') becomes \xe2\x80\x99, or what I can do
about it.

The script is Python 3, and the file system it is running on is a HAMMER
filesystem on DragonFlyBSD. The audio files reside on a QNAP NAS which
runs some kind of Linux, so it's probably ext3/4. The files came from
various systems (Mac, Windows, FreeBSD).

-- 
Rod

http://www.rodperson.com