On Wed, Feb 26, 2014 at 05:09:49PM +0000, Bob Williams wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi List,
>
> I have two problems, but it's possible that one solution will suffice.
> I am using a module called mutagen to extract audio metadata from
> .flac files. The output of mutagen is in the form of a dictionary, so
>
> In [1]: import mutagen.flac
>
> In [2]: metadata = mutagen.flac.Open("/home/bob/music/artists/The
> Incredible String Band/1967 The 5000 Spirits Or The Layers Of The
> Onion/08 The Hedgehog's Song.flac")
>
> In [3]: print metadata["artist"]
> [u'The Incredible String Band']

Get rid of the print, which automatically converts whatever you pass it to a string and then displays that string. If you inspect the item directly, you will see that the value you have is not a string:

    "[u'The Incredible String Band']"

but a list [ ... ] containing one item, which is a string.

Since you are using an interactive shell, in this case ipython, you can drop the call to print and just enter

    metadata["artist"]

on its own, and you'll see *exactly the same output*, without quotation marks on the outside. That tells you that what you have is not a string. (If it were a string, you would see quote marks surrounding it.) If you're still not convinced, call:

    type(metadata['artist'])

and take note of what it says.

Once you have convinced yourself that it is in fact a list of one item, you can extract that item like this:

    item = metadata['artist'][0]

but beware! The fact that mutagen returns a list rather than the string directly warns you that sometimes there might be two or more pieces of metadata with the same key, e.g.:

    # some imaginary metadata from a hypothetical FLAC file
    [u'The Beatles', u'The Rolling Stones', u'ABBA']

So you need to be prepared to deal with multiple metadata items.

One last thing: you *do not* want to get rid of the leading u, trust me on this. The u is not actually part of the string itself; it is part of the delimiter. What you are seeing is the difference between a Unicode text string and a byte-string.

A regular string with "" or '' delimiters consists of a sequence of bytes. Bytes, as you probably are aware, are numbers between 0 and 255 inclusive. But you don't enter them using their numeric value; you enter them by their character value. Python gives you two functions for converting between the numeric and character values:

    chr(n)  # returns the character of ordinal n
    ord(c)  # returns the ordinal of character c

and uses ASCII for the first 128 ordinal values (0 through 127), and some arbitrary and likely unpredictable scheme for the rest.

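To make the list handling and the chr/ord pair concrete, here is a minimal sketch. It does not use mutagen at all: plain dicts stand in for the metadata object, and the sample data and the helper names first_value and joined_values are made up for illustration. It is written to run under Python 3 (where the u prefix is still accepted); under Python 2 the reprs would show the leading u.

```python
# Plain dicts standing in for the metadata object: each key maps to a
# *list* of text strings. (Hypothetical sample data, not read from a file.)
metadata = {u"artist": [u"The Incredible String Band"]}
imaginary = {u"artist": [u"The Beatles", u"The Rolling Stones", u"ABBA"]}


def first_value(tags, key, default=None):
    """Return the first value stored under key, or default if there is none."""
    values = tags.get(key, [])
    return values[0] if values else default


def joined_values(tags, key, sep=u"; "):
    """Join every value stored under key into a single text string."""
    return sep.join(tags.get(key, []))


# The value is a list, not a string.
print(type(metadata[u"artist"]).__name__)    # list

# One value: take the first item.
print(first_value(metadata, u"artist"))      # The Incredible String Band

# Several values: decide explicitly what to do with them, e.g. join them.
print(joined_values(imaginary, u"artist"))

# ord() and chr() convert between a character and its ordinal.
print(ord(u"A"))    # 65
print(chr(65))      # A
```

Whether you want the first value, all of them joined, or something else depends on what you do with the result; the point is only that the code has to choose.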
Byte strings have their uses, but for text, it's not 1960 any longer, and there is an entire world filled with people for whom ASCII is not enough. (In truth, *even in America*, the ASCII character set was never sufficient for all common uses, since it lacks symbols such as ¢.) In the 1980s and 90s the world saw a confusing mess of dozens of alternative character sets proliferate, often called "extended ASCII" as if there were only one, but fortunately it is now 2014 and the right solution is to use Unicode.

Unlike byte-strings, which have only 256 possible characters, Unicode strings can contain over a million distinct characters, numbered between U+0000 and U+10FFFF (the number after the U+ is in hexadecimal). Unicode has a dedicated character (technically called a "code point") for each and every character included in all of those dozens of legacy so-called "extended ASCIIs", plus many more that they never included.

Unicode strings use delimiters u"" and u'', so as you can see the u is *outside* the quote marks; it is part of the delimiter, not part of the string.

Unicode strings allow metadata to include artists who have non-ASCII characters in their names, like Sinéad O'Connor and Björk, as well as stylistic "heavy metal umlauts" as used by artists like William Ørbit and Blue Öyster Cult. And even totally pretentious wankfests like ▼□■□■□■, and no I have no idea how that's pronounced.

(Alas, the Love Symbol in The Artist Formerly Known As Love Symbol is not available in Unicode, so he'll have to be known as The Artist Formerly Known As The Artist Formerly Known As Prince.)

So you should prefer Unicode strings over byte-strings. Apart from the leading u prefix, there is practically no difference in how you use them. All the usual string methods are available:

    py> print(u'Björk'.upper())
    BJÖRK

Just be careful about mixing regular '' byte strings and proper u'' Unicode text strings.
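As a rough illustration of code points, and of why mixing the two kinds of string needs care, here is a small sketch. It is written for Python 3, where text and bytes are separate types; the variable names are only for illustration, and the ö is spelled as the escape \u00f6 so the snippet does not depend on the file's encoding.

```python
name = u"Bj\u00f6rk"    # Björk, with the o-umlaut written as an escape

# The usual string methods work on non-ASCII text.
shouted = name.upper()
print(shouted)    # BJÖRK

# Each character is one code point; the o-umlaut is U+00F6.
code_points = [u"U+%04X" % ord(ch) for ch in name]
print(code_points)

# Encoding turns text into bytes; decoding turns bytes back into text.
encoded = name.encode("utf-8")
decoded = encoded.decode("utf-8")
print(len(name), len(encoded))    # 5 6

# Combining text with bytes directly fails. Python 3 raises TypeError;
# UnicodeError is also caught here because that is my understanding of
# how Python 2's implicit ASCII decoding fails on these particular bytes.
try:
    mixed = name + encoded
except (TypeError, UnicodeError):
    mixed = None
print(mixed)    # None
```

The safe habit is to decode bytes into text as soon as they come in, work with text throughout, and encode only when writing out.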
When you combine byte strings and Unicode strings, Python 2 tries to do the "smart" thing, and while that works 9 times out of 10, the tenth time you end up more confused than ever. (Python 3 is far more strict about keeping them separate.)

-- 
Steven