On Wed, Feb 26, 2014 at 05:09:49PM +0000, Bob Williams wrote:
> Hi List,
> 
> I have two problems, but it's possible that one solution will suffice.
> I am using a module called mutagen to extract audio metadata from
> .flac files. The output of mutagen is in the form of a dictionary, so
> 
> In [1]: import mutagen.flac
> 
> In [2]: metadata = mutagen.flac.Open("/home/bob/music/artists/The
> Incredible String Band/1967 The 5000 Spirits Or The Layers Of The
> Onion/08 The Hedgehog's Song.flac")
> 
> In [3]: print metadata["artist"]
> [u'The Incredible String Band']

Get rid of the print, which automatically converts whatever you pass it 
to strings and then displays those strings. If you inspect the item 
directly, you will see that the value you have is not a string:

    "[u'The Incredible String Band']"


but a list [ ... ] containing one item, which is a string.

Since you are using an interactive shell, in this case ipython, you can 
drop the call to print, and just enter metadata["artist"] on its own, 
and you'll see *exactly the same output*, without quotation marks on the 
outside. That tells you that what you have is not a string.

(If it were a string, you would see quote marks surrounding it.)

If you're still not convinced, call:

    type(metadata['artist'])

and take note of what it says.
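If you want to try that check without a FLAC file to hand, here is a 
rough sketch. The plain dictionary is only my stand-in for what mutagen 
gives you (an assumption, not mutagen's real object), but indexing by 
key works as in your session:

```python
# Stand-in for the mutagen result (an assumption, so no FLAC file is needed).
metadata = {"artist": [u"The Incredible String Band"]}

value = metadata["artist"]

# The value is a list...
is_list = isinstance(value, list)
# ...and the single item inside it is a text (Unicode) string.
is_text = isinstance(value[0], type(u""))

print(type(value))
print(len(value))
```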

Once you have convinced yourself that it is in fact a list of one item, 
you can extract that item like this:

    item = metadata['artist'][0]

but beware! The fact that mutagen returns a list rather than the string 
directly warns you that sometimes there might be two or more pieces of 
metadata with the same key, e.g.:

    # some imaginary metadata from a hypothetical FLAC file
    [u'The Beatles', u'The Rolling Stones', u'ABBA']

So you need to be prepared to deal with multiple metadata items.
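One possible way to cope, sketched with a made-up helper (join_values is 
my own name, not something mutagen provides), is to join whatever is in 
the list into a single display string. It works the same whether there 
is one item or several:

```python
def join_values(values, separator=u", "):
    """Combine a list of metadata strings into one display string.

    Hypothetical helper for illustration; not part of mutagen.
    """
    return separator.join(values)


one = join_values([u"The Incredible String Band"])
many = join_values([u"The Beatles", u"The Rolling Stones", u"ABBA"])

print(one)
print(many)
```

Whether joining is the right thing for your program depends on what you 
do with the result; you may prefer to keep the list and loop over it.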

One last thing: you *do not* want to get rid of the leading u, trust me 
on this. The u is not actually part of the string itself; it is part of 
the delimiter. What you are seeing is the difference between a Unicode 
text string and a byte-string.

A regular string with "" or '' delimiters consists of a sequence of 
bytes. Bytes, as you probably are aware, are numbers between 0 and 255 
inclusive. You don't usually enter them by their numeric value, though, 
but by their character value. Python gives you two functions for 
converting between the numeric and character values:

    chr(n)  # returns the character of ordinal n
    ord(c)  # returns the ordinal of character c

and uses ASCII for the first 128 ordinal values (0 through 127), and 
some arbitrary and likely unpredictable scheme for the rest.
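A quick sketch of the two functions, sticking to the ASCII range where 
the results are the same everywhere:

```python
# ord() goes from a character to its number...
code = ord("A")

# ...and chr() goes from a number back to the character.
letter = chr(97)

# The two are inverses of each other.
round_trip = chr(ord("z"))

print(code)
print(letter)
print(round_trip)
```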

Byte strings have their uses, but for text, it's not 1960 any longer, 
and there is an entire world filled with people for whom ASCII is not 
enough. (In truth, *even in America*, the ASCII character set was never 
sufficient for all common uses, since it lacks symbols such as ¢.) In 
the 1980s and 90s the world proliferated a confusing mess of dozens of 
alternative character sets, often called "extended ASCII" as if there 
were only one, but fortunately it is now 2014 and the right solution is 
to use Unicode.

Unlike byte-strings, which only contain 256 possible characters, Unicode 
strings can contain over a million distinct characters, numbered between 
U+0000 and U+10FFFF (the number after the U+ is in hexadecimal). It 
contains a dedicated character (technically called a "code point") for 
each and every character included in all of those dozens of legacy 
so-called "extended ASCIIs", plus many more that they never included.
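As a small sketch, you can ask Python for the code point and the 
official name of a character. I have written the characters as \u 
escapes so that the example does not depend on how the file is saved:

```python
import unicodedata

e_acute = u"\u00e9"    # the e-acute in Sinead
o_umlaut = u"\u00f6"   # the o-umlaut in Bjork

# ord() works on Unicode characters too, and gives the code point.
code_points = ["U+%04X" % ord(ch) for ch in (e_acute, o_umlaut)]

# unicodedata.name() gives the official Unicode name of a character.
names = [unicodedata.name(ch) for ch in (e_acute, o_umlaut)]

# The range U+0000 to U+10FFFF holds this many code points in total.
total = 0x10FFFF + 1

print(code_points)
print(names)
print(total)
```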

Unicode strings use delimiters u"" and u'', so as you can see the u is 
*outside* the quote marks: it is part of the delimiter, not part of the 
string. Unicode strings allow metadata to include artists who have 
non-ASCII characters in their names, like Sinéad O'Connor and Björk, as 
well as stylistic "heavy metal umlauts" as used by artists like William 
Ørbit and Blue Öyster Cult. And even totally pretentious wankfests like 
▼□■□■□■, and no I have no idea how that's pronounced.

(Alas, the Love Symbol in The Artist Formerly Known As Love Symbol is 
not available in Unicode, so he'll have to be known as The Artist 
Formerly Known As The Artist Formerly Known As Prince.)

So you should prefer Unicode strings over byte-strings. Apart from the 
leading u prefix, there is practically no difference in how you use 
them. All the usual string methods are available:

py> print(u'Björk'.upper())
BJÖRK

Just be careful about mixing regular '' byte strings and proper u'' 
Unicode text strings. Python 2 tries to do the "smart" thing when you 
combine them, and while that works 9 times out of 10, the tenth time you 
end up even more confused than ever. (Python 3 is far more strict about 
keeping them separate.)
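If you do end up with bytes from somewhere, one way to stay out of 
trouble is to convert explicitly rather than letting Python guess. A 
minimal sketch, assuming the bytes are encoded as UTF-8 (that is my 
assumption for the example; your data may use something else):

```python
# Bytes as they might arrive from a file or a socket (assumed UTF-8).
raw = b"Bj\xc3\xb6rk"

# Decode once, at the edge of the program, into a Unicode text string.
text = raw.decode("utf-8")

# Work with text in the middle...
shout = text.upper()

# ...and encode again only when bytes are needed on the way out.
back = shout.encode("utf-8")

print(len(raw))    # number of bytes
print(len(text))   # number of characters
```

Notice that the byte-string is one longer than the text string, because 
the ö takes two bytes in UTF-8.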


-- 
Steven