Bug#449054: exfalso: Better TitleCase function

Javier Kohen Sat, 03 Nov 2007 02:28:27 -0800

On Fri, 02-11-2007 at 18:35 +0100, Erich Schubert wrote:
> Package: exfalso
> Version: 1.0-1
> Severity: normal
> Tags: Patch
> 
> I'm not very happy with the included Title-Case function of
> exfalso/quodlibet. For example, (test) is not being title-cased
> properly, since the t isn't preceded by a space.


As an alternative approach, I'm attaching a version of title that uses
the unicodedata module to find word boundaries. I included some tests,
which show that both Unicode and 8-bit strings are handled according to
Unicode rules (8-bit strings are first decoded with the locale's
preferred encoding; whether Python's Unicode methods also consider the
current locale, I'm honestly not sure). At least it works for your
example. It would be great if Python exported Unicode properties such as
"Word_Break", but it doesn't, so this is a bit of a hack. Still, it
seems to work better than the current algorithm. Feel free to bring more
test cases to my attention.
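
To illustrate the idea, here is a minimal sketch of how unicodedata.category
classifies a few of the characters involved. The helper name is_boundary is
only illustrative; it mirrors the check in the attached script (space
separators and any punctuation category, with the ASCII apostrophe exempted).

```python
import unicodedata

def is_boundary(char):
    """Illustrative helper: space separators and punctuation end a word."""
    if char == u"'":  # ASCII apostrophe is exempted, as in the attachment
        return False
    category = unicodedata.category(char)
    # "Zs" is the space separator; every punctuation category starts with "P"
    return category == "Zs" or category.startswith("P")

# space, parenthesis, hyphen, left guillemet, apostrophe, two letters
for ch in (u" ", u"(", u"-", u"\u00ab", u"'", u"t", u"\u00e4"):
    print("%r %s %s" % (ch, unicodedata.category(ch), is_boundary(ch)))
```

Note that the typographic apostrophe (U+2019) is closing-quote punctuation,
so a check like this treats it as a boundary unless it is exempted as well.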

By the way, Python's Unicode case mapping is incomplete. Capitalizing ß
yields ß, whereas Unicode, following German spelling, defines its
uppercase form as SS (see the first entry in SpecialCasing.txt).
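
One possible workaround, sketched here under assumptions, is a small override
table that is consulted before falling back to the built-in mapping. The names
SPECIAL_UPPER and special_capitalize are hypothetical, and the table holds only
the single entry discussed above, not the full contents of SpecialCasing.txt.

```python
# Hypothetical override table: only the es-zet case mentioned above.
SPECIAL_UPPER = {u"\u00df": u"SS"}

def special_capitalize(word):
    """Upper-case the first character only, checking the overrides first."""
    if not word:
        return word
    first = word[0]
    # Fall back to the interpreter's own mapping for everything else.
    head = SPECIAL_UPPER.get(first, first.upper())
    # The rest of the word is left untouched, unlike str.capitalize().
    return head + word[1:]

result = special_capitalize(u"\u00dfbahn")
```

With this table, result matches the value expected by the commented-out
ßbahn test in the attached script.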

Disclaimer: title casing in QL and Python is oriented towards the
English language (and fits other languages only by coincidence).
Spanish, for instance, uses lower case in titles, except for the first
word and personal names.

Greetings,
-- 
Javier Kohen <[EMAIL PROTECTED]>
ICQ: blashyrkh #2361802
Jabber: [EMAIL PROTECTED]

#!/usr/bin/python
# -*- coding: utf-8 -*-
import unicodedata

def iswbound(char):
    """Returns whether the given character is a word boundary."""
    # Special-case the ASCII apostrophe: it is punctuation, but in song
    # titles it is more commonly used to form the possessive.
    if u"'" == char: return False

    category = unicodedata.category(char)
    # If it's a space separator or punctuation
    return 'Zs' == category or 'P' == category[0]

def utitle(string):
    """Title-case a Unicode string using a less destructive method than str.title."""
    # Only the first character and the first letter right after a word
    # boundary are upper-cased; all other characters are left untouched.
    new_string = string[0].capitalize()
    cap = False
    for s in string[1:]:
        if iswbound(s): cap = True
        elif cap and s.isalpha():
            cap = False
            s = s.capitalize()
        else: cap = False
        new_string += s
    return new_string

from types import UnicodeType
from locale import getpreferredencoding

def title(string):
    """Title-case a string using a less destructive method than str.title."""
    if not string: return ""
    # 8-bit strings are decoded using the locale's preferred encoding.
    if not isinstance(string, UnicodeType):
        string = string.decode(getpreferredencoding())
    return utitle(string)

assert u"Mama's Boy" == title(u"mama's boy")
assert u"Mama’S Boy" == title(u"mama’s boy")
assert u"The A-Sides" == title(u"the a-sides")
assert u"Hello Goodbye" == title(u"hello goodbye")
assert u"HELLO GOODBYE" == title(u"HELLO GOODBYE")
assert u"Hello Goodbye (A Song)" == title(u"hello goodbye (a song)")
assert u"Hello Goodbye \"A Song\"" == title(u"hello goodbye \"a song\"")
assert u"Hello Goodbye „A Song”" == title(u"hello goodbye „a song”")
assert u"Hello Goodbye “A Song”" == title(u"hello goodbye “a song”")
assert u"Hello Goodbye »A Song«" == title(u"hello goodbye »a song«")
assert u"Hello Goodbye «A Song»" == title(u"hello goodbye «a song»")

assert u"Fooäbar" == title(u"fooäbar")
assert u"Los Años Felices" == title(u"los años felices")
assert u"Ñandú" == title(u"ñandú")
# Not a real word, but Python still doesn't capitalize the eszett (ß) properly.
#assert u"SSbahn" == title(u"ßbahn")

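# 8-bit strings: these pass only if the locale's preferred encoding is
# UTF-8, matching the encoding of this file.
assert u"Fooäbar" == title("fooäbar")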
assert u"Los Años Felices" == title("los años felices")
assert u"Ñandú" == title("ñandú")

