Bug#449054: exfalso: Better TitleCase function

Javier Kohen Sat, 03 Nov 2007 10:52:54 -0800

On Sat, 2007-11-03 at 19:21 +0100, Erich Schubert wrote:
> Hi,
> > That's exactly why I mentioned it. Some people love to uppercase
> > everything in any language and I usually find myself fixing it by hand.
> > Ignoring proper names, the conversion is trivial ("s[0].upper() +
> > s[1:].lower()" will do it), but I was a bit lazy to actually write the
> > plug-in...
> 
> That is slightly different from what I had in mind: it actively
> lowercases anything else, ignoring potential proper names.
> I'd have offered  s[0].upper() + s[1:]  as an option.
> Your option does make sense though, maybe as "Remove Title Case".
> (I.e. automatically uppercase the first character, leave the rest alone.)
> For example, a song might be titled "für Elise". In that situation, it
> should be transformed to "Für Elise". Another example is abbreviations.
> "JD's blues" for example should be left unchanged.


What I wrote is exactly what I had in mind. Unlike in German, capitalized
words in languages such as Spanish, French or Norwegian, to name a few,
are rare except at the beginning of a sentence, and titles in those
languages, as in German, follow the regular spelling rules.
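
To make the difference between the two one-liners concrete, here is a
small sketch in Python 3. The function names are made up for
illustration, and slicing with s[:1] instead of s[0] is my own addition
so that an empty string does not raise an error.

```python
def sentence_case(s):
    """Uppercase the first character and lowercase everything else."""
    return s[:1].upper() + s[1:].lower()


def capitalize_first(s):
    """Uppercase the first character and leave the rest untouched."""
    return s[:1].upper() + s[1:]


# The non-destructive variant keeps proper names and abbreviations.
print(capitalize_first("für Elise"))   # Für Elise
print(capitalize_first("JD's blues"))  # JD's blues

# The destructive variant also lowercases them.
print(sentence_case("für Elise"))      # Für elise
print(sentence_case("JD's blues"))     # Jd's blues
```

The first variant fits titles that were over-capitalized by hand, the
second fits titles that only lack the initial capital.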

> Another thing my titlecase function did that yours doesn't is process
> all-uppercase strings.
> I'm aware that occasionally you'll have a track titled "TLA" that
> shouldn't be converted to "Tla". But a very common bad title is "EXAMPLE
> TRACK TITLE", where I'd like to have an easy way to make that into
> "Example Track Title".

My code only attempts to fix faults in QL's current algorithm, which by
design never removes upper case. Your suggestion is trivial to implement
(actually you only have to remove code), but I did not have an argument
to justify the change. We could add some heuristics, such as "if
everything is uppercase, then apply lowercase first, otherwise honor
existing uppercase in the string." What do you think? I'm attaching a
new version of my function that does that.
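
The heuristic on its own can be sketched as follows, in Python 3. The
helper name is hypothetical, and it uses str.isupper() rather than the
comparison in the attached function; the two should agree for any
string that contains at least one letter.

```python
def lowercase_if_shouting(text):
    """Lowercase the text only when every cased character is uppercase.

    Mixed-case input is returned unchanged, so existing capitals are kept.
    """
    return text.lower() if text.isupper() else text


print(lowercase_if_shouting("EXAMPLE TRACK TITLE"))  # example track title
print(lowercase_if_shouting("hello GOODBYE"))        # hello GOODBYE
print(lowercase_if_shouting("TLA"))                  # tla (the known cost)
```

The result is then passed on to the title-casing step, which only ever
adds capitals.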

> Speaking of which, did anyone manage to get the re.LOCALE flag to work?
> I've yet to see a proper way of using that...

I don't know whether Python supports Unicode categories in its regular
expressions. If it does, you could try one (e.g. \P{L}) instead of \W,
since \W only covers the ASCII range by default, as you found out.

In that case, I'm guessing that you won't need that flag (although, to
be honest, not having read the documentation I don't know what it's
supposed to do).
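
A rough way to check this, sketched here under Python 3 (an assumption;
the attached script targets Python 2): as far as I know the standard re
module does not accept \p{L} or \P{L}, but str patterns in Python 3 are
Unicode-aware by default, and Python 2 offers the re.UNICODE flag for
the same purpose. The unicodedata module can stand in for a category
test. The helper name is_letter is made up for illustration.

```python
import re
import unicodedata

text = "los años felices"

# Python 3 str patterns are Unicode-aware, so ñ counts as a word character.
unicode_words = re.split(r"\W+", text)

# re.ASCII restricts \w and \W to the ASCII range, which splits "años" at ñ.
ascii_words = re.split(r"\W+", text, flags=re.ASCII)


def is_letter(char):
    # Approximates the \p{L} class: any Unicode category starting with "L".
    return unicodedata.category(char).startswith("L")


non_letters = [c for c in "a-ñ ’" if not is_letter(c)]

print(unicode_words)  # ['los', 'años', 'felices']
print(ascii_words)    # ['los', 'a', 'os', 'felices']
print(non_letters)    # ['-', ' ', '’']
```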

Regards,
-- 
Javier Kohen <[EMAIL PROTECTED]>
ICQ: blashyrkh #2361802
Jabber: [EMAIL PROTECTED]

#!/usr/bin/python
# -*- coding: utf-8 -*-
import unicodedata

def iswbound(char):
    """Returns whether the given character is a word boundary."""
    category = unicodedata.category(char)
    # If it's a space separator or punctuation
    return 'Zs' == category or 'P' == category[0]

def utitle(string):
    """Title-case a string using a less destructive method than str.title."""
    if string.upper() == string:
        string = string.lower()
    new_string = string[0].capitalize()
    cap = False
    for i in xrange(1, len(string)):
        s = string[i]
        # Special case: an apostrophe in the middle of a word does not
        # start a new word.
        if u"'" == s and string[i-1].isalpha():
            cap = False
        # A word boundary means the next letter gets capitalized.
        elif iswbound(s):
            cap = True
        elif cap and s.isalpha():
            cap = False
            s = s.capitalize()
        else:
            cap = False
        new_string += s
    return new_string

from types import UnicodeType
from locale import getpreferredencoding

def title(string):
    """Title-case a string using a less destructive method than str.title."""
    if not string:
        return u""
    # Byte strings are decoded using the locale's preferred encoding.
    if not isinstance(string, UnicodeType):
        string = string.decode(getpreferredencoding())
    return utitle(string)

assert u"Mama's Boy" == title(u"mama's boy")
# This character is not an apostrophe, it's a right single quotation mark
# (U+2019), so it counts as a word boundary!
assert u"Mama’S Boy" == title(u"mama’s boy")
assert u"The A-Sides" == title(u"the a-sides")
assert u"Hello Goodbye" == title(u"hello goodbye")
assert u"Hello Goodbye" == title(u"HELLO GOODBYE")
assert u"Hello GOODBYE" == title(u"hello GOODBYE")
assert u"Hello G.O.O.D.B.Y.E." == title(u"hello G.O.O.D.B.Y.E.")
assert u"Hello G.O.O.D.B.Y.E." == title(u"HELLO G.O.O.D.B.Y.E.")
assert u"Hello Goodbye (A Song)" == title(u"hello goodbye (a song)")
assert u"Hello Goodbye 'A Song'" == title(u"hello goodbye 'a song'")
assert u"Hello Goodbye \"A Song\"" == title(u"hello goodbye \"a song\"")
assert u"Hello Goodbye „A Song”" == title(u"hello goodbye „a song”")
assert u"Hello Goodbye ‘A Song’" == title(u"hello goodbye ‘a song’")
assert u"Hello Goodbye “A Song”" == title(u"hello goodbye “a song”")
assert u"Hello Goodbye »A Song«" == title(u"hello goodbye »a song«")
assert u"Hello Goodbye «A Song»" == title(u"hello goodbye «a song»")

assert u"Fooäbar" == title(u"fooäbar")
assert u"Los Años Felices" == title(u"los años felices")
assert u"Ñandú" == title(u"ñandú")
# Not a real word, but still Python doesn't capitalize the eszett (ß) properly.
#assert u"SSbahn" == title(u"ßbahn")

assert u"Fooäbar" == title("fooäbar")
assert u"Los Años Felices" == title("los años felices")
assert u"Ñandú" == title("ñandú")
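
For anyone trying this on Python 3, where xrange, the unicode type and
types.UnicodeType are gone, the following is a rough port of the script
above. It is a sketch, not the attached code: the names
is_word_boundary and title_case are mine, it accepts only str, and it
uses str.isupper() and str.upper() in place of the original comparisons.

```python
import unicodedata


def is_word_boundary(char):
    """Return True for space separators (Zs) and any punctuation (P*)."""
    category = unicodedata.category(char)
    return category == "Zs" or category.startswith("P")


def title_case(text):
    """Title-case text without lowercasing existing capitals,
    unless the whole string is uppercase."""
    if not text:
        return ""
    if text.isupper():
        text = text.lower()
    out = [text[0].upper()]
    capitalize_next = False
    # Walk the string as (previous character, current character) pairs.
    for prev, char in zip(text, text[1:]):
        if char == "'" and prev.isalpha():
            # An apostrophe inside a word does not start a new word.
            capitalize_next = False
        elif is_word_boundary(char):
            capitalize_next = True
        elif capitalize_next and char.isalpha():
            char = char.upper()
            capitalize_next = False
        else:
            capitalize_next = False
        out.append(char)
    return "".join(out)


print(title_case("mama's boy"))        # Mama's Boy
print(title_case("HELLO GOODBYE"))     # Hello Goodbye
print(title_case("los años felices"))  # Los Años Felices
```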
