I put together the following module today and would like some feedback on any obvious problems, or even opinions on whether or not it is a good approach.
While collating is not difficult for experienced programmers, I have seen quite a lot of poorly sorted lists in commercial applications, so it seems it would be good to have an easy-to-use, ready-made API for collating.

I tried to make this both easy to use and flexible. My first thought was to target actual uses such as phone directory sorting or library sorting, but using keywords to alter the behavior seemed both easier and more flexible.

I think the regular expressions I used to parse leading and trailing numerals could be improved. They work, but you will probably get inconsistent results if the strings are not well formed. Any suggestions on this would be appreciated.

Should I try to extend it to cover dates and currency sorting? Probably those types should be converted before sorting, but maybe sometimes it's useful not to?

Another variation is collating Dewey decimal strings. It should be easy to add if someone thinks that might be useful.

I haven't tested this in *anything* yet, so don't plug it into production code of any type. I also haven't done any performance testing.

See the doc tests below for examples of how it's used.

Cheers,
Ron Adam


"""
Collate.py

A general purpose configurable collate module.

Collation can be modified with the following keywords:

    CAPS_FIRST          -> Aaa, aaa, Bbb, bbb
    HYPHEN_AS_SPACE     -> Treat hyphens as spaces (don't ignore them)
    UNDERSCORE_AS_SPACE -> Underscores as white space
    IGNORE_LEADING_WS   -> Disregard leading white space
    NUMERICAL           -> Digit sequences as numerals
    COMMA_IN_NUMERALS   -> Allow commas in numerals

* See doctests for examples.

Author: Ron Adam, [EMAIL PROTECTED], 10/18/2006

"""
import re
import locale

locale.setlocale(locale.LC_ALL, '')  # use current locale settings

# The above line may change the string constants from the string
# module. This may have unintended effects if your program
# assumes they are always the ascii defaults.

CAPS_FIRST = 1
NUMERICAL = 2
HYPHEN_AS_SPACE = 4
UNDERSCORE_AS_SPACE = 8
IGNORE_LEADING_WS = 16
COMMA_IN_NUMERALS = 32


class Collate(object):
    """ A general purpose and configurable collator class.
    """
    def __init__(self, flag):
        self.flag = flag

    def transform(self, s):
        """ Transform a string for collating.
        """
        if self.flag & CAPS_FIRST:
            s = s.swapcase()
        if self.flag & HYPHEN_AS_SPACE:
            s = s.replace('-', ' ')
        if self.flag & UNDERSCORE_AS_SPACE:
            s = s.replace('_', ' ')
        if self.flag & IGNORE_LEADING_WS:
            # lstrip(), not strip(): only leading white space is ignored.
            s = s.lstrip()
        if self.flag & NUMERICAL:
            # Split into leading numeral, non-digit text, trailing numeral.
            if self.flag & COMMA_IN_NUMERALS:
                rex = re.compile(r'^(\d*,?\d*\.?\d*)(\D*)(\d*,?\d*\.?\d*)',
                                 re.LOCALE)
            else:
                rex = re.compile(r'^(\d*\.?\d*)(\D*)(\d*\.?\d*)', re.LOCALE)
            slist = rex.split(s)
            for i, x in enumerate(slist):
                if self.flag & COMMA_IN_NUMERALS:
                    x = x.replace(',', '')
                try:
                    slist[i] = float(x)
                except ValueError:
                    # Not a numeral, so collate it as text.
                    slist[i] = locale.strxfrm(x)
            return slist
        return locale.strxfrm(s)

    def __call__(self, s):
        """
        This allows a Collate instance to work as a sort key.

        USE: list.sort(key=Collate(flags))
        """
        return self.transform(s)


def collate(slist, flags=0):
    """ Collate list of strings in place.
    """
    slist.sort(key=Collate(flags))


def collated(slist, flags=0):
    """ Return a collated list of strings.

    This is a decorate-sort-undecorate collate.
    """
    collator = Collate(flags)
    dd = [(collator.transform(x), x) for x in slist]
    dd.sort()
    return [x for (key, x) in dd]


def _test():
    """
    DOC TESTS AND EXAMPLES:

    Sort (and sorted) normally order all words beginning with caps
    before all words beginning with lower case.

        >>> t = ['tuesday', 'Tuesday', 'Monday', 'monday']
        >>> sorted(t)     # regular sort
        ['Monday', 'Tuesday', 'monday', 'tuesday']

    Locale collation puts words beginning with caps after words
    beginning with lower case of the same letter.

        >>> collated(t)
        ['monday', 'Monday', 'tuesday', 'Tuesday']

    The CAPS_FIRST option can be used to put all words beginning
    with caps before words beginning in lowercase of the same letter.

        >>> collated(t, CAPS_FIRST)
        ['Monday', 'monday', 'Tuesday', 'tuesday']

    The HYPHEN_AS_SPACE option causes hyphens to be equal to space.

        >>> t = ['a-b', 'b-a', 'aa-b', 'bb-a']
        >>> collated(t)
        ['aa-b', 'a-b', 'b-a', 'bb-a']

        >>> collated(t, HYPHEN_AS_SPACE)
        ['a-b', 'aa-b', 'b-a', 'bb-a']

    The IGNORE_LEADING_WS and UNDERSCORE_AS_SPACE options can be
    used together to improve ordering in some situations.

        >>> t = ['sum', '__str__', 'about', ' round']
        >>> collated(t)
        [' round', '__str__', 'about', 'sum']

        >>> collated(t, IGNORE_LEADING_WS)
        ['__str__', 'about', ' round', 'sum']

        >>> collated(t, UNDERSCORE_AS_SPACE)
        [' round', '__str__', 'about', 'sum']

        >>> collated(t, IGNORE_LEADING_WS|UNDERSCORE_AS_SPACE)
        ['about', ' round', '__str__', 'sum']

    The NUMERICAL option orders leading and trailing digits as numerals.

        >>> t = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
        >>> collated(t, NUMERICAL)
        ['4abc', '13.5b', '20abc', 'a5', 'a10.2', 'a40', 'b2']

    The COMMA_IN_NUMERALS option ignores commas instead of using them
    to separate numerals.

        >>> t = ['a5', 'a4,000', '500b', '100,000b']
        >>> collated(t, NUMERICAL|COMMA_IN_NUMERALS)
        ['500b', '100,000b', 'a5', 'a4,000']

    Collating also can be done in place using collate() instead of
    collated().

        >>> t = ['Fred', 'Ron', 'Carol', 'Bob']
        >>> collate(t)
        >>> t
        ['Bob', 'Carol', 'Fred', 'Ron']

    """
    import doctest
    doctest.testmod()


if __name__ == '__main__':
    _test()
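
On the question of improving the numeral parsing: one possible direction is to split on every run of digits rather than only leading and trailing ones, and to tag each piece so that numbers and text are never compared with each other directly. The following is only a minimal sketch under those assumptions. The names numeric_key and _DIGITS are hypothetical and not part of Collate.py, and the sketch leaves out locale.strxfrm and the comma option to stay short.

```python
import re

# Hypothetical helper, not part of Collate.py.
# One capturing group, so re.split() alternates text, numeral, text, ...
_DIGITS = re.compile(r'(\d+(?:\.\d+)?)')


def numeric_key(s):
    """Return a sort key in which digit runs compare by numeric value."""
    key = []
    for i, chunk in enumerate(_DIGITS.split(s)):
        if not chunk:
            continue            # skip empty pieces at the string edges
        if i % 2:               # odd positions hold the captured numerals
            key.append((0, float(chunk), ''))
        else:                   # even positions hold the text in between
            key.append((1, 0.0, chunk))
    return key


words = ['a5', 'a40', '4abc', '20abc', 'a10.2', '13.5b', 'b2']
result = sorted(words, key=numeric_key)
```

Each piece becomes a tuple whose first field says whether it is a numeral (0) or text (1), so numerals sort ahead of text and the comparison never mixes floats with strings. Numerals embedded in the middle of a string, as in 'file10part2', are handled the same way as leading and trailing ones.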