On Wednesday 25 August 2010, it occurred to Jed to exclaim:
> Hi, I'm seeking help with a fairly simple string processing task.
> I've simplified what I'm actually doing into a hypothetical
> equivalent.
> Suppose I want to take a word in Spanish, and divide it into
> individual letters.  The problem is that there are a few 2-character
> combinations that are considered single letters in Spanish - for
> example 'ch', 'll', 'rr'.
> Suppose I have:
>
> # this would include the whole alphabet but I shortened it here
> alphabet = ['a','b','c','ch','d','u','r','rr','o']
> theword = 'churro'
>
> I would like to split the string 'churro' into a list containing:
>
> 'ch','u','rr','o'
>
> So at each letter I want to look ahead and see if it can be combined
> with the next letter to make a single 'letter' of the Spanish
> alphabet.  I think this could be done with a regular expression
> passing the list called "alphabet" to re.match() for example, but I'm
> not sure how to use the contents of a whole list as a search string in
> a regular expression, or if it's even possible.  My real application
> is a bit more complex than the Spanish alphabet so I'm looking for a
> fairly general solution.

A very simple solution that might be general enough:

>>> def tokensplit(string, bits):
...     while string:
...         for b in bits:
...             if string.startswith(b):
...                 yield b
...                 string = string[len(b):]
...                 break
...         else:
...             raise ValueError("string not composed of the right bits.")
...
>>>
>>> alphabet = ['a','b','c','ch','d','u','r','rr','o']
>>> # move longer letters to the front
>>> alphabet.sort(key=len, reverse=True)
>>>
>>> list(tokensplit("churro", alphabet))
['ch', 'u', 'rr', 'o']
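
Since the question asked whether the whole list can be handed to a regular expression: it can, roughly, by joining the escaped entries into one alternation. What follows is only a sketch of that idea, not a tested general solution. The name regex_split is made up for this example, and it assumes, as above, that trying longer letters first is the behaviour you want.

```python
import re

def regex_split(word, letters):
    """Split word into tokens drawn from letters, longest match first."""
    # Drop empty strings and put longer entries first, so that 'ch' is
    # tried before 'c' (regex alternation takes the first branch that fits).
    ordered = sorted((s for s in letters if s), key=len, reverse=True)
    # re.escape keeps entries such as '.' or '+' from acting as regex syntax.
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    tokens = []
    pos = 0
    while pos < len(word):
        m = pattern.match(word, pos)
        # A missing or zero-length match means no entry fits at this position.
        if m is None or m.end() == pos:
            raise ValueError("string not composed of the right bits.")
        tokens.append(m.group())
        pos = m.end()
    return tokens

alphabet = ['a', 'b', 'c', 'ch', 'd', 'u', 'r', 'rr', 'o']
print(regex_split("churro", alphabet))
```

Like the startswith() version, this is greedy and never backtracks over an earlier token. With bits such as 'ab', 'a' and 'bc', the word 'abc' is rejected even though 'a' + 'bc' would fit, so a more complex alphabet may need a smarter search.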