I've got a list of word substrings (the "tokens") which I need to align to a string of text (the "sentence"). The sentence is basically the concatenation of the token list, with whitespace sometimes inserted between tokens. I need to determine the start and end offsets of each token in the sentence. For example::

    py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
    py> text = '''\
    ... She's gonna write
    ... a book?'''
    py> list(offsets(tokens, text))
    [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

Here's my current definition of the offsets function::

    py> def offsets(tokens, text):
    ...     start = 0
    ...     for token in tokens:
    ...         while text[start].isspace():
    ...             start += 1
    ...         text_token = text[start:start+len(token)]
    ...         assert text_token == token, (text_token, token)
    ...         yield start, start + len(token)
    ...         start += len(token)
    ...

I feel like there should be a simpler solution (maybe with the re module?) but I can't figure one out. Any suggestions?

STeVe
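For reference, here is a rough sketch of what a version based on the re module might look like. It is only one possible approach, not necessarily simpler: it escapes each token, wraps it in a capturing group, joins the groups with ``\s*``, and reads the offsets from the match spans. The name ``offsets_re`` and the choice to raise ``ValueError`` (rather than assert) on a mismatch are assumptions made for illustration.

```python
import re


def offsets_re(tokens, text):
    """Return (start, end) offsets of each token in text.

    Hypothetical re-based variant of offsets(): tokens must appear in
    order, separated only by optional whitespace.
    """
    # One capturing group per token; re.escape keeps characters such as
    # '?' from being interpreted as regex syntax.
    pattern = r'\s*'.join('(%s)' % re.escape(token) for token in tokens)
    # Allow leading whitespace before the first token as well.
    match = re.match(r'\s*' + pattern, text)
    if match is None:
        raise ValueError('tokens do not align with text')
    # Group 0 is the whole match, so token groups are numbered from 1.
    return [match.span(i) for i in range(1, len(tokens) + 1)]
```

Unlike the generator above, this builds one pattern for the whole sentence and returns a list, so it reports a mismatch only for the sentence as a whole rather than pointing at the first token that fails to line up.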