Michael Spencer wrote:
> Steven Bethard wrote:
>
>> I've got a list of word substrings (the "tokens") which I need to
>> align to a string of text (the "sentence"). The sentence is basically
>> the concatenation of the token list, with spaces sometimes inserted
>> between tokens. I need to determine the start and end offsets of
>> each token in the sentence. For example::
>>
>>     py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
>>     py> text = '''\
>>     ... She's gonna write
>>     ... a book?'''
>>     py> list(offsets(tokens, text))
>>     [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>
> [snip]
>
> and then, for an entry in the wacky category, a difflib solution:
>
>  >>> def offsets(tokens, text):
>  ...     from difflib import SequenceMatcher
>  ...     s = SequenceMatcher(None, text, "\t".join(tokens))
>  ...     for start, _, length in s.get_matching_blocks():
>  ...         if length:
>  ...             yield start, start + length
>  ...
>  >>> list(offsets(tokens, text))
>  [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

That's cool, I've never seen that before. If you pass in str.isspace,
you can even drop the "if length:" line::

    py> from difflib import SequenceMatcher
    py> def offsets(tokens, text):
    ...     s = SequenceMatcher(str.isspace, text, '\t'.join(tokens))
    ...     for start, _, length in s.get_matching_blocks():
    ...         yield start, start + length
    ...
    py> list(offsets(tokens, text))
    [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25), (25, 25)]

I think I'm going to have to take a closer look at
difflib.SequenceMatcher; I have to do things similar to this pretty
often...

STeVe
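
For reference, here is a self-contained sketch of the difflib approach that can be run as a script. It keeps the "if length:" test, because get_matching_blocks() always ends its result with a zero-length sentinel block, which is where the trailing (25, 25) pair comes from. The `recovered` list is an illustrative addition and not part of either posted version, and the sketch assumes the text itself contains no tab characters.

```python
from difflib import SequenceMatcher


def offsets(tokens, text):
    """Yield (start, end) offsets in text for each matched block."""
    # Tabs separate the tokens. Assumption: text contains no tabs, so a
    # match can never run across a token boundary.
    matcher = SequenceMatcher(str.isspace, text, "\t".join(tokens))
    for start, _, length in matcher.get_matching_blocks():
        # Skip the zero-length sentinel that always ends the list.
        if length:
            yield start, start + length


tokens = ["She", "'s", "gon", "na", "write", "a", "book", "?"]
text = "She's gonna write\na book?"
spans = list(offsets(tokens, text))

# Illustrative sanity check: slicing the text with each span should
# give back the corresponding token.
recovered = [text[start:end] for start, end in spans]
```

If `recovered` ever differs from `tokens`, the matcher has aligned something other than whole tokens, so comparing the two is a cheap guard before trusting the offsets on other inputs.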