Steven Bethard wrote:
> Michael Spencer wrote:
>
>> Steven Bethard wrote:
>>
>>> I've got a list of word substrings (the "tokens") which I need to
>>> align to a string of text (the "sentence").  The sentence is
>>> basically the concatenation of the token list, with spaces sometimes
>>> inserted between tokens.  I need to determine the start and end
>>> offsets of each token in the sentence.  For example::
>>>
>>>     py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
>>>     py> text = '''\
>>>     ... She's gonna write
>>>     ... a book?'''
>>>     py> list(offsets(tokens, text))
>>>     [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>>
>
> [snip]
>
>> and then, for an entry in the wacky category, a difflib solution:
>>
>> >>> def offsets(tokens, text):
>> ...     from difflib import SequenceMatcher
>> ...     s = SequenceMatcher(None, text, "\t".join(tokens))
>> ...     for start, _, length in s.get_matching_blocks():
>> ...         if length:
>> ...             yield start, start + length
>> ...
>> >>> list(offsets(tokens, text))
>> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>
> That's cool, I've never seen that before.  If you pass in str.isspace,
> you can even drop the "if length:" line::
>
>     py> from difflib import SequenceMatcher
>     py> def offsets(tokens, text):
>     ...     s = SequenceMatcher(str.isspace, text, '\t'.join(tokens))
>     ...     for start, _, length in s.get_matching_blocks():
>     ...         yield start, start + length
>     ...
>     py> list(offsets(tokens, text))
>     [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25), (25, 25)]

Sorry, that should have been::

    list(offsets(tokens, text))[:-1]

since the last item is always the zero-length one.  Which means you
don't really need str.isspace either.

STeVe
-- 
http://mail.python.org/mailman/listinfo/python-list
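
For reference, the correction above can be folded into the function itself. This is only a minimal sketch: it assumes the text never contains a tab character (the tab separator is what keeps a match from spanning two tokens), and the autojunk=False argument is an addition that appears nowhere in this thread, included on the assumption that inputs of 200 characters or more could otherwise trigger difflib's popular-element heuristic.

```python
from difflib import SequenceMatcher


def offsets(tokens, text):
    """Yield (start, end) offsets into text for each matched token."""
    # Joining with a tab assumes no tab occurs in text, so a matching
    # block can never run across a token boundary.
    # autojunk=False is an assumption added here, not part of the thread.
    matcher = SequenceMatcher(None, text, "\t".join(tokens), autojunk=False)
    # get_matching_blocks() always ends with a zero-length sentinel block,
    # so slicing it off replaces the "if length:" test.
    for start, _, length in matcher.get_matching_blocks()[:-1]:
        yield start, start + length


tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = "She's gonna write\na book?"
result = list(offsets(tokens, text))
```

Slicing the text with each pair, as in [text[s:e] for s, e in result], should give back the token list, which makes a quick sanity check that nothing was mismatched.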