"Steven Bethard" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > I've got a list of word substrings (the "tokens") which I need to align > to a string of text (the "sentence"). The sentence is basically the > concatenation of the token list, with spaces sometimes inserted beetween > tokens. I need to determine the start and end offsets of each token in > the sentence. For example:: > > py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?'] > py> text = '''\ > ... She's gonna write > ... a book?''' > py> list(offsets(tokens, text)) > [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)] >
Hey, I get the same answer with this:

===================
from pyparsing import oneOf

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = '''\
She's gonna write
a book?'''

tokenlist = oneOf( " ".join(tokens) )
offsets = [ (start, end) for token, start, end in tokenlist.scanString(text) ]

print offsets
===================
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

Of course, pyparsing may be a bit heavyweight to drag into a simple
function like this, and certainly not nearly as fast as a regexp, but it
was such a nice way to show how scanString works.

Pyparsing's "oneOf" helper function takes care of the same longest-match
issues that Fredrik Lundh handles using sort, reverse, etc., just so long
as none of the tokens has an embedded space character.

-- Paul
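P.S. For comparison, here's a rough sketch of what the regexp flavor of
the same thing might look like.  It's only my guess at the
sort-longest-first idea (the re.escape call, the sorting, and the way
the alternation is built are mine, not Fredrik's actual code), but on
this input it gives the same (start, end) pairs:

===================
import re

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = '''\
She's gonna write
a book?'''

# Sort the tokens longest-first before joining them into an alternation,
# so a short token can't win at a position where a longer token that it
# is a prefix of should match (the same issue oneOf handles for you).
alts = [ re.escape(tok) for tok in sorted(tokens, key=len, reverse=True) ]
pattern = re.compile( "|".join(alts) )

# finditer scans left to right, skipping the text between tokens, so each
# match span is a token's (start, end) offset in the sentence.
offsets = [ (m.start(), m.end()) for m in pattern.finditer(text) ]

print offsets
===================
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

Not nearly as much fun as scanString, but for long texts the re module
will be the faster of the two.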