Or for a slightly less simple-minded splitting you could try re.split:
re.split(r"(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
Perhaps I'm missing something, but the above regex does the exact same
thing as line.split() except it is significantly slower and harder to
read.
Neither deals with quoted text, apostrophes, hyphens, punctuation or any
other details of real-world text. That's what I mean by "simple-minded".
>>> s = "The quick brown fox jumps, and falls over."
>>> import re
>>> re.split(r"(\w+)", s)[1::2]
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
>>> s.split()
['The', 'quick', 'brown', 'fox', 'jumps,', 'and', 'falls', 'over.']
Note the difference in "jumps" vs. "jumps," (extra comma in the
s.split() version) and likewise the period after "over".
Thus not quite "the exact same thing as line.split()".
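In case the [1::2] slice looks like magic: when the pattern contains a
capturing group, re.split keeps the captured text in its result, so the
list alternates between "stuff between words" and "words". A small
sketch of that, reusing the same sentence (the names parts, separators
and words are just mine for illustration):

```python
import re

s = "The quick brown fox jumps, and falls over."

# With a capturing group, re.split returns the separators AND the matches.
parts = re.split(r"(\w+)", s)

# Even indices hold the text between matches (possibly empty),
# odd indices hold the captured words.
separators = parts[0::2]
words = parts[1::2]

print(parts)
print(words)
```

The first element is an empty string because the sentence starts with a
match, and joining parts back together reproduces the original string.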
I think an easier-to-read variant would be
>>> re.findall(r"\w+", s)
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
which just finds words. One could also limit it to letters with
re.findall(r"[a-zA-Z]+", s)
as "\w" is a little more encompassing (letters, digits and underscores)
if that's a problem.
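
To make the difference concrete, here is a rough comparison on a
made-up string (the sample text and the third pattern are my own
assumptions, not something from earlier in the thread). The last
pattern is one possible way to keep apostrophes and hyphens inside
words, and it is still nowhere near a real tokenizer:

```python
import re

# Made-up sample with a digit, an underscore, an apostrophe and a hyphen.
sample = "fox_2 jumps, don't over-think it"

# \w+ accepts letters, digits and underscores.
word_chars = re.findall(r"\w+", sample)

# ASCII letters only: digits and underscores end a word.
letters_only = re.findall(r"[a-zA-Z]+", sample)

# Letters, optionally joined by a single apostrophe or hyphen.
with_joiners = re.findall(r"[a-zA-Z]+(?:['-][a-zA-Z]+)*", sample)

print(word_chars)    # ['fox_2', 'jumps', 'don', 't', 'over', 'think', 'it']
print(letters_only)  # ['fox', 'jumps', 'don', 't', 'over', 'think', 'it']
print(with_joiners)  # ['fox', 'jumps', "don't", 'over-think', 'it']
```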
-tkc