Or, for a slightly less simple-minded split, you could try re.split:

re.split(r"(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']


Perhaps I'm missing something, but the above regex does the exact same thing as line.split() except it is significantly slower and harder to read.

Neither deals with quoted text, apostrophes, hyphens, punctuation or any other details of real-world text. That's what I mean by "simple-minded".

  >>> s = "The quick brown fox jumps, and falls over."
  >>> import re
  >>> re.split(r"(\w+)", s)[1::2]
  ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
  >>> s.split()
  ['The', 'quick', 'brown', 'fox', 'jumps,', 'and', 'falls', 'over.']

Note the difference in "jumps" vs. "jumps," (extra comma in the string.split() version) and likewise the period after "over". Thus not quite "the exact same thing as line.split()".

I think an easier-to-read variant would be

  >>> re.findall(r"\w+", s)
  ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']

which just finds words.  One could also limit it to letters with

  re.findall(r"[a-zA-Z]+", s)

as "\w" is a little more encompassing (letters, digits, and underscores) if that's a problem.
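
To see the difference between "\w" and a letters-only class in practice, here's a minimal sketch. The sample strings and variable names are my own, and the last pattern (keeping internal apostrophes and hyphens) is just one assumed way to handle those cases, not a complete tokenizer:

```python
import re

# Sample text made up for illustration; "fox_2" shows where the patterns differ.
s = "The quick brown fox_2 jumps, and falls over."

# \w matches letters, digits and the underscore, so "fox_2" stays one word.
words_w = re.findall(r"\w+", s)

# [a-zA-Z]+ matches runs of ASCII letters only, so "fox_2" is cut down to "fox".
words_letters = re.findall(r"[a-zA-Z]+", s)

# Assumption: treat an apostrophe or hyphen between letters as part of the word.
words_joined = re.findall(r"[a-zA-Z]+(?:['-][a-zA-Z]+)*", "Don't over-think it.")

print(words_w)
print(words_letters)
print(words_joined)
```

Note the "+" in the letters-only pattern: without it, findall returns one-character matches rather than whole words.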

-tkc



