Re: Scanning a file character by character

Tim Chase Tue, 10 Feb 2009 14:47:13 -0800

Or for a slightly less simple minded splitting you could try re.split:
re.split("(\w+)", "The quick brown fox jumps, and falls over.")[1::2]
['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
Perhaps I'm missing something, but the above regex does the exact same thing as line.split() except it is significantly slower and harder to read.
Neither deals with quoted text, apostrophes, hyphens, punctuation or any other details of real-world text. That's what I mean by "simple-minded".


  >>> s = "The quick brown fox jumps, and falls over."
  >>> import re
  >>> re.split(r"(\w+)", s)[1::2]
  ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
  >>> s.split()
  ['The', 'quick', 'brown', 'fox', 'jumps,', 'and', 'falls', 'over.']

Note the difference in "jumps" vs. "jumps," (extra comma in the s.split() version) and likewise the period after "over". Thus the regex is not quite "the exact same thing as line.split()".


I think an easier-to-read variant would be

  >>> re.findall(r"\w+", s)
  ['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']

which just finds words.  One could also just limit it to letters with

  re.findall(r"[a-zA-Z]+", s)

as "\w" is a little more encompassing (letters, digits, and underscores) if that's a problem.
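
To make the difference concrete, here is a small sketch comparing the three approaches. The sample string is my own invention (not from the thread), chosen to contain an apostrophe, a hyphen, a digit and an underscore:

```python
import re

# Hypothetical sample text, picked only to exercise the edge cases.
sample = "Don't re-use my_var2 here."

# \w+ matches runs of letters, digits and underscores.
words_w = re.findall(r"\w+", sample)

# [a-zA-Z]+ matches runs of ASCII letters only.
words_letters = re.findall(r"[a-zA-Z]+", sample)

# str.split() only breaks on whitespace, so punctuation stays attached.
words_split = sample.split()

print(words_w)        # ['Don', 't', 're', 'use', 'my_var2', 'here']
print(words_letters)  # ['Don', 't', 're', 'use', 'my', 'var', 'here']
print(words_split)    # ["Don't", 're-use', 'my_var2', 'here.']
```

None of the three handles the apostrophe or the hyphen the way a human reader would, which is the "simple-minded" point made earlier in the thread.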


-tkc




