Re: Scanning a file character by character

Rhodri James Tue, 10 Feb 2009 14:48:11 -0800

On Tue, 10 Feb 2009 22:02:57 -0000, Steven D'Aprano <ste...@remove.this.cybersource.com.au> wrote:

On Tue, 10 Feb 2009 12:06:06 +0000, Duncan Booth wrote:

Steven D'Aprano <ste...@remove.this.cybersource.com.au> wrote:

On Mon, 09 Feb 2009 19:10:28 -0800, Spacebar265 wrote:

How would I separate lines into words without scanning one
character at a time?


Scan a line at a time, then split each line into words.


with open('myfile.txt') as f:  # closes the file when the loop ends
    for line in f:
        words = line.split()


should work for a particularly simple-minded idea of words.

Or for a slightly less simple-minded splitting you could try re.split:

re.split(r"(\w+)", "The quick brown fox jumps, and falls over.")[1::2]

['The', 'quick', 'brown', 'fox', 'jumps', 'and', 'falls', 'over']
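
To see why the [1::2] slice works, it may help to look at the full list that re.split returns. This is only a small sketch; the names pieces, gaps and words are made up for illustration. Because the pattern has a capturing group, the matched words stay in the result, alternating with the text between them:

```python
import re

text = "The quick brown fox jumps, and falls over."

# With a capturing group, re.split keeps the matched text in the result,
# so the list alternates: gap, word, gap, word, ..., gap.
pieces = re.split(r"(\w+)", text)

gaps = pieces[0::2]    # even indices: whitespace and punctuation between words
words = pieces[1::2]   # odd indices: the \w+ matches themselves

print(pieces)
print(words)
```

For this particular pattern, re.findall(r"\w+", text) should give the same list of words without the slicing step.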



Perhaps I'm missing something, but the above regex does the exact same
thing as line.split() except it is significantly slower and harder to
read.

Neither deals with quoted text, apostrophes, hyphens, punctuation or any
other details of real-world text. That's what I mean by "simple-minded".


You're missing something :-)  Specifically, the punctuation gets swept
up with the whitespace, and the extended slice skips it.  Apostrophes
(and possibly hyphenation) are still a bit moot, though.
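
A quick side-by-side, as a rough sketch, shows the difference on the same sentence, and also where the regex still falls short. The sample string tricky and the WORD pattern are assumptions added for illustration, not something proposed in the thread; the pattern is just one possible way to keep internal apostrophes and hyphens attached.

```python
import re

line = "The quick brown fox jumps, and falls over."

# str.split() only breaks on whitespace, so punctuation stays attached.
by_whitespace = line.split()

# The regex splits on runs of word characters, so punctuation lands in
# the gaps, and the [1::2] slice skips the gaps.
by_regex = re.split(r"(\w+)", line)[1::2]

print(by_whitespace)  # contains 'jumps,' and 'over.'
print(by_regex)       # contains 'jumps' and 'over'

# Apostrophes and hyphens are not word characters, so they break words up.
tricky = "Don't over-think it."
broken = re.split(r"(\w+)", tricky)[1::2]
print(broken)

# Hypothetical pattern: word characters, optionally followed by more
# word characters joined with a single apostrophe or hyphen.
WORD = re.compile(r"\w+(?:['-]\w+)*")
joined = WORD.findall(tricky)
print(joined)
```

Whether "over-think" should count as one word or two depends on what the words are wanted for, which is probably why that part stays moot.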



--
Rhodri James *-* Wildebeeste Herder to the Masses
--
http://mail.python.org/mailman/listinfo/python-list
