On Tue, 14 Sep 2010 09:08:24 am Joel Goldstick wrote:
> On Mon, Sep 13, 2010 at 6:41 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> > On Tue, 14 Sep 2010 04:18:36 am Joel Goldstick wrote:
> > > How about using str.split() to put words in a list, then run
> > > strip() over each word with the required characters to be removed
> > > ('`")
> >
> > Doesn't work. strip() only removes characters at the beginning and
> > end of the word, not in the middle:
>
> Exactly, you first split the words into a list of words, then strip
> each word

Of course, if you don't want to remove ALL punctuation marks, but only 
those at the beginning and end of words, then strip() is a reasonable 
approach. But if the aim is to strip out all punctuation, no matter 
where, then it can't work.
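
To make the difference concrete, here is a minimal sketch. The function 
names and the sample sentence are mine, purely for illustration, and 
"punctuation" is assumed to mean the ASCII set in string.punctuation:

```python
import string

def strip_edges(text):
    """Split on whitespace, then strip punctuation from each word's ends only."""
    return [word.strip(string.punctuation) for word in text.split()]

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def remove_everywhere(text):
    """Split on whitespace, then delete punctuation wherever it appears."""
    return [word.translate(_DELETE_PUNCT) for word in text.split()]

sample = "\"Don't stop,\" she said; 'it's fine.'"
print(strip_edges(sample))        # internal apostrophes survive
print(remove_everywhere(sample))  # "Don't" collapses to "Dont"
```

strip_edges() leaves "Don't" and "it's" alone, while remove_everywhere() 
turns them into "Dont" and "its".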

Since the aim is to count words, a better approach might be a hybrid -- 
remove all punctuation marks like commas, full stops, etc. no matter 
where they appear, keep internal apostrophes so that words like "can't" 
are different from "cant", but remove external ones. Although that 
loses information in the case of (e.g.) dialect speech:

    "'e said 'e were going to kill the lady, Mister Holmes!" 
    cried the lad excitedly.

You probably want to count the word as 'e rather than just e.
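
A rough sketch of that hybrid, under a few assumptions of my own: 
count_words is a made-up name, words are lower-cased before counting, 
only ASCII punctuation is considered, and hyphens are left untouched:

```python
import string
from collections import Counter

# Everything in string.punctuation except the apostrophe and the hyphen.
_TO_DELETE = string.punctuation.replace("'", "").replace("-", "")
_DELETE_TABLE = str.maketrans("", "", _TO_DELETE)

def count_words(text):
    """Count words, keeping internal apostrophes but dropping external ones."""
    counts = Counter()
    for raw in text.split():
        # Delete commas, full stops, quotes etc. anywhere in the word,
        # then strip apostrophes from the two ends only.
        word = raw.translate(_DELETE_TABLE).strip("'").lower()
        if word:
            counts[word] += 1
    return counts

counts = count_words("I can't sing, but 'cant' is a word.")
print(counts["can't"], counts["cant"])

# The dialect problem: the leading apostrophe of 'e is lost.
print(count_words("\"'e said 'e were going\""))
```

The first call counts "can't" and "cant" as separate words, as wanted. 
The second shows the information loss: both occurrences of 'e are 
counted as plain e.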

And hyphenation is tricky too. Lone hyphens - like these - should be 
deleted. But double-dashes--like these--are word separators, so need to 
be replaced by a space. Otherwise, single hyphens should be kept. If a 
word begins or ends with a hyphen, it should be joined up with the 
previous or next word. But then it gets more complicated, because you 
don't know whether to keep the hyphen after joining or not.
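
The first three rules can be sketched with a couple of regular 
expressions. This is only an illustration: normalise_hyphens is a 
hypothetical name, "lone" is taken to mean a hyphen with whitespace (or 
the edge of the string) on both sides, and words that begin or end with 
a hyphen are deliberately left alone:

```python
import re

def normalise_hyphens(text):
    """Apply the simple hyphen rules to a single piece of text."""
    # Two or more hyphens in a row act as a word separator.
    text = re.sub(r"-{2,}", " ", text)
    # A single hyphen with no word character touching it is deleted.
    text = re.sub(r"(?<!\S)-(?!\S)", "", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return " ".join(text.split())

print(normalise_hyphens("A lone hyphen - like these - should go"))
print(normalise_hyphens("double-dashes--like these--are separators"))
```

The hyphen inside "double-dashes" is kept because it has a letter on 
each side.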

E.g. if the line ends with: 

blah blah blah blah some-
thing blah blah blah.

should the joined up word become the compound word "some-thing" or the 
regular word "something"? In general, there's no way to be sure, 
although you can make a good guess by looking it up in a dictionary and 
assuming that regular words should be preferred to compound words. But 
that will fail if the word has changed over time, such as "cooperate", 
which until very recently used to be written "co-operate", and before 
that as "coöperate".
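
Here is one way the dictionary guess might look. It is a sketch, not a 
recommendation: join_hyphenated and known_words are names I have made 
up, and known_words stands in for whatever word list you have to hand 
(assumed here to be a set of lower-case words):

```python
import string

def _merge(head, tail, known_words):
    """Join 'some-' and 'thing', dropping the hyphen if the result is a known word."""
    joined = head[:-1] + tail
    # Ignore case and surrounding punctuation when looking the word up.
    if joined.strip(string.punctuation).lower() in known_words:
        return joined
    return head + tail  # keep the hyphen: treat it as a compound word

def join_hyphenated(lines, known_words):
    """Return the words of lines, re-joining words split across line ends."""
    words_out = []
    carry = ""  # fragment ending in "-" carried over from the previous line
    for line in lines:
        words = line.split()
        if carry and words:
            words[0] = _merge(carry, words[0], known_words)
            carry = ""
        # A final word ending in a single hyphen continues on the next line.
        if words and len(words[-1]) > 1 and words[-1].endswith("-") \
                and not words[-1].endswith("--"):
            carry = words.pop()
        words_out.extend(words)
    if carry:
        words_out.append(carry)
    return words_out

lines = ["blah blah blah blah some-", "thing blah blah blah."]
print(join_hyphenated(lines, {"something"}))  # guesses "something"
print(join_hyphenated(lines, set()))          # falls back to "some-thing"
```

With a modern word list containing "cooperate", the same code would 
join "co-" and "operate" into "cooperate" even when the author wrote 
"co-operate", which is exactly the failure described above.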



-- 
Steven D'Aprano