On Tue, 14 Sep 2010 09:08:24 am Joel Goldstick wrote:
> On Mon, Sep 13, 2010 at 6:41 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> > On Tue, 14 Sep 2010 04:18:36 am Joel Goldstick wrote:
> > > How about using str.split() to put words in a list, then run
> > > strip() over each word with the required characters to be removed
> > > ('`")
> >
> > Doesn't work. strip() only removes characters at the beginning and
> > end of the word, not in the middle:
>
> Exactly, you first split the words into a list of words, then strip
> each word

Of course, if you don't want to remove ALL punctuation marks, but only those at the beginning and end of words, then strip() is a reasonable approach. But if the aim is to strip out all punctuation, no matter where it appears, then strip() can't work.

Since the aim is to count words, a better approach might be a hybrid: remove punctuation marks like commas, full stops, etc. no matter where they appear, keep internal apostrophes so that words like "can't" stay different from "cant", but remove external ones. That does lose information in the case of (e.g.) dialect speech:

"'e said 'e were going to kill the lady, Mister Holmes!" cried the lad excitedly.

You probably want to count the word as 'e rather than just e.

And hyphenation is tricky too. Lone hyphens - like these - should be deleted. But double-dashes--like these--are word separators, so they need to be replaced by a space. Otherwise, single hyphens should be kept. If a word begins or ends with a hyphen, it should be joined up with the previous or next word. But then it gets more complicated, because you don't know whether to keep the hyphen after joining or not. E.g. if the line ends with:

    blah blah blah blah some-
    thing blah blah blah.

should the joined-up word become the compound word "some-thing" or the regular word "something"? In general, there's no way to be sure, although you can make a good guess by looking it up in a dictionary and assuming that regular words should be preferred to compound words. But that will fail if the spelling has changed over time, as with "cooperate", which until very recently used to be written "co-operate", and before that "coöperate".

-- 
Steven D'Aprano
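
For illustration, the hybrid rules described above could be sketched roughly as follows. This is only a sketch under stated assumptions, not code from the thread: the function name words(), the use of string.punctuation as the set of marks to delete, and the lowercasing are all my own choices. It makes no attempt at joining words split by a hyphen at the end of a line, since that needs a dictionary lookup, and it still turns dialect 'e into plain e.

```python
import string

# Assumption: every ASCII punctuation mark except the apostrophe and the
# hyphen is deleted wherever it appears. Those two get special handling.
DELETE_ANYWHERE = ''.join(c for c in string.punctuation if c not in "'-")
_DELETE_TABLE = str.maketrans('', '', DELETE_ANYWHERE)


def words(text):
    """Return a list of lowercase words from text (hypothetical helper)."""
    # Double dashes act as word separators, so turn them into spaces first.
    text = text.replace('--', ' ')
    # Delete commas, full stops, quotation marks, etc. no matter where.
    text = text.translate(_DELETE_TABLE)
    result = []
    for token in text.split():
        # Remove external apostrophes only; "can't" keeps its internal one.
        token = token.strip("'")
        # Skip lone hyphens and tokens that are now empty.
        if token.strip('-') == '':
            continue
        result.append(token.lower())
    return result
```

Feeding the result to collections.Counter then gives the word counts. Tokens that begin or end with a single hyphen are left untouched here, so the "some-" / "thing" case still has to be resolved separately.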