On 4 Aug, 12:34, Fred Mangusta <[EMAIL PROTECTED]> wrote:
>
> thanks for replying. I'm interested in knowing more about your regex
> approach, but as you point out in your comment, seems like access to the
> sourceforge mail archive is restricted. Is there any way I can read
> about it? Would you be so kind to cut and paste it here for instance?

I can't log into SourceForge, possibly because I've forgotten my
password, but I can give you a fairly similar regular expression which
does some of the work:

    import re

    sentence_pattern = re.compile(
        r'(' +
            r'[\(\"\[]*' +    # Quoting or bracketing (optional)
            r'[A-Za-z0-9]' +  # Match sentence with specific start character
            r'.+?' +          # Match sentence content - "?" means non-greedy
            r'[\.\!\?]' +     # End of sentence
            r'[\)\"\]]*' +    # End quoting or bracketing
        r')' +
        r'(\s+)' +            # Spaces
        r'[\(\"\[]*' +        # Quoting or bracketing (optional)
        r'[A-Z0-9]'           # Match sentence with specific start character
        )

This is mostly the same as that posted to SourceForge, but with some
enhancements; I've indented the part which actually produces the
matched sentence text in a group. Note that the pattern also consumes
the first character of the following sentence, so any repeated search
has to resume from the end of the whitespace group, not from the end of
the match.

Unfortunately, some postprocessing is required to deal with
abbreviations, and I maintain a list of these against which I test the
supposed ends of sentences that the regular expression provides. In
addition, I also try to detect initials (e.g. G. van Rossum) which the
regular expression may regard as the end of a sentence.

As I noted, I'd be interested to hear of any better solutions which
don't involve training.

Paul
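
The postprocessing described above might look roughly like the following.
This is only a sketch under assumptions: the abbreviation set, the BOUNDARY
and INITIAL patterns and the split_sentences name are all made up for
illustration and are not the code posted to SourceForge. It also uses a
simplified boundary pattern with a lookahead instead of the pattern above, so
that the first character of the next sentence is not consumed.

```python
import re

# Candidate boundary: a terminator, optional closing quotes/brackets,
# whitespace (group 1), then a lookahead for optional opening
# quotes/brackets and a capital letter or digit.
BOUNDARY = re.compile(r'[.!?][)"\]]*(\s+)(?=[("\[]*[A-Z0-9])')

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs.", "Prof.", "vs."}

# A single capital letter followed by a full stop, such as "G."
INITIAL = re.compile(r'^[A-Z]\.$')


def split_sentences(text):
    """Split text at candidate boundaries, skipping abbreviations and initials."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start(1)]
        words = candidate.split()
        last_word = words[-1] if words else ""
        # Ignore surrounding quotes/brackets when testing the last word.
        token = last_word.lstrip('("[').rstrip(')"]')
        if token in ABBREVIATIONS or INITIAL.match(token):
            continue  # Not a real sentence end; keep accumulating.
        sentences.append(candidate)
        start = match.end(1)
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences(
        'Dr. Smith met G. V. Rossum in 2008. "They talked!" It went well.'
    ):
        print(sentence)
```

One known weakness of this kind of check: a sentence which really does end in
an abbreviation or an initial gets merged with the sentence after it.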