On 08Dec2021 21:41, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
>Julius Hamilton <juliushamilton...@gmail.com> writes:
>>This is a really simple program which extracts the text from webpages and
>>displays them one sentence at a time.
>
>  Our teacher said NLTK will not come up until next year, so
>  I tried to do with regexps. It still has bugs, for example
>  it can not tell the dot at the end of an abbreviation from
>  the dot at the end of a sentence!

This is almost a classic demo of why regexps are a poor tool as a first
choice. You can do much with them, but they are cryptic and bug prone.

I am not seeking to mock you, but trying to make apparent why regexps
are to be avoided a lot of the time. They have their place.

You've read the whole re module docs I hope:

    https://docs.python.org/3/library/re.html#module-re

>import re
>import urllib.request
>uri = r'''http://example.com/article''' # replace this with your URI!
>request = urllib.request.Request( uri )
>resource = urllib.request.urlopen( request )
>cs = resource.headers.get_content_charset()
>content = resource.read().decode( cs, errors="ignore" )
>content = re.sub( r'''[\r\n\t\s]+''', r''' ''', content )

You're not multiline, so I would recommend a plain raw string:

    content = re.sub( r'[\r\n\t\s]+', r' ', content )

No need for \r, \n or \t in the class: \s covers all of them, so r'\s+'
is enough. From the docs:

    \s
      For Unicode (str) patterns:
        Matches Unicode whitespace characters (which includes [
        \t\n\r\f\v], and also many other characters, for example the
        non-breaking spaces mandated by typography rules in many
        languages). If the ASCII flag is used, only [ \t\n\r\f\v] is
        matched.

>upper = r"[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ]" # "[\\p{Lu}]"
>lower = r"[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]" # "[\\p{Ll}]"

This is very fragile: you have an arbitrary set of additional uppercase
characters, almost certainly incomplete, and visually hard to inspect
for completeness.

Instead, consider the \b (word boundary) and \w (word character)
markers, which will let you break strings up, and then maybe test the
results with str.isupper().

>digit = r"[0-9]" #"[\\p{Nd}]"

There's a \d character class for this. In str patterns it matches any
Unicode decimal digit, so it covers non-ASCII digits too.

>firstwordstart = upper;
>firstwordnext = "(?:[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-])";

Again, an inline arbitrary list of characters. This is fragile.

>wordcharacter = "[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝa-zµàáâãäåæçèéêëìíîïð\
>ñòóôõöøùúûüýþÿ0-9-]"

Again inline.
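
To make the \s, \w, str.isupper() and \d suggestions above concrete,
here is a minimal sketch. The sample text and the variable names are
made up for illustration; it is not a sentence splitter, only a
demonstration that the built-in classes cope with accented letters and
non-ASCII digits without any hand-maintained character list.

```python
import re

# Made-up sample: accented capitals (written as escapes), mixed
# whitespace, and Arabic-Indic digits (U+0664 U+0662).
text = "\u00c9l vio a \u00c5ngstr\u00f6m.\r\n\tThen 42 and \u0664\u0662 arrived."

# \s already matches \r, \n and \t, so a bare \s+ collapses every whitespace run.
collapsed = re.sub(r"\s+", " ", text)

# \w+ picks out runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", collapsed)

# str.isupper() on the first character recognises any Unicode uppercase letter.
capitalised = [w for w in words if w[0].isupper()]

# In a str pattern, \d matches any Unicode decimal digit, not only 0-9.
numbers = re.findall(r"\d+", collapsed)

print(capitalised)
print(numbers)
```

Note that \w also matches the underscore, which may or may not matter
for your text.
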
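Why not construct wordcharacter from the pieces you already have?
Something along the lines of:

    wordcharacter = upper + lower + digit

except that upper, lower and digit are each complete bracketed classes,
so plain concatenation would match three characters in a row. You would
need to join them as alternatives, or keep the class contents without
the brackets and wrap the combined string in a single [...].

But I recommend \w instead, or for this: [\w-] (\w already covers the
digits, so there is no need to add \d; be aware it also matches
underscore).

>addition = "(?:(?:[']" + wordcharacter + "+)*[']?)?"

As a matter of good practice with regexp strings, use raw quotes:

    addition = r"(?:(?:[']" + wordcharacter + r"+)*[']?)?"

even when there are no backslashes.

Seriously, doing this with regexps is difficult. A useful exercise for
learning regexps, but in the general case not the first tool to reach
for.

Cheers,
Cameron Simpson <c...@cskk.id.au>
-- 
https://mail.python.org/mailman/listinfo/python-list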
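
PS: a minimal sketch of building the word-character class from parts
rather than retyping it inline. The names upper_chars, lower_chars and
digit_chars are hypothetical, and the ranges are cut down to ASCII to
keep the example short; the point is only that the parts hold the class
contents without brackets, so they can be combined inside one [...].

```python
import re

# Hypothetical, shortened parts: class *contents*, no surrounding brackets.
upper_chars = r"A-Z"
lower_chars = r"a-z"
digit_chars = r"0-9"

# One class made of all the parts; the trailing "-" is a literal hyphen.
wordcharacter = "[" + upper_chars + lower_chars + digit_chars + "-]"

# Pitfall: concatenating bracketed classes gives a three-character sequence.
sequence = "[" + upper_chars + "]" + "[" + lower_chars + "]" + "[" + digit_chars + "]"

single = re.fullmatch(wordcharacter, "x") is not None      # one character from the class
triple = re.fullmatch(sequence, "Ab1") is not None         # upper, then lower, then digit
not_single = re.fullmatch(sequence, "x") is None           # a lone character does not match
word = re.fullmatch(wordcharacter + "+", "well-known42")   # a whole hyphenated word

print(wordcharacter, single, triple, not_single, word is not None)
```
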