Re: Short, perfect program to read sentences of webpage
On 08Dec2021 23:17, Stefan Ram wrote:
>Regexps might have their disadvantages, but when I use them,
>it is clearer for me to do all the matching with regexps
>instead of mixing them with Python calls like str.isupper.
>Therefore, it is helpful for me to have a regexp to match
>upper and lower case characters separately. Some regexp
>dialects support "\p{Lu}" and "\p{Ll}" for this.

Aye. I went looking for that in the Python re module docs and could not
find them. So the compromise is to match any word, then test the word
with isupper() (or whatever is appropriate).

>I have not yet incorporated (all) your advice into my code,
>but I came to the conclusion myself that the repetition of
>long sequences like r"A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ" and
>not using f strings to insert other strings was especially
>ugly.

The tricky bit with f-strings and regexps is that \w{3,5} means from 3
through 5 "word characters". So if you've got those in an f-string
you're off to double-the-brackets land, a bit like double-backslash land
and non-raw-strings. Otherwise, yes, f-strings are a nice way to compose
things.

Cheers,
Cameron Simpson
--
https://mail.python.org/mailman/listinfo/python-list
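[Editor's note: the brace-doubling Cameron describes can be seen in a small sketch; the character range and names here are illustrative, not a complete uppercase set.]

```python
import re

# In an f-string, literal { and } must be doubled, so the regexp
# repetition {2,4} is written {{2,4}}.
upper = "A-ZÀ-Þ"  # illustrative range only, not exhaustive
pat = re.compile(rf"[{upper}]\w{{2,4}}")

print(pat.pattern)             # the doubled braces come out single
print(bool(pat.match("Hello")))
```

The compiled pattern is `[A-ZÀ-Þ]\w{2,4}`: the f-string substitution and the regexp repetition coexist once the braces are doubled.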
Re: Short, perfect program to read sentences of webpage
On 2021-12-08 23:17, Stefan Ram wrote:
> Cameron Simpson writes:
>> Instead, consider the \b (word boundary) and \w (word character)
>> markers, which will let you break strings up, and then maybe test
>> the results with str.isupper().
>
> Thanks for your comments, most or all of them are valid, and I will
> try to take them into account!
>
> Regexps might have their disadvantages, but when I use them, it is
> clearer for me to do all the matching with regexps instead of mixing
> them with Python calls like str.isupper. Therefore, it is helpful for
> me to have a regexp to match upper and lower case characters
> separately. Some regexp dialects support "\p{Lu}" and "\p{Ll}" for
> this.

If you want "\p{Lu}" and "\p{Ll}", have a look at the 'regex' module on
PyPI:

    https://pypi.org/project/regex/

[snip]
--
https://mail.python.org/mailman/listinfo/python-list
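[Editor's note: without the third-party 'regex' module, the stdlib can still report the Unicode categories that \p{Lu} and \p{Ll} name, via unicodedata; a rough stand-in for the regex-module syntax, not an equivalent.]

```python
import unicodedata

# 'Lu' is "Letter, uppercase" and 'Ll' is "Letter, lowercase" -- the
# same Unicode general categories that \p{Lu} and \p{Ll} refer to.
for ch in "AaÀàÝý":
    print(ch, unicodedata.category(ch))
```

This covers accented letters like À and ý without spelling out character ranges by hand.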
Re: Short, perfect program to read sentences of webpage
On 2021-12-09 09:42:07 +1100, Cameron Simpson wrote:
> On 08Dec2021 21:41, Stefan Ram wrote:
> >Julius Hamilton writes:
> >>This is a really simple program which extracts the text from
> >>webpages and displays them one sentence at a time.
> >
> > Our teacher said NLTK will not come up until next year, so
> > I tried to do it with regexps. It still has bugs, for example
> > it can not tell the dot at the end of an abbreviation from
> > the dot at the end of a sentence!
>
> This is almost a classic demo of why regexps are a poor tool as a
> first choice. You can do much with them, but they are cryptic and bug
> prone.

I don't think that's the problem here. The problem is that natural
languages just aren't regular languages. In fact I'm not sure that they
fit anywhere within the Chomsky hierarchy (but if they aren't type-0,
that would be a strong argument against the possibility of human-level
AI).

In English, if a sentence ends with an abbreviation you write only a
single dot. So if you look at these two fragments:

    For matching strings, numbers, etc. Python provides regular
    expressions.

    Let's say you want to match strings, numbers, etc. Python provides
    regular expressions for these tasks.

In the second case the dot ends a sentence; in the first it doesn't.
But to distinguish those cases you need to at least parse the sentences
at the syntax level (which regular expressions can't do), maybe even
understand them semantically.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
--
https://mail.python.org/mailman/listinfo/python-list
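[Editor's note: Peter's point can be demonstrated with a deliberately naive regexp splitter; the pattern below is illustrative, not from the thread. It cannot tell the two fragments apart.]

```python
import re

# Naive rule: a sentence ends at . ! or ? followed by whitespace and a
# capital letter. Both fragments trigger this rule at "etc. Python".
naive = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

frag1 = "For matching strings, numbers, etc. Python provides regular expressions."
frag2 = "Let's say you want to match strings, numbers, etc. Python provides regular expressions for these tasks."

print(naive.split(frag1))  # wrongly split after "etc."
print(naive.split(frag2))  # split after "etc.", correctly this time
```

Both fragments come back as two pieces, though only the second genuinely contains two sentences; the pattern sees identical context either way.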
Re: Short, perfect program to read sentences of webpage
On 08Dec2021 21:41, Stefan Ram wrote:
>Julius Hamilton writes:
>>This is a really simple program which extracts the text from webpages
>>and displays them one sentence at a time.
>
> Our teacher said NLTK will not come up until next year, so
> I tried to do it with regexps. It still has bugs, for example
> it can not tell the dot at the end of an abbreviation from
> the dot at the end of a sentence!

This is almost a classic demo of why regexps are a poor tool as a first
choice. You can do much with them, but they are cryptic and bug prone.

I am not seeking to mock you, but trying to make apparent why regexps
are to be avoided a lot of the time. They have their place. You've read
the whole re module docs I hope:

    https://docs.python.org/3/library/re.html#module-re

>import re
>import urllib.request
>uri = r'''http://example.com/article''' # replace this with your URI!
>request = urllib.request.Request( uri )
>resource = urllib.request.urlopen( request )
>cs = resource.headers.get_content_charset()
>content = resource.read().decode( cs, errors="ignore" )
>content = re.sub( r'''[\r\n\t\s]+''', r''' ''', content )

You're not multiline, so I would recommend a plain raw string:

    content = re.sub( r'[\r\n\t\s]+', r' ', content )

No need for \r in the class, \s covers that. From the docs:

    \s
        For Unicode (str) patterns: Matches Unicode whitespace
        characters (which includes [ \t\n\r\f\v], and also many other
        characters, for example the non-breaking spaces mandated by
        typography rules in many languages). If the ASCII flag is
        used, only [ \t\n\r\f\v] is matched.

>upper = r"[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ]" # "[\\p{Lu}]"
>lower = r"[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]" # "[\\p{Ll}]"

This is very fragile - you have an arbitrary set of additional
uppercase characters, almost certainly incomplete, and visually hard to
inspect for completeness.
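[Editor's note: the claim that \s subsumes \r can be checked quickly; the sample text is illustrative.]

```python
import re

raw = "one\r\n\ttwo\tthree   four"
# \s+ already matches \r, \n, \t and runs of spaces, so the explicit
# [\r\n\t\s] class is redundant.
flat = re.sub(r'\s+', ' ', raw)
print(flat)  # -> 'one two three four'
```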
Instead, consider the \b (word boundary) and \w (word character)
markers, which will let you break strings up, and then maybe test the
results with str.isupper().

>digit = r"[0-9]" #"[\\p{Nd}]"

There's a \d character class for this, covers non-ASCII digits too.

>firstwordstart = upper;
>firstwordnext = "(?:[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-])";

Again, an inline arbitrary list of characters. This is fragile.

>wordcharacter = "[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝa-zµàáâãäåæçèéêëìíîïð\
>ñòóôõöøùúûüýþÿ0-9-]"

Again inline. Why not construct it?

    wordcharacter = upper + lower + digit

but I recommend \w instead, or for this: [\w\d]

>addition = "(?:(?:[']" + wordcharacter + "+)*[']?)?"

As a matter of good practice with regexp strings, use raw quotes:

    addition = r"(?:(?:[']" + wordcharacter + r"+)*[']?)?"

even when there are no backslashes.

Seriously, doing this with regexps is difficult. A useful exercise for
learning regexps, but in the general case not the first tool to reach
for.

Cheers,
Cameron Simpson
--
https://mail.python.org/mailman/listinfo/python-list
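[Editor's note: Cameron's \w-plus-str.isupper() suggestion, sketched; the sample text is illustrative.]

```python
import re

text = "État Über zürich déjà 42"
words = re.findall(r'\w+', text)       # \w already covers accented letters
caps = [w for w in words if w[:1].isupper()]

print(words)
print(caps)  # -> ['État', 'Über']
```

The matching is done with a short, complete \w pattern; the case test is deferred to str.isupper(), so no hand-maintained character list is needed.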
Re: Short, perfect program to read sentences of webpage
On 2021-12-08, Julius Hamilton wrote:
> 1. The HTML extraction is not perfect. It doesn't produce as clean
> text as I would like. Sometimes random links or tags get left in
> there. And the sentences are sometimes randomly broken by newlines.

Oh. Leaving tags in suggests you are doing this very wrongly. Python
has plenty of open source libraries you can use that will parse the
HTML reliably into tags and text for you.

> 2. Neither is the segmentation perfect. I am currently researching
> developing an optimal segmenter with tools from Spacy.
>
> Brevity is greatly valued. I mean, anyone who can make the program
> more perfect, that's hugely appreciated. But if someone can do it in
> very few lines of code, that's also appreciated.

It isn't something that can be done in a few lines of code. There's the
spaces issue you mention for example. Nor is it something that can
necessarily be done just by inspecting the HTML alone. To take a
trivial example:

    <div>powergen</div><div>italia</div>

= "powergen italia", but:

    <span>powergen</span><span>italia</span>

= "powergenitalia", but the second with the addition of:

    span { display: block }

is back to "powergen italia". So you need to parse and apply styles
(including external stylesheets) as well. Potentially you may also need
to execute JavaScript on the page, which means you also need a
JavaScript interpreter and a DOM implementation. Basically you need a
complete browser to do it on general web pages.
--
https://mail.python.org/mailman/listinfo/python-list
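[Editor's note: a stdlib sketch of why tag-level parsing alone isn't enough: html.parser extracts the text faithfully, but the whitespace that CSS rendering would introduce is already gone. The TextExtractor class is a made-up helper for this example.]

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment (illustrative helper)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

p = TextExtractor()
p.feed("<p><span>powergen</span><span>italia</span></p>")
text = "".join(p.chunks)
print(text)  # the parser alone cannot know whether a space belongs here
```

The parse is correct, yet whether the reader should see "powergenitalia" or "powergen italia" depends on the stylesheet, which never reaches the parser.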
Re: Short, perfect program to read sentences of webpage
Assorted remarks inline below:

On 08Dec2021 20:39, Julius Hamilton wrote:
>deepreader.py:
>
>import sys
>import requests
>import html2text
>import nltk
>
>url = sys.argv[1]

I might spell this:

    cmd, url = sys.argv

which enforces exactly one argument. And since you don't care about the
command name, maybe:

    _, url = sys.argv

because "_" is a conventional name for "a value we do not care about".

>sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))

Neat!

># Activate an elementary reader interface for the text
>for index, sentence in enumerate(sentences):

I would be inclined to count from 1, so "enumerate(sentences, 1)".

>    # Print the sentence
>    print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence + "\n")

Personally, since print() adds a trailing newline, I would drop the
final +"\n". If you want an additional blank line, I would put it in
the input() call below:

>    # Wait for user key-press
>    x = input("\n> ")

You're not using "x". Just discard input()'s return value:

    input("\n> ")

>A lot of refining is possible, and I'd really like to see how some
>more experienced people might handle it.
>
>1. The HTML extraction is not perfect. It doesn't produce as clean
>text as I would like. Sometimes random links or tags get left in
>there.

Maybe try beautifulsoup instead of html2text? The module name is "bs4".

>And the sentences are sometimes randomly broken by newlines.

I would flatten the newlines. Either the simple:

    sentence = sentence.strip().replace("\n", " ")

or maybe better:

    sentence = " ".join(sentence.split())

Cheers,
Cameron Simpson
--
https://mail.python.org/mailman/listinfo/python-list
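[Editor's note: the two flattening idioms Cameron suggests behave slightly differently on runs of whitespace; the sample sentence is illustrative.]

```python
sentence = "A sentence\n  broken by\nnewlines."

# replace() swaps each newline for a space but keeps other whitespace,
# so indentation after a break survives as extra spaces.
simple = sentence.strip().replace("\n", " ")

# split()/join collapses ALL whitespace runs to single spaces.
better = " ".join(sentence.split())

print(repr(simple))  # 'A sentence   broken by newlines.'
print(repr(better))  # 'A sentence broken by newlines.'
```

The split()/join form is the safer default when the source wraps lines with indentation.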
Re: Short, perfect program to read sentences of webpage
On 2021-12-08 19:39, Julius Hamilton wrote:
> Hey,
>
> This is something I have been working on for a very long time. It's
> one of the reasons I got into programming at all. I'd really
> appreciate if people could input some advice on this.
>
> This is a really simple program which extracts the text from webpages
> and displays them one sentence at a time. It's meant to help you
> study dense material, especially documentation, with much more focus
> and comprehension. I actually hope it can be of help to people who
> have difficulty reading. I know it's been of use to me at least.
>
> This is a minimally acceptable way to pull it off currently:
>
> deepreader.py:
>
> import sys
> import requests
> import html2text
> import nltk
>
> url = sys.argv[1]
>
> # Get the html, pull out the text, and sentence-segment it in one
> # line of code
> sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))
>
> # Activate an elementary reader interface for the text
> for index, sentence in enumerate(sentences):
>
>     # Print the sentence
>     print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence + "\n")

You can shorten that with format strings:

    print("\n{}/{}: {}\n".format(index, len(sentences), sentence))

or even:

    print(f"\n{index}/{len(sentences)}: {sentence}\n")

>     # Wait for user key-press
>     x = input("\n> ")
>
> EOF
>
> That's it. A lot of refining is possible, and I'd really like to see
> how some more experienced people might handle it.
>
> 1. The HTML extraction is not perfect. It doesn't produce as clean
> text as I would like. Sometimes random links or tags get left in
> there. And the sentences are sometimes randomly broken by newlines.
>
> 2. Neither is the segmentation perfect. I am currently researching
> developing an optimal segmenter with tools from Spacy.
>
> Brevity is greatly valued. I mean, anyone who can make the program
> more perfect, that's hugely appreciated. But if someone can do it in
> very few lines of code, that's also appreciated.
>
> Thanks very much,
> Julius
--
https://mail.python.org/mailman/listinfo/python-list
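[Editor's note: MRAB's f-string version is easy to check in isolation; the sample values are illustrative.]

```python
index, total, sentence = 3, 10, "Hello world."

# The original concatenation and the f-string build the same string.
old = "\n" + str(index) + "/" + str(total) + ": " + sentence + "\n"
new = f"\n{index}/{total}: {sentence}\n"

print(repr(new))
print(old == new)  # True
```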
Short, perfect program to read sentences of webpage
Hey,

This is something I have been working on for a very long time. It's one
of the reasons I got into programming at all. I'd really appreciate if
people could input some advice on this.

This is a really simple program which extracts the text from webpages
and displays them one sentence at a time. It's meant to help you study
dense material, especially documentation, with much more focus and
comprehension. I actually hope it can be of help to people who have
difficulty reading. I know it's been of use to me at least.

This is a minimally acceptable way to pull it off currently:

deepreader.py:

    import sys
    import requests
    import html2text
    import nltk

    url = sys.argv[1]

    # Get the html, pull out the text, and sentence-segment it in one
    # line of code
    sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))

    # Activate an elementary reader interface for the text
    for index, sentence in enumerate(sentences):

        # Print the sentence
        print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence + "\n")

        # Wait for user key-press
        x = input("\n> ")

EOF

That's it. A lot of refining is possible, and I'd really like to see how
some more experienced people might handle it.

1. The HTML extraction is not perfect. It doesn't produce as clean text
as I would like. Sometimes random links or tags get left in there. And
the sentences are sometimes randomly broken by newlines.

2. Neither is the segmentation perfect. I am currently researching
developing an optimal segmenter with tools from Spacy.

Brevity is greatly valued. I mean, anyone who can make the program more
perfect, that's hugely appreciated. But if someone can do it in very few
lines of code, that's also appreciated.

Thanks very much,
Julius
--
https://mail.python.org/mailman/listinfo/python-list
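[Editor's note: one way to restructure the reader loop so it can be exercised without a terminal is to pass the I/O in as parameters. This is a sketch, not Julius's code; read_sentences is a made-up name, and the real program would feed it nltk.sent_tokenize() output.]

```python
def read_sentences(sentences, prompt=input, out=print):
    """Show sentences one at a time, pausing for a key-press between them."""
    total = len(sentences)
    for index, sentence in enumerate(sentences, 1):  # count from 1
        out(f"{index}/{total}: {sentence}")
        prompt("> ")  # return value discarded; we only want the pause

# Interactive use: read_sentences(nltk.sent_tokenize(text))
# Non-interactive check with stubbed I/O:
shown = []
read_sentences(["First sentence.", "Second sentence."],
               prompt=lambda p: "", out=shown.append)
print(shown)
```

Injecting `prompt` and `out` keeps the loop logic identical while making it testable and reusable outside a terminal.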