Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread Cameron Simpson
On 08Dec2021 23:17, Stefan Ram wrote:
> Regexps might have their disadvantages, but when I use them,
> it is clearer for me to do all the matching with regexps
> instead of mixing them with Python calls like str.isupper.
> Therefore, it is helpful for me to have a regexp to match
> upper and

Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread MRAB
On 2021-12-08 23:17, Stefan Ram wrote: Cameron Simpson writes: Instead, consider the \b (word boundary) and \w (word character) markers, which will let you break strings up, and then maybe test the results with str.isupper(). Thanks for your comments, most or all of them are valid,
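
A minimal sketch of the approach Cameron Simpson describes above: split the text into words with \b and \w, then test each word with str.isupper(). The function name uppercase_words and the sample sentence are illustrative assumptions, not code from the thread.

```python
import re


def uppercase_words(text):
    """Return the words in text that consist entirely of upper-case letters."""
    # \b\w+\b matches each run of word characters between word boundaries.
    words = re.findall(r"\b\w+\b", text)
    # str.isupper() is True only when every cased character is upper case.
    return [word for word in words if word.isupper()]


print(uppercase_words("NLTK and HTML are acronyms, Python is not."))
```

This keeps the regexp simple and leaves the case test to str.isupper(), which is the trade-off the two posters are debating.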

Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread Peter J. Holzer
On 2021-12-09 09:42:07 +1100, Cameron Simpson wrote:
> On 08Dec2021 21:41, Stefan Ram wrote:
> >Julius Hamilton writes:
> >>This is a really simple program which extracts the text from webpages and
> >>displays them one sentence at a time.
> >
> > Our teacher said NLTK will not come up until

Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread Cameron Simpson
On 08Dec2021 21:41, Stefan Ram wrote:
>Julius Hamilton writes:
>>This is a really simple program which extracts the text from webpages and
>>displays them one sentence at a time.
>
> Our teacher said NLTK will not come up until next year, so
> I tried to do with regexps. It still has bugs, for

Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread Jon Ribbens via Python-list
On 2021-12-08, Julius Hamilton wrote:
> 1. The HTML extraction is not perfect. It doesn’t produce as clean text as
> I would like. Sometimes random links or tags get left in there. And the
> sentences are sometimes randomly broken by newlines.

Oh. Leaving tags in suggests you are doing this very

Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread Cameron Simpson
Assorted remarks inline below:

On 08Dec2021 20:39, Julius Hamilton wrote:
>deepreader.py:
>
>import sys
>import requests
>import html2text
>import nltk
>
>url = sys.argv[1]

I might spell this:

    cmd, url = sys.argv

which enforces exactly one argument. And since you don't care about the
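
The unpacking idiom suggested above can be sketched as follows. The helper name parse_url_arg and the usage message are assumptions for illustration. The thread itself only shows the one-line unpacking.

```python
def parse_url_arg(argv):
    """Return the URL from an argv-style list holding exactly one argument."""
    try:
        # Unpacking raises ValueError unless argv has exactly two items:
        # the command name and the single URL argument.
        cmd, url = argv
    except ValueError:
        # Hypothetical usage message; exits the script with an error status.
        raise SystemExit("usage: deepreader.py URL") from None
    return url


print(parse_url_arg(["deepreader.py", "https://example.com"]))
```

In a script this would be called as parse_url_arg(sys.argv). Unlike sys.argv[1], it rejects both a missing URL and any extra arguments.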

Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread MRAB
On 2021-12-08 19:39, Julius Hamilton wrote: Hey, This is something I have been working on for a very long time. It’s one of the reasons I got into programming at all. I’d really appreciate if people could input some advice on this. This is a really simple program which extracts the text from