Assorted remarks inline below:

On 08Dec2021 20:39, Julius Hamilton <juliushamilton...@gmail.com> wrote:
>deepreader.py:
>
>import sys
>import requests
>import html2text
>import nltk
>
>url = sys.argv[1]

I might spell this:

    cmd, url = sys.argv

which enforces exactly one argument. And since you don't care about the
command name, maybe:

    _, url = sys.argv

because "_" is a conventional name for "a value we do not care about".

>sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))

Neat!

># Activate an elementary reader interface for the text
>for index, sentence in enumerate(sentences):

I would be inclined to count from 1, so "enumerate(sentences, 1)".

>    # Print the sentence
>    print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence + "\n")

Personally, since print() adds a trailing newline, I would drop the final
+"\n". If you want an additional blank line, I would put it in the
input() call below:

>    # Wait for user key-press
>    x = input("\n> ")

You're not using "x". Just discard input()'s return value:

    input("\n> ")

>A lot of refining is possible, and I’d really like to see how some more
>experienced people might handle it.
>
>1. The HTML extraction is not perfect. It doesn’t produce as clean text as
>I would like. Sometimes random links or tags get left in there.

Maybe try beautifulsoup instead of html2text? The module name is "bs4".

>And the
>sentences are sometimes randomly broken by newlines.

I would flatten the newlines. Either the simple:

    sentence = sentence.strip().replace("\n", " ")

or maybe better:

    sentence = " ".join(sentence.split())

Cheers,
Cameron Simpson <c...@cskk.id.au>
-- 
https://mail.python.org/mailman/listinfo/python-list
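
Pulling the remarks above together, here is a minimal sketch of how the argument handling, numbering and newline flattening might look. The function names (parse_url, flatten, format_sentence, read) are hypothetical and not part of the original deepreader.py. Fetching and tokenizing are left out so the sketch runs with the standard library alone.

```python
def parse_url(argv):
    """Return the single URL argument from an argv-style list."""
    # Tuple unpacking enforces exactly one argument after the command
    # name; any other count raises ValueError.
    _, url = argv
    return url


def flatten(sentence):
    """Collapse every run of whitespace, newlines included, to one space."""
    return " ".join(sentence.split())


def format_sentence(index, total, sentence):
    """Build the "index/total: sentence" line shown to the reader."""
    return f"{index}/{total}: {flatten(sentence)}"


def read(sentences):
    """Show one sentence per key-press, counting from 1."""
    total = len(sentences)
    for index, sentence in enumerate(sentences, 1):
        # print() already adds the trailing newline.
        print("\n" + format_sentence(index, total, sentence))
        # The extra blank line lives in the prompt; the return value
        # of input() is deliberately discarded.
        input("\n> ")
```

In the script itself, one would presumably call parse_url(sys.argv) and then pass the list returned by nltk.sent_tokenize(...) to read().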