Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread Cameron Simpson
On 08Dec2021 23:17, Stefan Ram  wrote:
>  Regexps might have their disadvantages, but when I use them,
>  it is clearer for me to do all the matching with regexps
>  instead of mixing them with Python calls like str.isupper.
>  Therefore, it is helpful for me to have a regexp to match
>  upper and lower case characters separately. Some regexp
>  dialects support "\p{Lu}" and "\p{Ll}" for this.

Aye. I went looking for that in the Python re module docs and could not 
find them. So the compromise is to match any word, then test the word 
with isupper() (or whatever is appropriate).
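
For example, a minimal sketch of that compromise (the sample text is 
made up):

    import re

    text = "NASA sent the SLS rocket to the Moon"
    # match whole words with \w+, then do the case test in Python
    acronyms = [w for w in re.findall(r"\w+", text) if w.isupper()]
    print(acronyms)  # ['NASA', 'SLS']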

>  I have not yet incorporated (all) your advice into my code,
>  but I came to the conclusion myself that the repetition of
>  long sequences like r"A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ" and
>  not using f strings to insert other strings was especially
>  ugly.

The tricky bit with f-strings and regexps is that \w{3,5} means from 3 
through 5 "word characters". So if you've got those in an f-string 
you're off to double-the-braces land, a bit like double-backslash land 
with non-raw strings.
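
A small sketch of the doubling (pattern and sample text made up):

    import re

    word = r"\w"
    # inside an f-string a literal brace must be doubled: {{3,5}} -> {3,5}
    pattern = f"{word}{{3,5}}"
    print(pattern)  # \w{3,5}
    print(re.findall(pattern, "an example with words"))
    # ['examp', 'with', 'words']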

Otherwise, yes f-strings are a nice way to compose things.

Cheers,
Cameron Simpson 


Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread MRAB

On 2021-12-08 23:17, Stefan Ram wrote:

> Cameron Simpson  writes:
> >Instead, consider the \b (word boundary) and \w (word character)
> >markers, which will let you break strings up, and then maybe test the
> >results with str.isupper().
>
>    Thanks for your comments, most or all of them are
>    valid, and I will try to take them into account!
>
>    Regexps might have their disadvantages, but when I use them,
>    it is clearer for me to do all the matching with regexps
>    instead of mixing them with Python calls like str.isupper.
>    Therefore, it is helpful for me to have a regexp to match
>    upper and lower case characters separately. Some regexp
>    dialects support "\p{Lu}" and "\p{Ll}" for this.

If you want "\p{Lu}" and "\p{Ll}", have a look at the 'regex' module on 
PyPI:


https://pypi.org/project/regex/
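
For example, a quick sketch (after "pip install regex"):

    import regex

    # 'regex' supports Unicode property classes like \p{Lu} (uppercase letter)
    print(regex.findall(r"\p{Lu}", "Müller and Ødegaard"))  # ['M', 'Ø']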

[snip]


Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread Peter J. Holzer
On 2021-12-09 09:42:07 +1100, Cameron Simpson wrote:
> On 08Dec2021 21:41, Stefan Ram  wrote:
> >Julius Hamilton  writes:
> >>This is a really simple program which extracts the text from webpages and
> >>displays them one sentence at a time.
> >
> >  Our teacher said NLTK will not come up until next year, so
> >  I tried to do it with regexps. It still has bugs; for example,
> >  it cannot tell the dot at the end of an abbreviation from
> >  the dot at the end of a sentence!
> 
> This is almost a classic demo of why regexps are a poor tool as a first 
> choice. You can do much with them, but they are cryptic and bug prone.

I don't think that's the problem here. The problem is that natural languages
just aren't regular languages. In fact I'm not sure that they fit
anywhere within the Chomsky hierarchy (but if they aren't type-0, that
would be a strong argument against the possibility of human-level AI).

In English, if a sentence ends with an abbreviation you write only a
single dot. So if you look at these two fragments:

For matching strings, numbers, etc. Python provides regular
expressions.

Let's say you want to match strings, numbers, etc. Python provides
regular expressions for these tasks.

In the second case the dot ends a sentence; in the first it doesn't. But
to distinguish those cases you need to at least parse the sentences at
the syntax level (which regular expressions can't do), maybe even
understand them semantically.
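
A minimal sketch of a naive splitter tripping over exactly this (the
split pattern is made up for illustration):

    import re

    # split after a dot that is followed by whitespace and a capital
    text = ("For matching strings, numbers, etc. Python provides "
            "regular expressions.")
    print(re.split(r"(?<=\.)\s+(?=[A-Z])", text))
    # ['For matching strings, numbers, etc.',
    #  'Python provides regular expressions.']
    # -- wrong: this is one sentence, so nothing should have been split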

hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"




Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread Cameron Simpson
On 08Dec2021 21:41, Stefan Ram  wrote:
>Julius Hamilton  writes:
>>This is a really simple program which extracts the text from webpages and
>>displays them one sentence at a time.
>
>  Our teacher said NLTK will not come up until next year, so
>  I tried to do it with regexps. It still has bugs; for example,
>  it cannot tell the dot at the end of an abbreviation from
>  the dot at the end of a sentence!

This is almost a classic demo of why regexps are a poor tool as a first 
choice. You can do much with them, but they are cryptic and bug prone.

I am not seeking to mock you, but trying to make apparent why regexps 
are to be avoided a lot of the time. They have their place.

You've read the whole re module docs I hope:

https://docs.python.org/3/library/re.html#module-re

>import re
>import urllib.request
>uri = r'''http://example.com/article''' # replace this with your URI!
>request = urllib.request.Request( uri )
>resource = urllib.request.urlopen( request )
>cs = resource.headers.get_content_charset()
>content = resource.read().decode( cs, errors="ignore" )
>content = re.sub( r'''[\r\n\t\s]+''', r''' ''', content )

You're not multiline, so I would recommend a plain raw string:

content = re.sub( r'[\r\n\t\s]+', r' ', content )

No need for \r in the class, \s covers that. From the docs:

  \s
For Unicode (str) patterns:

  Matches Unicode whitespace characters (which includes [ 
  \t\n\r\f\v], and also many other characters, for example the 
  non-breaking spaces mandated by typography rules in many 
  languages). If the ASCII flag is used, only [ \t\n\r\f\v] is 
  matched.
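
And since \s already covers \r, \n and \t, the whole class can shrink
to just:

    content = re.sub(r'\s+', ' ', content)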

>upper = r"[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ]" # "[\\p{Lu}]"
>lower = r"[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]" # "[\\p{Ll}]"

This is very fragile - you have an arbitrary set of additional uppercase 
characters, almost certainly incomplete, and visually hard to inspect 
for completeness.

Instead, consider the \b (word boundary) and \w (word character) 
markers, which will let you break strings up, and then maybe test the 
results with str.isupper().
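
A small sketch of why the built-ins travel better with accented
letters:

    import re

    # \w is Unicode-aware by default, and str.isupper() knows accents too
    for w in re.findall(r"\w+", "École ÉCOLE école"):
        print(w, w.isupper())
    # École False / ÉCOLE True / école False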

>digit = r"[0-9]" #"[\\p{Nd}]"

There's a \d character class for this, which covers non-ASCII decimal 
digits too.
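
A one-line sketch of that:

    import re

    # \d matches any Unicode decimal digit by default, not just 0-9
    print(re.findall(r"\d", "7 ٧ ۷"))  # ['7', '٧', '۷']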

>firstwordstart = upper;
>firstwordnext = "(?:[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-])";

Again, an inline arbitrary list of characters. This is fragile.

>wordcharacter = "[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝa-zµàáâãäåæçèéêëìíîïð\
>ñòóôõöøùúûüýþÿ0-9-]"

Again inline. Why not construct it from the earlier pieces? With their 
enclosing brackets stripped, they merge into a single class:

    wordcharacter = "[" + upper[1:-1] + lower[1:-1] + digit[1:-1] + "-]"

but I recommend \w instead, which already covers letters and digits.

>addition = "(?:(?:[']" + wordcharacter + "+)*[']?)?"

As a matter of good practice with regexp strings, use raw quotes:

addition = r"(?:(?:[']" + wordcharacter + r"+)*[']?)?"

even when there are no backslashes.

Seriously, doing this with regexps is difficult. A useful exercise for 
learning regexps, but in the general case not the first tool to reach 
for.

Cheers,
Cameron Simpson 


Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread Jon Ribbens via Python-list
On 2021-12-08, Julius Hamilton  wrote:
> 1. The HTML extraction is not perfect. It doesn’t produce as clean text as
> I would like. Sometimes random links or tags get left in there. And the
> sentences are sometimes randomly broken by newlines.

Oh. Leaving tags in suggests you are doing this very wrongly. Python
has plenty of open source libraries you can use that will parse the
HTML reliably into tags and text for you.

> 2. Neither is the segmentation perfect. I am currently researching
> developing an optimal segmenter with tools from Spacy.
>
> Brevity is greatly valued. I mean, anyone who can make the program more
> perfect, that’s hugely appreciated. But if someone can do it in very few
> lines of code, that’s also appreciated.

It isn't something that can be done in a few lines of code. There's the
spaces issue you mention for example. Nor is it something that can
necessarily be done just by inspecting the HTML alone. To take a trivial
example:

  <p>powergen</p><p>italia</p>  = powergen  italia

but:

  <span>powergen</span><span>italia</span>  = powergenitalia

but the second with the addition of:

  span { display: block }

is back to "powergen  italia". So you need to parse and apply styles
(including external stylesheets) as well. Potentially you may also need
to execute JavaScript on the page, which means you also need a JavaScript
interpreter and a DOM implementation. Basically you need a complete
browser to do it on general web pages.
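
To make that concrete, a minimal sketch with BeautifulSoup ("pip
install beautifulsoup4") showing that a parser alone cannot see the
CSS:

    from bs4 import BeautifulSoup

    html = "<span>powergen</span><span>italia</span>"
    # get_text() concatenates the inline spans; nothing tells the parser
    # that a stylesheet may have turned them into separate blocks
    print(BeautifulSoup(html, "html.parser").get_text())  # powergenitalia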


Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread Cameron Simpson
Assorted remarks inline below:

On 08Dec2021 20:39, Julius Hamilton  wrote:
>deepreader.py:
>
>import sys
>import requests
>import html2text
>import nltk
>
>url = sys.argv[1]

I might spell this:

cmd, url = sys.argv

which enforces exactly one argument. And since you don't care about the 
command name, maybe:

_, url = sys.argv

because "_" is a conventional name for "a value we do not care about".

>sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))

Neat!
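
One caveat: nltk wants its sentence tokenizer model downloaded once
beforehand, along the lines of:

    import nltk

    nltk.download("punkt")  # one-off fetch of the Punkt tokenizer data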

># Activate an elementary reader interface for the text
>for index, sentence in enumerate(sentences):

I would be inclined to count from 1, so "enumerate(sentences, 1)".

>  # Print the sentence
>  print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence +
>"\n")

Personally, since print() adds a trailing newline, I would drop the 
final +"\n". If you want an additional blank line, I would put it in the 
input() call below:

>  # Wait for user key-press
>  x = input("\n> ")

You're not using "x". Just discard input()'s return value:

input("\n> ")

>A lot of refining is possible, and I’d really like to see how some more
>experienced people might handle it.
>
>1. The HTML extraction is not perfect. It doesn’t produce as clean text as
>I would like. Sometimes random links or tags get left in there.

Maybe try beautifulsoup instead of html2text? The module name is "bs4".
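
For instance, a rough sketch ("pip install beautifulsoup4"):

    from bs4 import BeautifulSoup
    import requests

    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # drop script/style contents, then pull out the visible text
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)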

>And the
>sentences are sometimes randomly broken by newlines.

I would flatten the newlines. Either the simple:

sentence = sentence.strip().replace("\n", " ")

or maybe better:

sentence = " ".join(sentence.split()

Cheers,
Cameron Simpson 


Re: Short, perfect program to read sentences of webpage

2021-12-08 Thread MRAB

On 2021-12-08 19:39, Julius Hamilton wrote:

[snip]

>    # Print the sentence
>    print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence +
> "\n")


You can shorten that with format strings:

print("\n{}/{}: {}\n".format(index, len(sentences), sentence))

or even:

print(f"\n{index}/{len(sentences)}: {sentence}\n")


[snip]





Short, perfect program to read sentences of webpage

2021-12-08 Thread Julius Hamilton
Hey,

This is something I have been working on for a very long time. It’s one of
the reasons I got into programming at all. I’d really appreciate if people
could input some advice on this.

This is a really simple program which extracts the text from webpages and
displays them one sentence at a time. It’s meant to help you study dense
material, especially documentation, with much more focus and comprehension.
I actually hope it can be of help to people who have difficulty reading. I
know it’s been of use to me at least.

This is a minimally acceptable way to pull it off currently:

deepreader.py:

import sys
import requests
import html2text
import nltk

url = sys.argv[1]

# Get the html, pull out the text, and sentence-segment it in one line of code

sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))

# Activate an elementary reader interface for the text

for index, sentence in enumerate(sentences):

  # Print the sentence
  print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence +
"\n")

  # Wait for user key-press
  x = input("\n> ")


EOF



That’s it.

A lot of refining is possible, and I’d really like to see how some more
experienced people might handle it.

1. The HTML extraction is not perfect. It doesn’t produce as clean text as
I would like. Sometimes random links or tags get left in there. And the
sentences are sometimes randomly broken by newlines.

2. Neither is the segmentation perfect. I am currently researching
developing an optimal segmenter with tools from Spacy.

Brevity is greatly valued. I mean, anyone who can make the program more
perfect, that’s hugely appreciated. But if someone can do it in very few
lines of code, that’s also appreciated.

Thanks very much,
Julius