Re: Short, perfect program to read sentences of webpage

Peter J. Holzer Wed, 08 Dec 2021 15:12:37 -0800

On 2021-12-09 09:42:07 +1100, Cameron Simpson wrote:
> On 08Dec2021 21:41, Stefan Ram <[email protected]> wrote:
> >Julius Hamilton <[email protected]> writes:
> >>This is a really simple program which extracts the text from webpages and
> >>displays them one sentence at a time.
> >
> >  Our teacher said NLTK will not come up until next year, so
> >  I tried to do with regexps. It still has bugs, for example
> >  it can not tell the dot at the end of an abbreviation from
> >  the dot at the end of a sentence!
> 
> This is almost a classic demo of why regexps are a poor tool as a first 
> choice. You can do much with them, but they are cryptic and bug prone.


I don't think that's the problem here. The problem is that natural languages
just aren't regular languages. In fact I'm not sure that they fit
anywhere within the Chomsky hierarchy (but if they aren't type-0, that
would be a strong argument against the possibility of human-level AI).

In English, if a sentence ends with an abbreviation you write only a
single dot. So if you look at these two fragments:

    For matching strings, numbers, etc. Python provides regular
    expressions.

    Let's say you want to match strings, numbers, etc. Python provides
    regular expressions for these tasks.

In the second case the dot ends a sentence; in the first it doesn't. But to
distinguish those cases you need to at least parse the sentences at the
syntax level (which regular expressions can't do), maybe even understand
them semantically.
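
A minimal sketch of this, using the two fragments above. The patterns and
function names are illustrative assumptions, not code from this thread: one
splitter breaks after every dot followed by a capitalised word, the other
adds a hard-coded exception for "etc.". Either way both fragments get the
same treatment, so each splitter is wrong for one of them.

```python
import re

# Naive rule: a sentence ends at . ! or ? followed by whitespace
# and a capital letter.
NAIVE = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

# Same rule, but never split right after the abbreviation "etc."
# (a hypothetical one-entry abbreviation list).
WITH_EXCEPTION = re.compile(r'(?<=[.!?])(?<!etc\.)\s+(?=[A-Z])')


def split_naive(text):
    return NAIVE.split(text)


def split_with_exception(text):
    return WITH_EXCEPTION.split(text)


# One sentence: "etc." is only an abbreviation here.
first = ("For matching strings, numbers, etc. Python provides regular "
         "expressions.")

# Two sentences: the dot after "etc" also ends the first sentence.
second = ("Let's say you want to match strings, numbers, etc. Python "
          "provides regular expressions for these tasks.")

for name, splitter in (("naive", split_naive),
                       ("with exception", split_with_exception)):
    for fragment in (first, second):
        # The naive splitter yields 2 pieces for both fragments,
        # the exception variant yields 1 piece for both.
        print(name, len(splitter(fragment)), splitter(fragment))
```

The text around "etc. Python" is identical in both fragments, so no rule
that only looks at the neighbouring characters can give different answers
for them.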

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [email protected]         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list
