>> Well.....when using the file linux.words as a useful master list of
>> "words".....linux.words is strict ASCII........

The meaning of "words" depends on the context. The contents of the file
mentioned are a modest attempt to capture a common subset of English words,
but they are probably not what you mean by words in other contexts, even
contexts that are still pure ASCII: names (especially uncommon names) or
acronyms like UNESCO, for instance. There are other curated word lists for
specialized purposes, such as valid Scrabble words or Wordle words, which
exclude words of lengths that cannot be played. The person looking to count
words in a work must determine what counts as a word for their purpose.
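
If the decision is to use such a master list, something like this sketch
would do (the dictionary path is an assumption; linux.words lives in
different places on different systems):

    # Load a master word list into a set for fast membership tests.
    # The path below is a guess; adjust it for your system.
    with open("/usr/share/dict/words", encoding="ascii") as f:
        master = {line.strip().lower() for line in f}

    print("unesco" in master)  # may well be False; such lists omit many names

Anything not in the set is then, by that definition, not a word.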

ASCII is a small subset of Unicode. So when using a concept of word that
spans many character sets and many languages, things may not be easy to
parse uniquely. A word can contain an apostrophe partway through, as in
d'eau. Text can flow in different directions. There can be fairly complex
rules, and compound words may need to be treated as either one word or
several, sometimes occurring both ways in the same work: is "every body"
the same as "everybody"?
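
The apostrophe case alone trips up a naive tokenizer. A small illustration
(just a sketch using Python's re module):

    import re

    text = "a glass d'eau, and is every body the same as everybody?"

    print(re.findall(r"\w+", text))
    # ['a', 'glass', 'd', 'eau', ...]  -- d'eau is split into two tokens

    print(re.findall(r"\w+(?:'\w+)*", text))
    # ['a', 'glass', "d'eau", ...]  -- apostrophes kept inside words

Neither answer is right or wrong until you have defined what a word is.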

So what is being discussed here may have several components. One is to
tokenize all the text into a set of categories. Another is to count them.
Yet another might analyze and combine multiple categories, or even look at
words in context to decide whether two uses of the same spelling are
different enough to keep apart as two categories: is "polish" the same as
"Polish"?
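
Case folding is the simplest version of that decision, and it cuts both
ways. A sketch:

    from collections import Counter

    tokens = ["Polish", "polish", "Polish"]

    print(Counter(tokens))
    # Counter({'Polish': 2, 'polish': 1})  -- case kept apart

    print(Counter(t.lower() for t in tokens))
    # Counter({'polish': 3})  -- case folded, noun and verb merged

Folding merges the nationality with the verb; not folding treats a
sentence-initial capital as a different word. Either choice is wrong for
some inputs.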

Once that is decided, you have a fairly simple exercise: store the data in a
searchable data structure and run your searches to get subsets, counts, and
so on.
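
For the question in the subject line (words occurring exactly once), a
Counter is one such structure. A minimal sketch, assuming the file sits in
the current directory and taking one arbitrary answer to the tokenizing
questions above (lowercased ASCII letters plus internal apostrophes):

    import re
    from collections import Counter

    # Tokenizing rule here is one arbitrary choice, not the only one.
    with open("JoyceUlysses.txt", encoding="utf-8") as f:
        words = re.findall(r"[a-z]+(?:'[a-z]+)*", f.read().lower())

    counts = Counter(words)
    once = sorted(w for w, n in counts.items() if n == 1)
    print(len(once), "words occur exactly once")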

As mentioned, the native text type in Python is Unicode. ASCII files being
read in as text will be Unicode internally unless you carefully ask
otherwise. The conversion from ASCII to Unicode is trivial, since ASCII is a
subset of UTF-8.
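
For example (Python 3 assumed):

    # Python 3 decodes text files to str (Unicode) as it reads them.
    # ASCII is a subset of UTF-8, so this works unchanged on a pure
    # ASCII file like linux.words.
    with open("linux.words", encoding="utf-8") as f:
        line = f.readline().strip()

    print(type(line))  # <class 'str'> -- already Unicode internally

No explicit conversion step is needed.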

As for how well regular expressions like \w work in general, I have no
idea. I would expect them to be more costly than the simpler patterns you
can write that just know enough about what English words in ASCII look
like, even if those get some edge cases wrong.
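
Rather than guess, it is easy to measure. A rough sketch with timeit
(results will vary by Python version and input; the gap may be smaller than
expected):

    import re
    import timeit

    text = "the quick brown fox jumps over the lazy dog " * 5000
    unicode_pat = re.compile(r"\w+")       # full Unicode word rules
    ascii_pat = re.compile(r"[A-Za-z]+")   # naive ASCII-only version

    print(timeit.timeit(lambda: unicode_pat.findall(text), number=50))
    print(timeit.timeit(lambda: ascii_pat.findall(text), number=50))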


-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail....@python.org> On
Behalf Of Edward Teach via Python-list
Sent: Tuesday, June 4, 2024 7:22 AM
To: python-list@python.org
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once

On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
Grant Edwards <grant.b.edwa...@gmail.com> wrote:

> On 2024-06-03, Edward Teach via Python-list <python-list@python.org>
> wrote:
> 
> > The Gutenberg Project publishes "plain text".  That's another
> > problem, because "plain text" means UTF-8....and that means
> > unicode...and that means running some sort of unicode-to-ascii
> > conversion in order to get something like "words".  A couple of
> > hours....a couple of hundred lines of C....problem solved!  
> 
> I'm curious.  Why does it need to be converted from Unicode to ASCII?
> 
> When you read it into Python, it gets converted right back to
> Unicode...
> 
> 
> 

Well.....when using the file linux.words as a useful master list of
"words".....linux.words is strict ASCII........
