Re: From JoyceUlysses.txt -- words occurring exactly once

2024-06-04 Thread Edward Teach via Python-list
On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)
Grant Edwards  wrote:

> On 2024-06-03, Edward Teach via Python-list 
> wrote:
> 
> > The Gutenberg Project publishes "plain text".  That's another
> > problem, because "plain text" means UTF-8...and that means
> > Unicode...and that means running some sort of Unicode-to-ASCII
> > conversion in order to get something like "words".  A couple of
> > hours...a couple of hundred lines of C...problem solved!
> 
> I'm curious.  Why does it need to be converted from Unicode to ASCII?
> 
> When you read it into Python, it gets converted right back to
> Unicode...
> 
> 
> 

Well...when using the file linux.words as a useful master list of
"words"...linux.words is strict ASCII.

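A minimal Python sketch of that comparison, under stated assumptions:
ASCII is a subset of Unicode, so words decoded from a UTF-8 text compare
directly against words read from an ASCII list, and only words containing
non-ASCII characters fail to match. The function name `split_by_wordlist`
and the sample data are made up for illustration; a real run would read
the master list from wherever linux.words is installed.

```python
def split_by_wordlist(words, master):
    """Partition words into (known, unknown) by membership in master."""
    # Compare case-insensitively; a set makes each lookup O(1).
    master_set = {w.lower() for w in master}
    known = [w for w in words if w.lower() in master_set]
    unknown = [w for w in words if w.lower() not in master_set]
    return known, unknown


# Stand-in for the contents of an ASCII word list such as linux.words.
master = ["stately", "plump", "cafe"]
# Words as they might come out of a UTF-8 text.
words = ["Stately", "plump", "caf\u00e9"]

known, unknown = split_by_wordlist(words, master)
print(known)    # pure-ASCII words match without any conversion
print(unknown)  # the accented word does not match "cafe"
```

Only the words that land in `unknown` because of accents or typographic
punctuation would need any Unicode-to-ASCII treatment.
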
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: From JoyceUlysses.txt -- words occurring exactly once

2024-06-03 Thread Edward Teach via Python-list
On Sat, 1 Jun 2024 13:34:11 -0600
Mats Wichmann  wrote:

> On 5/31/24 11:59, Dieter Maurer via Python-list wrote:
> 
> hmmm, I "sent" this but there was some problem and it remained
> unsent. Just in case it hasn't All Been Said Already, here's the
> retry:
> 
> > HenHanna wrote at 2024-5-30 13:03 -0700:  
> >>
> >> Given a text file of a novel (JoyceUlysses.txt) ...
> >>
> >> could someone give me a pretty fast (and simple) Python program
> >> that'd give me a list of all words occurring exactly once?  
> > 
> > Your task can be split into several subtasks:
> >   * parse the text into words
> > 
> > This depends on your notion of "word".
> > In the simplest case, a word is any maximal sequence of
> > non-whitespace characters. In this case, you can use `split` for
> > this task  
> 
> This piece is by far "the hard part", because of the ambiguity. For
> example, if I just say non-whitespace, then a word followed by
> punctuation counts as distinct from the bare word. What about
> hyphenation - of which there's both the compound word forms and the
> ones at the end of lines if the source text has been formatted that
> way?  Are all-lowercase words different from the same word starting
> with a capital?  What about non-initial capitals, as happens a fair
> bit in modern usage with acronyms, trademarks (perhaps not in
> Ulysses? :-) ), etc.?  What about accented letters?
> 
> If you want what's at least a quick starting point to play with, you 
> could use a very simple regex - a fair amount of thought has gone
> into what a "word character" is (\w), so it deals with excluding both 
> punctuation and whitespace.
> 
> import re
> from collections import Counter
> 
> # Lowercase the text, then count every run of word characters.
> with open("JoyceUlysses.txt", "r", encoding="utf-8") as f:
>     wordcount = Counter(re.findall(r'\w+', f.read().lower()))
> 
> Now you have a Counter object counting all the "words" with their 
> occurrence counts (by this definition) in the document. You can fish 
> through that to answer the questions asked (find entries with a count
> of 1, 2, 3, etc.)
> 
> Some people Go Big and use something that actually tries to recognize 
> the language, as opposed to making assumptions from ranges of 
> characters.  nltk is a choice there.  But at this point it's not
> really "simple" any longer (though nltk experts might end up
> disagreeing with that).
> 
> 

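For the original question (words occurring exactly once), the Counter
approach quoted above can be finished off roughly as follows. This is
only a sketch using the same `\w+` notion of a "word"; the function name
`words_occurring_once` and the sample sentence are invented for
illustration.

```python
import re
from collections import Counter


def words_occurring_once(text):
    """Return, sorted, the words (runs of word characters) seen exactly once."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sorted(word for word, n in counts.items() if n == 1)


sample = "Yes I said yes I will Yes"
print(words_occurring_once(sample))  # ['said', 'will']
```

For the novel itself, the `sample` string would be replaced by the text
read from the file.
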
The Gutenberg Project publishes "plain text".  That's another problem,
because "plain text" means UTF-8...and that means Unicode...and that
means running some sort of Unicode-to-ASCII conversion in order to get
something like "words".  A couple of hours...a couple of hundred lines
of C...problem solved!
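
For comparison, a rough Python equivalent of such a conversion can be
quite short. This is not the C program mentioned above, just a sketch of
one common technique: decompose characters with `unicodedata.normalize`
and discard whatever is still outside ASCII. The name `fold_to_ascii` is
made up for illustration.

```python
import unicodedata


def fold_to_ascii(text):
    """Decompose accented characters, then drop anything outside ASCII."""
    # NFKD splits e.g. an accented "e" into a plain "e" plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" silently discards the non-ASCII leftovers.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(fold_to_ascii("caf\u00e9 na\u00efve"))  # cafe naive
```

Characters with no ASCII decomposition, such as curly apostrophes, are
dropped rather than replaced, so "don’t" comes out as "dont"; mapping
them to ASCII equivalents first would be a separate step.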
