On Fri, Jun 1, 2018 at 6:26 AM, beliavsky--- via Python-list
<python-list@python.org> wrote:
> I bought some e-books in a Humble Bundle. The file names are shown below. I 
> would like to hyphenate words within the file names, so that the first three 
> titles are
>
> a_devils_chaplain.pdf
> atomic_accidents.pdf
> chaos_making_a_new_science.pdf
>
> Is there a Python library that uses intelligent guesses to break sequences of 
> characters into words? The general strategy would be to break strings into 
> the longest words possible. The library would need to "know" a sizable subset 
> of words in English.
>
> adevilschaplain.pdf
> atomicaccidents.pdf
> chaos_makinganewscience.pdf

Let's start with the easy bit. On many MANY Unix-like systems, you can
find a dictionary of words in the user's language (not necessarily
English, but that's appropriate here - it means your script will work
on a French or German or Turkish or Russian system as well) at
/usr/share/dict/words. All you have to do is:

# One word per line; splitlines() drops the trailing newlines for us.
with open("/usr/share/dict/words") as f:
    words = f.read().splitlines()

Tada! That'll give you somewhere between 50K and 650K words, for
English. (I have eight English dictionaries installed, ranging from
american-english-small and british-english-small at 51K all the way up
to their corresponding -insane variants at 650K.) Most likely you'll
have about 100K words, which is a good number to be working with. If
you're on Windows, see if you can just download something from
wordlist.sourceforge.net or similar; it should be in the same format.
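
One practical wrinkle: these word lists often include capitalized proper nouns and possessive forms like "apple's", while the file names are all lowercase with no punctuation. Here is a rough sketch of a cleanup step, assuming you want a lowercase set of purely alphabetic entries. The name normalize() is made up for illustration, and a set is my choice because membership tests on it are fast.

```python
def normalize(lines):
    """Build a lowercase set of purely alphabetic words from an iterable of lines."""
    stripped = (line.strip() for line in lines)
    # str.isalpha() is False for "" and for anything containing an apostrophe.
    return {w.lower() for w in stripped if w.isalpha()}


# Usage, with the same path as above (a file object iterates over its lines):
# with open("/usr/share/dict/words") as f:
#     words = normalize(f)
```

Whether dropping the possessives is right depends on your titles; "devils" only splits cleanly if the plain form is in the list.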

So! Now for the next step. You need to split a pile of letters such
that each of the resulting pieces is a word. You're probably going to
find some that just don't work ("x-15diary" seems dubious), but for
the most part, you should get at least _some_ result. You suggested a
general strategy of breaking strings into the longest words possible,
which would be easy enough to code. A basic algorithm of "take as many
letters as you can while still finding a word" is likely to give you
fairly decent results. You'll need a way of backtracking in the event
that the rest of the letters don't work ("theedgeofphysics" will take
a first word of "thee", but then "dgeofphysics" isn't going to work
out well), but otherwise, I think your basic idea is sound.
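
To make that concrete, here is a minimal sketch of longest-word-first splitting with backtracking. It is one way to do it, not a polished library: segment() and underscore_name() are names I invented, `words` is assumed to be a set of lowercase words, and the lru_cache is only there so that a dead-end position isn't explored more than once.

```python
import os
from functools import lru_cache


def segment(text, words):
    """Split text into words, trying the longest candidate first.

    Returns a list of words, or None if no complete split exists.
    """
    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return ()                      # nothing left: success
        for end in range(len(text), start, -1):
            head = text[start:end]
            if head in words:
                rest = solve(end)          # try to split the remainder
                if rest is not None:
                    return (head,) + rest
                # Otherwise backtrack and try a shorter first word.
        return None

    result = solve(0)
    return None if result is None else list(result)


def underscore_name(filename, words):
    """Apply segment() to each underscore-separated chunk of a file name."""
    stem, ext = os.path.splitext(filename)
    parts = []
    for chunk in stem.split("_"):
        pieces = segment(chunk, words)
        # Leave a chunk untouched when it can't be split into known words.
        parts.extend(pieces if pieces else [chunk])
    return "_".join(parts) + ext
```

With a toy vocabulary of {"the", "thee", "edge", "of", "physics"}, segment("theedgeofphysics", ...) tries "thee" first, fails on the remainder, and falls back to "the". With a real dictionary you may get odd but valid splits, so it's worth eyeballing the results before renaming anything.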

Should be a fun project!

ChrisA
