On Fri, Jun 1, 2018 at 6:26 AM, beliavsky--- via Python-list
<python-list@python.org> wrote:
> I bought some e-books in a Humble Bundle. The file names are shown below. I
> would like to hyphenate words within the file names, so that the first three
> titles are
>
> a_devils_chaplain.pdf
> atomic_accidents.pdf
> chaos_making_a_new_science.pdf
>
> Is there a Python library that uses intelligent guesses to break sequences of
> characters into words? The general strategy would be to break strings into
> the longest words possible. The library would need to "know" a sizable subset
> of words in English.
>
> adevilschaplain.pdf
> atomicaccidents.pdf
> chaos_makinganewscience.pdf

Let's start with the easy bit. On many MANY Unix-like systems, you can
find a dictionary of words in the user's language (not necessarily
English, but that's appropriate here - it means your script will work
on a French or German or Turkish or Russian system as well) at
/usr/share/dict/words. All you have to do is:

    with open("/usr/share/dict/words") as f:
        words = f.read().strip().split("\n")

Tada! That'll give you somewhere between 50K and 650K words, for
English. (I have eight English dictionaries installed, ranging from
american-english-small and british-english-small at 51K all the way up
to their corresponding -insane variants at 650K.) Most likely you'll
have about 100K words, which is a good number to be working with.

If you're on Windows, see if you can just download something from
wordlist.sourceforge.net or similar; it should be in the same format.

So! Now for the next step. You need to split a pile of letters such
that each of the resulting pieces is a word. You're probably going to
find some that just don't work ("x-15diary" seems dubious), but for
the most part, you should get at least _some_ result. You suggested a
general strategy of breaking strings into the longest words possible,
which would be easy enough to code. A basic algorithm of "take as many
letters as you can while still finding a word" is likely to give you
fairly decent results. You'll need a way of backtracking in the event
that the rest of the letters don't work ("theedgeofphysics" will take
a first word of "thee", but then "dgeofphysics" isn't going to work
out well), but otherwise, I think your basic idea is sound.

Should be a fun project!

ChrisA
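
For illustration, here is a minimal sketch of the "longest word first,
with backtracking" idea described above. It is one possible way to code
it, not a definitive implementation. The function name split_words and
the tiny demo_words set are hypothetical, chosen only for this example;
in practice the word list would come from something like
/usr/share/dict/words. Matching is case-sensitive and assumes the input
contains only letters, so names with digits or punctuation would need
extra handling.

```python
from functools import lru_cache


def split_words(text, words):
    """Split text into dictionary words, trying longer words first.

    Returns a list of words, or None if no complete split exists.
    """
    words = frozenset(words)
    longest = max(map(len, words), default=0)

    @lru_cache(maxsize=None)  # remember results so dead ends aren't re-explored
    def solve(start):
        if start == len(text):
            return ()  # consumed everything: success
        # Try the longest candidate first, then back off to shorter ones.
        for end in range(min(len(text), start + longest), start, -1):
            piece = text[start:end]
            if piece in words:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None  # dead end: the caller will try a shorter word

    result = solve(0)
    return None if result is None else list(result)


# Tiny hypothetical word list, just for demonstration.
demo_words = {"a", "the", "thee", "edge", "of", "physics", "atomic", "accidents"}

print(split_words("theedgeofphysics", demo_words))
print(split_words("atomicaccidents", demo_words))
```

With this demo set, "theedgeofphysics" first tries "thee", fails on
"dgeofphysics", and backs off to "the". Joining the result with
"_".join(...) would give the underscore-separated name. With a full
system dictionary the output may differ, since large word lists often
contain single letters and obscure words that produce odd splits.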