[Martin MOKREJŠ]
> just imagine, you want to compare how many words are in English, German,
> Czech, Polish dictionaries.  You collect words from every language and record
> them in dict or Set, as you wish.
Call the set of all English words E; G, C, and P similarly.

> Once you have those Set's or dict's for those 4 languages, you ask
> for common words

This Python expression then gives the set of words common to all 4:

    E & G & C & P

> and for those unique to Polish.

    P - E - G - C

is a reasonably efficient way to compute that.

> I have no estimates
> of real-world numbers, but we might be in range of 1E6 or 1E8?
> I believe in any case, huge.

No matter how large, it's utterly tiny compared to the number of character
strings that *aren't* words in any of these languages.  English has a lot
of words, but nobody estimates it at over 2 million (including scientific
jargon, like names for chemical compounds):

    http://www.worldwidewords.org/articles/howmany.htm

> My concern is actually purely scientific, not really related to analysis
> of these 4 languages, but I believe it describes my intent quite well.
>
> I wanted to be able to get a list of words NOT found in, say, Polish,
> and therefore wanted to have a list of all theoretically existing words.
> In principle, I can drop this idea of having an ideal, theoretical lexicon.
> But I have to store those real-world dictionaries to hard drive anyway.

Real-world dictionaries shouldn't be a problem.  I recommend you store each
as a plain text file, one word per line.  Then, e.g., to convert that into a
set of words, do

    # iterating a file yields its lines, so set(f) collects one entry per word
    f = open('EnglishWords.txt')
    set_of_English_words = set(f)
    f.close()

You'll have a trailing newline character in each word, but that doesn't
really matter.

Note that if you sort the word-per-line text files first, the Unix `comm`
utility can be used to perform intersection and difference on a pair at a
time with virtually no memory burden (and regardless of file size).
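Putting the pieces together, here's a minimal sketch of the whole comparison.
The four file names are invented for illustration, and each file is assumed to
hold one word per line as above; this version also strips the trailing
newlines, though as noted that isn't strictly necessary:

    def load_words(path):
        # Read a word-per-line text file into a set of words,
        # stripping the trailing newline from each line.
        with open(path) as f:
            return set(line.strip() for line in f)

    # Hypothetical file names -- substitute whatever you actually have.
    E = load_words('EnglishWords.txt')
    G = load_words('GermanWords.txt')
    C = load_words('CzechWords.txt')
    P = load_words('PolishWords.txt')

    common_to_all = E & G & C & P     # words found in all four languages
    polish_only = P - E - G - C       # words found only in the Polish list

    print("%d words common to all four languages" % len(common_to_all))
    print("%d words unique to Polish" % len(polish_only))

Even at a few million words per language, the sets fit comfortably in memory,
and the & and - operators do their work in roughly linear time in the sizes
of the sets involved.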