How quickly is it possible to learn a foreign language? I estimate
that, with an optimally structured curriculum, learning a completely
foreign language --- one with no structure or vocabulary in common
with any language you know --- should take 70 hours or less, about a
week, to get to a level of full, if slow, reading competency. This is
such an astonishingly short estimate that it must surely be wrong,
though I don’t know exactly how. Alternatively, if it is right, proper
curriculum design could enable people to learn things such as foreign
languages two orders of magnitude faster than they currently do.
Understanding a certain minimal vocabulary size is needed for fluency.
----------------------------------------------------------------------

One of the larger tasks in learning a foreign language is vocabulary
memorization. The morphemic inventory of a language is essentially
arbitrary and thus must be memorized, along with a collection of
idiomatic phrases with arbitrary meanings.

But memorizing the entire vocabulary of any language is impossible,
because vocabularies are open, not closed. And learning words or
morphemes at random (say, by choosing random pages in a dictionary)
will tend to be very inefficient, because the vast majority of
distinct morphemes (and words) are hapaxes: they occur only once in
the entire corpus of the language, so learning them is a waste of
time.

There’s a threshold effect in vocabulary learning, vaguely analogous
to the threshold effect in percolation theory. If you know only 80% of
the words (or morphemes) in a text, it’s likely to be almost
completely impenetrable. But if you know 95%, you’re likely to be able
to guess the other 5% of the words from context. Somewhere between the
two is a critical threshold.

The question then arises: how many morphemes do you need to know in
order to understand 85% or 90% or 95% of the morphemes in a text? Of
course, the size varies by text; some texts use much larger
vocabularies than others.

That vocabulary size is about 6000 morphemes.
---------------------------------------------

<http://www.balancedreading.com/vocabulary.html> makes the following
claims about vocabulary sizes:

| Group                                     | Vocabulary size                      | Source                             |
|-------------------------------------------+--------------------------------------+------------------------------------|
| 5 to 6 year olds                          | 2500-5000 words                      | Beck and McKeown (1991)            |
| literate adults                           | 50000 words                          | (his own guess)                    |
| an average high school senior             | 45000 words                          | Nagy and Anderson (1984)           |
| same                                      | 17000 words                          | D’Anna, Zechmeister, & Hall (1991) |
| same                                      | 5000 words                           | Hirsh & Nation (1992)              |
| the English language                      | 50000-60000 “word families” (lemmas) | “most linguists”                   |
| adults who “don’t read very much”         | 5000-10000 word families             | (his own guess)                    |
| “a ‘typical’ college graduate”            | 20000 word families                  | “I have seen estimates”            |
| “people who read 3 to 4 hours a day”      | 25000 word families                  | (his own guess)                    |
| “somebody who dropped out of high school” | 5000-6000 word families              | (his own guess)                    |

You should also be able to make a sort of ballpark estimate by doing
the statistics on a large corpus. I have handy the word frequencies
from the British National Corpus for all words that occur 5 or more
times. I ought to do these statistics over the entire corpus, and
using morphological analysis, but this is a first cut.

> Aside: Google’s teraword corpus
> -------------------------------
>
> I wanted to redo this analysis with the corpus of N-grams over a
> trillion tokens that Google has published, but the only word
> frequency list I have handy is the one Norvig posted on his web page
> at <http://norvig.com/ngrams/>, which only covers 333,333 types,
> which together only add up to about 588 billion words: less than 60%
> of the total corpus. I don’t know why the most common third of a
> million types cover less than two-thirds of the corpus.
> The last “words” in there are nonword things like “lolge”, “oooglo”,
> “klonipan”, “britishairs”, “lgoggle”, “magapass”, “offool”,
> “antemortem”, “iconw”, and “sshool”, occurring at about 13 parts per
> billion each. I’m quite sure that it is not the case that two out of
> every five words in the Web *I’m* reading are weirder and more
> uncommon than “lgoggle”, and no significant amount of what I’m
> reading consists of near-anagrams of “google”.
>
> So I don’t know what’s going on here, but I don’t think this corpus
> is useful for my purposes at the moment.

In this part of the BNC corpus, the total number of tokens (of
109 557 distinct types, most of which are misspellings) is
90 080 933, about 90 million.

80% of this number is 72 064 746. To know 80% of the words (tokens) in
this part of the corpus, you would have to know all 2028 words that
occur at least as frequently as “lie”, which occurs 4610 times (one
out of every 19 540 words). The most frequent words you wouldn’t know
yet would be: stress, severe, liked, mentioned, contains.

85% of this number is 76 568 793. To know 85% of these words, you
would have to know all 3335 words that occur at least as frequently as
“leather”, which occurs 2564 times (one out of every 35 132 words).
The most frequent words you wouldn’t know yet would be:
recommendations, appreciate, alexander, cool, solutions.

90% of this number is 81 072 839. To know 90% of these words, you’d
need all 5982 words that occur at least as frequently as “fry”, which
occurs 1155 times (one out of every 78 195 words), leaving words like:
subjective, rage, pencil, cheerful, cd.

95% of this number is 85 576 886. To know 95%, you’d need all 13 141
words that occur at least as frequently as “loser”, which occurs 347
times (one out of every quarter-million words), leaving words like:
hammersmith, graded, exporters, englishmen, endured.

Now, there are some difficulties that may make these estimates
meaningless, quite aside from the large spread between 3000 and
13 000 words!
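The coverage statistics above can be reproduced mechanically from any
word-frequency list. Here is a minimal sketch; the `coverage_threshold`
function and the toy frequency list are mine, not from the BNC data,
and a real run would load the actual (word, count) pairs instead.

```python
def coverage_threshold(freqs, fraction):
    """Given (word, count) pairs sorted most-frequent-first, return
    (number of words you must know, least frequent word among them)
    to cover `fraction` of all tokens in the corpus."""
    total = sum(count for _, count in freqs)
    target = total * fraction
    covered = 0
    for i, (word, count) in enumerate(freqs):
        covered += count
        if covered >= target:
            return i + 1, word
    return len(freqs), freqs[-1][0]

# Toy list with made-up counts, standing in for the real BNC data:
toy = [("the", 600), ("of", 300), ("lie", 50), ("leather", 30), ("fry", 20)]
n, last = coverage_threshold(toy, 0.90)  # (2, "of") for this toy data
```

Run against the real BNC frequency list, this is exactly the
computation that yields "2028 words up to 'lie'" for 80% and
"13 141 words up to 'loser'" for 95%.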
First, the number of morphemes is quite a bit smaller than the total
number of words (“lose” occurs in five obvious forms: lose, loser,
loses, losers, losing, plus lost and lostness, although “loser” has an
idiomatic meaning that must be memorized, while “fry” occurs in nine
forms: fry, frying, fryer, fryers, frys, fried, fries, frier, and
frypan); second, the words that occur fewer than 5 times in the BNC
may actually shift the 95th percentile by quite a large number of
distinct morphemes.

Memorizing 6000 morphemes should take about 50 hours.
-----------------------------------------------------

For now I will choose to pretend that the number of essential
morphemes to reach the critical threshold is somewhere around 6000.
How can we estimate how difficult this learning task is?

If we model human long-term memory as a simple store that accepts
input at about 0.5 bits per second, we need only estimate how many
bits are contained in 6000 or so arbitrary meanings. Clearly each one
of these requires about 13 bits to distinguish it from any of the
others, plus perhaps a couple more bits to distinguish it from other
equally basic concepts that happen not to be encoded as single
morphemes in a particular language. So that gives us around 90 000
bits, which should require around 180 000 seconds of optimal
memorization: only about 50 hours! This would involve learning about
two new vocabulary words per minute, which seems like a plausible
rate.

People currently take much, much more than 50 hours to learn a new language.
----------------------------------------------------------------------------

50 hours is a rather remarkably small number. If it’s really accurate,
it should be possible to learn the entire essential vocabulary of a
new foreign language each week or so, or possibly even twice a week,
to the point of being able to glark the meanings of new terms from
context. This is at least one order of magnitude better than commonly
observed performance in foreign-language learning.
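The arithmetic behind the 50-hour estimate can be made explicit. All
of the parameters below are the text’s assumptions (6000 morphemes,
about 15 bits each, long-term memory absorbing 0.5 bits per second);
only the calculation itself is added here.

```python
import math

morphemes = 6000
bits_to_distinguish = math.log2(morphemes)    # ~12.6 bits to tell them apart
bits_per_morpheme = 15                        # plus "a couple more bits"
ltm_rate = 0.5                                # bits/second, the assumed
                                              # long-term memory input rate

total_bits = morphemes * bits_per_morpheme    # 90,000 bits
seconds = total_bits / ltm_rate               # 180,000 seconds
hours = seconds / 3600                        # 50 hours
words_per_minute = morphemes / (seconds / 60) # 2 new words per minute
```

Note how sensitive the conclusion is to the 0.5 bits/second figure:
halving the assumed rate doubles the estimate to 100 hours.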
Why might this be?

One possible explanation is that people are usually learning not only
the vocabulary of the language, but also its alphabet, orthography,
phonology, morphology, syntax, and pragmatics at the same time.

A second possible explanation is that typical vocabulary memorization
is very badly structured. Much memorization time is wasted on
overlearning of common morphemes (after a certain point, you don’t
encounter new morphemes often enough any more, or re-encounter them
early enough to reinforce the previous learning of them), while much
of the rest is wasted on learning morphemes that aren’t among the most
frequent few thousand (words like “reverse”, “retaliate”, “railroad”,
and the like, which could likely be learned from context later on).
And students often have to uselessly learn different compounds of the
same morpheme, because they are learning the morpheme before they
learn the productive morphology that would allow them to represent the
compound words more efficiently in their memory.

A third possibility is that the model is wrong: either representing
the meaning of a morpheme takes many more than 15 bits, or many more
than 6000 morphemes are needed to reach the critical threshold, or
human long-term memory is not capable of accepting input at such a
high sustained rate, or it is only capable of doing so under special
circumstances.

If each new morpheme must be presented 20 times (at exponentially
increasing intervals) before being properly learned, that leaves time
for about 1.5 seconds per repetition, which seems challenging and
exhausting, but not implausible.
If such a system worked properly, then once you had learned the
alphabet, orthography, phonology, and morphology of a language (along
with perhaps the couple hundred most common morphemes), you could
spend 8 hours a day for 6 or 7 days being drilled on vocabulary
morphemes, starting with the most common ones --- in a featureless
white room with no distractions --- and go home exhausted afterwards.
And in that week, you’d go from almost no vocabulary to a working
vocabulary that allowed you to read and speak the language with
apparent fluency. If the language had many cognates or calques with a
language you already knew, you could do it much faster, perhaps in two
or three days. This would be an incredible feat.

Other parts of the language learning task would total 12 to 20 hours.
---------------------------------------------------------------------

How long would it take to learn the rest of the language? The other
learning tasks are essentially procedural rather than declarative
knowledge, and so they must be learned by usage and practice, rather
than drills. However, as I asserted at the beginning, they require
memorizing much less information than the vocabulary does.

I think you could probably learn a new alphabetic script in two to
eight hours, although the orthography, phonology, and morphology of
the language may be considerably more arbitrary than that. Still, if
you could be explicitly given the information needed about these
aspects of the language, then practice at the edge of your competency
with continuous feedback until you’d reached a reasonable level of
mastery, you ought to be able to reach a minimal level of mastery of
all these other aspects together in a time comparable to that needed
to learn the vocabulary.
This curriculum would need to be administered by a piece of software
that statistically modeled the current state of your knowledge, so as
to be able to drill you on what needed drilling at that moment, and
kept drilling you at the edge of your competency continuously for many
hours a day. No human instructor could be expected to keep such a
complete model of your competency in their limited human mind, and
even if they could, they would have to work just as hard as you.
Constructing the curriculum would require a careful linguistic
analysis of a particular dialect of the language in question, put into
a machine-readable form.

First, the learner would study the phonology of the language. This
requires training a speech-recognition system for the language, then
having the student repeat back common words, correcting them (“bit,
not beat!”) when they mispronounce them. This also requires an
analysis of the most frequent phonemes in the language, and perhaps an
analysis of which distinctions are the most difficult for new learners
to learn to recognize and reproduce.

This should probably overlap somewhat with the study of the language’s
orthography: once the basics of the phonetic system are down (maybe 40
phonemes: 10 minutes to be able to distinguish them all?), you can
present the names of the letters and grade the learner on their
ability to name the letters and draw the letters named, again
correcting them when they get it wrong. This should require perhaps
another 10 minutes.

At this point, you can also present the spellings of the words that
are being used for pronunciation practice, now that the learner knows
how to recognize and produce the different phonemes; and you can start
to teach a keyboard layout that will allow the learner to indicate
their recognition of a word by typing it rather than handwriting it or
repeating it aloud.
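The statistical learner model such software needs can be quite simple
at its core. Here is a minimal sketch, assuming an exponential
forgetting curve whose time constant grows with each successful drill;
the class name, the decay formula, and the strengthening factors are
all my illustrative choices, not anything specified in the text.

```python
import math
import time

class DrillScheduler:
    """Track an estimated recall probability per item and always
    drill whichever item is closest to being forgotten."""

    def __init__(self):
        self.items = {}  # item -> (memory strength in seconds, last seen)

    def add(self, item, now=None):
        now = time.time() if now is None else now
        self.items[item] = (1.0, now)

    def recall_probability(self, item, now=None):
        strength, last_seen = self.items[item]
        now = time.time() if now is None else now
        # Illustrative forgetting curve: p = exp(-elapsed / strength).
        return math.exp(-(now - last_seen) / strength)

    def next_item(self, now=None):
        # The item with the lowest estimated recall probability is the
        # one at the edge of the learner's competency.
        return min(self.items, key=lambda i: self.recall_probability(i, now))

    def record_result(self, item, correct, now=None):
        strength, _ = self.items[item]
        now = time.time() if now is None else now
        # Strengthen on success, weaken on failure (illustrative factors).
        self.items[item] = (strength * (2.0 if correct else 0.5), now)
```

A real system would layer speech recognition, typing, and handwriting
input on top of this loop, but the scheduling decision --- what to
drill right now --- reduces to the `next_item` query.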
Next would be phonotactics: how phonemes can fit together, and the
rules that govern when particular allophones are produced, which is
crucial to reproducing a particular accent correctly. These rules can
be relatively complex, so this might require a few hours of drill.

Once the learner can read, write, pronounce, and hear the language up
to the phonetic level with a reasonable level of accuracy, it’s time
to present morphology. There may be a few hundred rules to learn, each
with particular morphemes it necessarily involves. Learning a few
hundred morphological rules might take a while. Since the learner has
internalized the phonotactic rules of the language to some degree,
they can compress the phonetic realization of each morpheme using a
first- or second-order Markov chain model, which I think should
compress the pronunciation of a typical morpheme to 8 bits or so. So
learning, say, 300 morphemes requires memorizing 2400 bits, which
should take about 80 minutes. The rules themselves will be
substantially more complex to memorize and practice, requiring
considerably more than 8 bits each to represent, so this could take 10
hours or more.

At this point the learner has spent maybe 12 to 20 hours learning the
basics of the language, and knows the 500 or so most common morphemes,
enough to recognize maybe two-thirds or three-quarters of the
morphemes in most texts. At this point they must embark upon the task
estimated earlier at 50 hours.
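The "8 bits per morpheme" figure is a claim about the entropy a
first-order Markov model assigns to a typical pronunciation, and it
can be estimated empirically. A minimal sketch: the functions and the
toy phoneme corpus below are mine; a real estimate would train on a
phonemically transcribed lexicon of the target language.

```python
import math
from collections import Counter

def bigram_model(corpus):
    """Train a first-order (bigram) model over phoneme sequences,
    using '^' and '$' as start/end markers. Returns (bigram counts,
    context counts)."""
    pairs, context = Counter(), Counter()
    for seq in corpus:
        padded = "^" + seq + "$"
        for a, b in zip(padded, padded[1:]):
            pairs[(a, b)] += 1
            context[a] += 1
    return pairs, context

def bits(seq, pairs, context):
    """Bits the model assigns to one pronunciation, with add-one
    smoothing over the symbols seen in training."""
    symbols = {b for _, b in pairs}
    padded = "^" + seq + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (pairs[(a, b)] + 1) / (context[a] + len(symbols))
        total += -math.log2(p)
    return total

# Tiny illustrative "corpus" of two-phoneme morphemes:
pairs, context = bigram_model(["ba", "ba", "bi"])
```

Whether a typical morpheme in a real language comes out near 8 bits
under such a model is exactly the empirical question the estimate
hinges on; phonotactically predictable sequences cost fewer bits,
unusual ones more.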