How quickly is it possible to learn a foreign language? I estimate
that, with an optimally structured curriculum, learning a completely
foreign language --- one with no structure or vocabulary in common
with any language you know --- should take 70 hours or less, about a
week, to get to a level of full, if slow, reading competency. This is
such an astonishingly short estimate that it must surely be wrong,
though I don’t know exactly how. Alternatively, if it is right, proper
curriculum design could enable people to learn things such as foreign
languages two orders of magnitude faster than they currently do.
Understanding a certain minimal vocabulary size is needed for fluency.
----------------------------------------------------------------------

One of the larger tasks in learning a foreign language is vocabulary
memorization. The morphemic inventory of a language is essentially
arbitrary and thus must be memorized, along with a collection of
idiomatic phrases with arbitrary meanings.

But memorizing the entire vocabulary of any language is impossible,
because vocabularies are open, not closed. And learning words or
morphemes at random (say, by choosing random pages in a dictionary)
will tend to be very inefficient, because the vast majority of
distinct morphemes (and words) are hapaxes: they occur only once in
the entire corpus of the language, so learning them is a waste of
time.

There’s a threshold effect in vocabulary learning, vaguely analogous
to the threshold effect in percolation theory. If you know only 80% of
the words (or morphemes) in a text, it’s likely to be almost
completely impenetrable. But if you know 95%, you’re likely to be able
to guess the other 5% of the words from context. Somewhere between the
two is a critical threshold.

The question then arises: how many morphemes do you need to know in
order to understand 85% or 90% or 95% of the morphemes in a text? Of
course, the size varies by text; some texts use much larger
vocabularies than others.

That vocabulary size is about 6000 morphemes.
---------------------------------------------

<http://www.balancedreading.com/vocabulary.html> makes the following
claims about vocabulary sizes:

| Group                                     | Vocabulary size                      | Source                             |
|-------------------------------------------+--------------------------------------+------------------------------------|
| 5 to 6 year olds                          | 2500-5000 words                      | Beck and McKeown (1991)            |
| literate adults                           | 50000 words                          | (his own guess)                    |
| an average high school senior             | 45000 words                          | Nagy and Anderson (1984)           |
| same                                      | 17000 words                          | D’Anna, Zechmeister, & Hall (1991) |
| same                                      | 5000 words                           | Hirsh & Nation (1992)              |
| the English language                      | 50000-60000 “word families” (lemmas) | “most linguists”                   |
| adults who “don’t read very much”         | 5000-10000 word families             | (his own guess)                    |
| “a ‘typical’ college graduate”            | 20000 word families                  | “I have seen estimates”            |
| “people who read 3 to 4 hours a day”      | 25000 word families                  | (his own guess)                    |
| “somebody who dropped out of high school” | 5000-6000 word families              | (his own guess)                    |

You should also be able to make a sort of ballpark estimate by doing
the statistics on a large corpus. I have handy the word frequencies
from the British National Corpus for all words that occur 5 or more
times. I ought to do these statistics over the entire corpus, and
using morphological analysis, but this is a first cut.

> Aside: Google’s teraword corpus
> -------------------------------
>
> I wanted to redo this analysis with the corpus of N-grams over a
> trillion tokens that Google has published, but the only word
> frequency list I have handy is the one Norvig posted on his web page
> at <http://norvig.com/ngrams/>, which only covers 333,333 types,
> which together only add up to about 588 billion words: less than 60%
> of the total corpus. I don’t know why the most common third of a
> million types cover less than two-thirds of the corpus.
> The last “words” in there are nonword things like “lolge”, “oooglo”,
> “klonipan”, “britishairs”, “lgoggle”, “magapass”, “offool”,
> “antemortem”, “iconw”, and “sshool”, occurring at about 13 parts per
> billion each. I’m quite sure that it is not the case that two out of
> every five words in the Web *I’m* reading are weirder and more
> uncommon than “lgoggle”, and no significant amount of what I’m
> reading consists of near-anagrams of “google”.
>
> So I don’t know what’s going on here, but I don’t think this corpus
> is useful for my purposes at the moment.

In this part of the BNC corpus, the total number of tokens (of
109 557 distinct types, most of which are misspellings) is
90 080 933, about 90 million.

80% of this number is 72 064 746. To know 80% of the words (tokens) in
this part of the corpus, you would have to know all 2028 words that
occur at least as frequently as “lie”, which occurs 4610 times (one
out of every 19 540 words). The most frequent words you wouldn’t know
yet would be: stress, severe, liked, mentioned, contains.

85% of this number is 76 568 793. To know 85% of these words, you
would have to know all 3335 words that occur at least as frequently as
“leather”, which occurs 2564 times (one out of every 35 132 words).
The most frequent words you wouldn’t know yet would be:
recommendations, appreciate, alexander, cool, solutions.

90% of this number is 81 072 839. To know 90% of these words, you’d
need all 5982 words that occur at least as frequently as “fry”, which
occurs 1155 times (one out of every 78 195 words), leaving words like:
subjective, rage, pencil, cheerful, cd.

95% of this number is 85 576 886. To know 95%, you’d need all 13 141
words that occur at least as frequently as “loser”, which occurs 347
times (one out of every quarter-million words), leaving words like:
hammersmith, graded, exporters, englishmen, endured.

Now, there are some difficulties that may make these estimates
meaningless, quite aside from the large spread between 3000 and
13 000 words!
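The coverage statistics above can be reproduced mechanically from any
word-frequency list. Here is a minimal sketch; the `coverage_threshold`
function and the toy frequency list are mine, not from the BNC data,
and a real run would load the actual (word, count) pairs instead.

```python
def coverage_threshold(freqs, fraction):
    """Given (word, count) pairs sorted most-frequent-first, return
    (number of words you must know, least frequent word among them)
    to cover `fraction` of all tokens in the corpus."""
    total = sum(count for _, count in freqs)
    target = total * fraction
    covered = 0
    for i, (word, count) in enumerate(freqs):
        covered += count
        if covered >= target:
            return i + 1, word
    return len(freqs), freqs[-1][0]

# Toy list with made-up counts, standing in for the real BNC data:
toy = [("the", 600), ("of", 300), ("lie", 50), ("leather", 30), ("fry", 20)]
n, last = coverage_threshold(toy, 0.90)  # (2, "of") for this toy data
```

Run against the real BNC frequency list, this is exactly the
computation that yields "2028 words up to 'lie'" for 80% and
"13 141 words up to 'loser'" for 95%.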
First, the number of morphemes is quite a bit smaller than the total
number of words (“lose” occurs in five obvious forms: lose, loser,
loses, losers, losing, plus lost and lostness, although “loser” has an
idiomatic meaning that must be memorized, while “fry” occurs in nine
forms: fry, frying, fryer, fryers, frys, fried, fries, frier, and
frypan); second, the words that occur fewer than 5 times in the BNC
may actually shift the 95th percentile by quite a large number of
distinct morphemes.

Memorizing 6000 morphemes should take about 50 hours.
-----------------------------------------------------

For now I will choose to pretend that the number of essential
morphemes to reach the critical threshold is somewhere around 6000.
How can we estimate how difficult this learning task is?

If we model human long-term memory as a simple store that accepts
input at about 0.5 bits per second, we need only estimate how many
bits are contained in 6000 or so arbitrary meanings. Clearly each one
of these requires about 13 bits to distinguish it from any of the
others, plus perhaps a couple more bits to distinguish it from other
equally basic concepts that happen not to be encoded as single
morphemes in a particular language. So that gives us around 90 000
bits, which should require around 180 000 seconds of optimal
memorization: only about 50 hours! This would involve learning about
two new vocabulary words per minute, which seems like a plausible
rate.

People currently take much, much more than 50 hours to learn a new language.
----------------------------------------------------------------------------

50 hours is a rather remarkably small number. If it’s really accurate,
it should be possible to learn the entire essential vocabulary of a
new foreign language each week or so, or possibly even twice a week,
to the point of being able to glark the meanings of new terms from
context. This is at least one order of magnitude better than commonly
observed performance in foreign-language learning.
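The arithmetic behind the 50-hour estimate can be made explicit. All
of the parameters below are the text’s assumptions (6000 morphemes,
about 15 bits each, long-term memory absorbing 0.5 bits per second);
only the calculation itself is added here.

```python
import math

morphemes = 6000
bits_to_distinguish = math.log2(morphemes)    # ~12.6 bits to tell them apart
bits_per_morpheme = 15                        # plus "a couple more bits"
ltm_rate = 0.5                                # bits/second, the assumed
                                              # long-term memory input rate

total_bits = morphemes * bits_per_morpheme    # 90,000 bits
seconds = total_bits / ltm_rate               # 180,000 seconds
hours = seconds / 3600                        # 50 hours
words_per_minute = morphemes / (seconds / 60) # 2 new words per minute
```

Note how sensitive the conclusion is to the 0.5 bits/second figure:
halving the assumed rate doubles the estimate to 100 hours.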
Why might this be?

One possible explanation is that people are usually learning not only
the vocabulary of the language, but also its alphabet, orthography,
phonology, morphology, syntax, and pragmatics at the same time.

A second possible explanation is that typical vocabulary memorization
is very badly structured. Much memorization time is wasted on
overlearning of common morphemes (after a certain point, you don’t
encounter new morphemes often enough any more, or re-encounter them
early enough to reinforce the previous learning of them), while much
of the rest is wasted on learning morphemes that aren’t among the most
frequent few thousand (words like “reverse”, “retaliate”, “railroad”,
and the like, which could likely be learned from context later on).
And students often have to uselessly learn different compounds of the
same morpheme, because they are learning the morpheme before they
learn the productive morphology that would allow them to represent the
compound words more efficiently in their memory.

A third possibility is that the model is wrong: either representing
the meaning of a morpheme takes many more than 15 bits, or many more
than 6000 morphemes are needed to reach the critical threshold, or
human long-term memory is not capable of accepting input at such a
high sustained rate, or it is only capable of doing so under special
circumstances.

If each new morpheme must be presented 20 times (at exponentially
increasing intervals) before being properly learned, that leaves time
for about 1.5 seconds per repetition, which seems challenging and
exhausting, but not implausible.
If such a system worked properly, then once you had learned the
alphabet, orthography, phonology, and morphology of a language (along
with perhaps the couple hundred most common morphemes), you could
spend 8 hours a day for 6 or 7 days being drilled on vocabulary
morphemes, starting with the most common ones --- in a featureless
white room with no distractions --- and go home exhausted afterwards.
And in that week, you’d go from almost no vocabulary to a working
vocabulary that allowed you to read and speak the language with
apparent fluency. If the language had many cognates or calques with a
language you already knew, you could do it much faster, perhaps in two
or three days. This would be an incredible feat.

Other parts of the language learning task would total 12 to 20 hours.
---------------------------------------------------------------------

How long would it take to learn the rest of the language? The other
learning tasks are essentially procedural rather than declarative
knowledge, and so they must be learned by usage and practice, rather
than drills. However, as I asserted at the beginning, they require
memorizing much less information than the vocabulary does.

I think you could probably learn a new alphabetic script in two to
eight hours, although the orthography, phonology, and morphology of
the language may be considerably more arbitrary than that. Still, if
you could be explicitly given the information needed about these
aspects of the language, then practice at the edge of your competency
with continuous feedback until you’d reached a reasonable level of
mastery, you ought to be able to reach a minimal level of mastery of
all these other aspects together in a time comparable to that needed
to learn the vocabulary.
This curriculum would need to be administered by a piece of software
that statistically modeled the current state of your knowledge, so as
to be able to drill you on what needed drilling at that moment, and
kept drilling you at the edge of your competency continuously for many
hours a day. No human instructor could be expected to keep such a
complete model of your competency in their limited human mind, and
even if they could, they would have to work just as hard as you.
Constructing the curriculum would require a careful linguistic
analysis of a particular dialect of the language in question, put into
a machine-readable form.

First, the learner would study the phonology of the language. This
requires training a speech-recognition system for the language, then
having the student repeat back common words, correcting them (“bit,
not beat!”) when they mispronounce them. This also requires an
analysis of the most frequent phonemes in the language, and perhaps an
analysis of which distinctions are the most difficult for new learners
to learn to recognize and reproduce.

This should probably overlap somewhat with the study of the language’s
orthography: once the basics of the phonetic system are down (maybe 40
phonemes: 10 minutes to be able to distinguish them all?), you can
present the names of the letters and grade the learner on their
ability to name the letters and draw the letters named, again
correcting them when they get it wrong. This should require perhaps
another 10 minutes.

At this point, you can also present the spellings of the words that
are being used for pronunciation practice, now that the learner knows
how to recognize and produce the different phonemes; and you can start
to teach a keyboard layout that will allow the learner to indicate
their recognition of a word by typing it rather than handwriting it or
repeating it aloud.
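The statistical learner model such software needs can be quite simple
at its core. Here is a minimal sketch, assuming an exponential
forgetting curve whose time constant grows with each successful drill;
the class name, the decay formula, and the strengthening factors are
all my illustrative choices, not anything specified in the text.

```python
import math
import time

class DrillScheduler:
    """Track an estimated recall probability per item and always
    drill whichever item is closest to being forgotten."""

    def __init__(self):
        self.items = {}  # item -> (memory strength in seconds, last seen)

    def add(self, item, now=None):
        now = time.time() if now is None else now
        self.items[item] = (1.0, now)

    def recall_probability(self, item, now=None):
        strength, last_seen = self.items[item]
        now = time.time() if now is None else now
        # Illustrative forgetting curve: p = exp(-elapsed / strength).
        return math.exp(-(now - last_seen) / strength)

    def next_item(self, now=None):
        # The item with the lowest estimated recall probability is the
        # one at the edge of the learner's competency.
        return min(self.items, key=lambda i: self.recall_probability(i, now))

    def record_result(self, item, correct, now=None):
        strength, _ = self.items[item]
        now = time.time() if now is None else now
        # Strengthen on success, weaken on failure (illustrative factors).
        self.items[item] = (strength * (2.0 if correct else 0.5), now)
```

A real system would layer speech recognition, typing, and handwriting
input on top of this loop, but the scheduling decision --- what to
drill right now --- reduces to the `next_item` query.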
Next would be phonotactics: how phonemes can fit together, and the
rules that govern when particular allophones are produced, which is
crucial to reproducing a particular accent correctly. These rules can
be relatively complex, so this might require a few hours of drill.

Once the learner can read, write, pronounce, and hear the language up
to the phonetic level with a reasonable level of accuracy, it’s time
to present morphology. There may be a few hundred rules to learn, each
with particular morphemes it necessarily involves. Learning a few
hundred morphological rules might take a while. Since the learner has
internalized the phonotactic rules of the language to some degree,
they can compress the phonetic realization of each morpheme using a
first- or second-order Markov chain model, which I think should
compress the pronunciation of a typical morpheme to 8 bits or so. So
learning, say, 300 morphemes requires memorizing 2400 bits, which
should take about 80 minutes. The rules themselves will be
substantially more complex to memorize and practice, requiring
considerably more than 8 bits each to represent, so this could take 10
hours or more.

At this point the learner has spent maybe 12 to 20 hours learning the
basics of the language, and knows the 500 or so most common morphemes,
enough to recognize maybe two-thirds or three-quarters of the
morphemes in most texts. At this point they must embark upon the task
estimated earlier at 50 hours.
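The "8 bits per morpheme" figure is a claim about the entropy a
first-order Markov model assigns to a typical pronunciation, and it
can be estimated empirically. A minimal sketch: the functions and the
toy phoneme corpus below are mine; a real estimate would train on a
phonemically transcribed lexicon of the target language.

```python
import math
from collections import Counter

def bigram_model(corpus):
    """Train a first-order (bigram) model over phoneme sequences,
    using '^' and '$' as start/end markers. Returns (bigram counts,
    context counts)."""
    pairs, context = Counter(), Counter()
    for seq in corpus:
        padded = "^" + seq + "$"
        for a, b in zip(padded, padded[1:]):
            pairs[(a, b)] += 1
            context[a] += 1
    return pairs, context

def bits(seq, pairs, context):
    """Bits the model assigns to one pronunciation, with add-one
    smoothing over the symbols seen in training."""
    symbols = {b for _, b in pairs}
    padded = "^" + seq + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (pairs[(a, b)] + 1) / (context[a] + len(symbols))
        total += -math.log2(p)
    return total

# Tiny illustrative "corpus" of two-phoneme morphemes:
pairs, context = bigram_model(["ba", "ba", "bi"])
```

Whether a typical morpheme in a real language comes out near 8 bits
under such a model is exactly the empirical question the estimate
hinges on; phonotactically predictable sequences cost fewer bits,
unusual ones more.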