On Mon, Dec 14, 2015 at 5:39 PM, J Rao <[email protected]> wrote:

> Is there a reason we couldn't just measure the frequency using a big
> corpus?
>

Yes. To do that you must process the big corpus, and that processing needs
the frequency information in advance in order to go faster. In short, this
creates a chicken-and-egg problem. Also, the frequencies CHANGE with
time as interests wax and wane.

Remember - it takes MANY occurrences of a word to establish its frequency
with any accuracy.
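
As a rough illustration of why so many occurrences are needed, here is a
minimal sketch that treats a word's count as approximately Poisson-distributed,
so that a frequency estimated from k occurrences has a relative standard error
of about 1/sqrt(k). The Poisson model and both function names are assumptions
made for illustration, not something measured from a corpus.

```python
import math


def relative_error(occurrences: int) -> float:
    """Approximate relative standard error of a frequency estimate.

    Assumes the observed count is Poisson-distributed (an illustrative
    assumption), which gives a relative error of 1/sqrt(count).
    """
    if occurrences <= 0:
        raise ValueError("need at least one occurrence")
    return 1.0 / math.sqrt(occurrences)


def occurrences_needed(target_relative_error: float) -> int:
    """Occurrences needed to reach the target relative error (same assumption)."""
    if target_relative_error <= 0:
        raise ValueError("target relative error must be positive")
    return math.ceil(1.0 / target_relative_error ** 2)


if __name__ == "__main__":
    # A word seen once tells us almost nothing; precision improves slowly.
    for k in (1, 10, 100, 10_000):
        print(f"{k:>6} occurrences -> relative error ~{relative_error(k):.3f}")
```

Under this assumption, a word seen only once has a relative error of roughly
100%, and pinning a frequency down to within about 10% takes on the order of
100 occurrences.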

Note that the Wikipedia article I mentioned processed the entirety of
Wikipedia. You might notice the little dashes at the bottom of the red
line. Those come from words that occur just once in all of Wikipedia -
probably just spelling errors.

For really rare words, like neologisms, there probably aren't enough
occurrences on the entire Internet to establish frequency from observation.

Fortunately, all **I** need to be able to do is compare the frequencies of
short lists of words with the frequencies of other short lists of words,
which hopefully won't be particularly sensitive to the effects I have been
discussing. Even if there is an "error" in such a comparison, it would be
between nearly equally occurring lists, so there would be little lost,
other than a few milliseconds of computer time.
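
A minimal sketch of such a list comparison, using the rank-to-frequency
approximation from the quoted message further down (0.07/N out to rank 10^4,
700/N^2 beyond that). The function names, and the choice to score a list by the
SUM of its words' approximate frequencies, are assumptions made for
illustration only.

```python
BASIC_VOCABULARY = 10_000  # rank at which the quoted Wikipedia curve steepens


def approx_frequency(rank: int) -> float:
    """Approximate frequency of the word at a given rank (1 = most common)."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    if rank <= BASIC_VOCABULARY:
        return 0.07 / rank       # Zipf-like region
    return 700.0 / rank ** 2     # faster drop-off beyond the basic vocabulary


def list_frequency(ranks) -> float:
    """Combined approximate frequency of a short list of word ranks."""
    return sum(approx_frequency(r) for r in ranks)


def rarer_list(ranks_a, ranks_b):
    """Return whichever list has the lower combined approximate frequency."""
    if list_frequency(ranks_a) <= list_frequency(ranks_b):
        return ranks_a
    return ranks_b


if __name__ == "__main__":
    common = [10, 20]          # two very common words
    unusual = [5_000, 20_000]  # two much rarer words
    print(rarer_list(common, unusual))
```

When two lists score nearly the same, picking the "wrong" one changes little,
which is why small errors in the per-word estimates should not matter much here.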

*Steve*
===============

> On 12/15/2015 3:33 AM, Steve Richfield wrote:
>
>> Hi,
>>
>> Just to make sure we are starting on the same page, see the Wikipedia
>> article about Zipf's law at:
>>
>> https://en.wikipedia.org/wiki/Zipf%27s_law
>>
>> In summary, this provides a formula to convert word ranking into
>> approximate frequency of occurrence, which is VERY useful in identifying
>> least frequently used words to trigger processing, etc.
>>
>> Whatever formula someone might consider should sum to 1.0 over an
>> infinite list of ranked words, as each word in a text appears SOMEWHERE in
>> a ranking. However in reality, the story is more complex.
>>
>> Looking at words in Wikipedia, frequency goes as 0.07/N (which does NOT
>> converge for an infinite list of words) out to 10,000 or so, and then drops
>> off considerably more rapidly so that the millionth-ranked word is nearly 2
>> orders of magnitude less frequent than it would be if the linear relationship
>> had continued. Apparently no one has (yet) done the math to fit this to
>> SOMETHING that converges to a total frequency of 1.0.
>>
>> I just HATE non-converging series.
>>
>> Note that a simple formula that fits the ENTIRE Wikipedia curve can be
>> had by simply substituting the formula 700/(N^2) for 0.07/N wherever N>10^4.
>>
>> OK, so where does the magic 10,000 come from? THAT appears to be our
>> basic vocabulary, beyond which various subgroups add their own specialized
>> vocabularies, explaining the rapid drop-off after 10,000 words. A corpus
>> other than Wikipedia that is an amalgamation of many disparate subjects
>> would doubtless have a very different "curve" out beyond 10,000. It looks
>> to me like the 3,000 word basic vocabulary picked the wrong number - they
>> should have gone for 10,000 words.
>>
>> This seems to also say a lot about language granularity - how finely we
>> presume the construction of our universe to be. For those who think we are
>> in some sort of simulation, this might say something about the precision of
>> such a simulation, etc.
>>
>> This seems to also say a lot about how much would be needed by an AI/AGI
>> text "understanding" system - "understanding" somewhere beyond 10^4 words
>> to be broadly useful.
>>
>> Anyway - I saw some wisdom in these numbers, along with some mathematical
>> shortfalls in the associated formulas that someone needs to turn into
>> equations that sum to 1.0.
>>
>> Thoughts?
>>
>> /Steve/
>> *AGI* | Archives <https://www.listbox.com/member/archive/303/=now> <
>> https://www.listbox.com/member/archive/rss/303/26346070-1cd82ca6> |
>> Modify <https://www.listbox.com/member/?&;> Your Subscription    [Powered
>> by Listbox] <http://www.listbox.com>
>>
>>
>
>
>



-- 
Full employment can be had with the stroke of a pen. Simply institute a six
hour workday. That will easily create enough new jobs to bring back full
employment.



