Jim,

On Tue, Mar 19, 2013 at 5:02 AM, Jim Bromer <[email protected]> wrote:
> Is your fast hash function (which isn't actually any faster) and looking up rules based on the least common words first really going to solve all of the well known NLP problems like ambiguity and noise?

No better than the best of other methods. As I clearly state everywhere, the ONLY benefit here is its orders-of-magnitude faster speed...

As I explained, the "complexity" problem is more related to speed than to complexity. There probably aren't any problems that a man-decade of linguists creating rules wouldn't overcome. The problem is that, until my proposal, there wasn't any approach fast enough to execute their work product. Without any way to execute or even test it, no one is going to pay for it. Without money, it will never happen. Hence, my parsing method is CRITICAL for further work on an intelligent Internet, whether using your methods or mine.

> So you haven't actually tested the software with real NLP problems?

Correct. In the real world, where people value their intellectual property (IP), it is common practice to patent something BEFORE doing anything else, build it, and then file a new patent that re-patents the original material while integrating everything that was learned between concept and execution. This is that first patent.

> Or have you tried it with a few real NLP problems?

Only paper-checking.

> OK, could you explain to me why you think your approach is fast enough to execute their work product?

Recasting your interesting question as "Why do you think the additional speed will be enough?" - which is a REALLY good question: It isn't that my method is a particular ratio faster, but rather that its speed is unrelated to the inapplicable rules. For example, if you merged Russian rules along with English rules (while omitting language detection) in other approaches, their rules would work more slowly because there would be more possibilities driving every decision.
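The claim that speed is unrelated to the inapplicable rules can be sketched concretely. Below is a minimal, hypothetical illustration of bottom-up triggered evaluation: each rule is indexed under its least common word, so rules whose trigger words never appear in the input cost nothing, no matter how many rules are loaded. All names, word lists, and frequency numbers here are invented for illustration; they are not from the patent specification.

```python
# Hypothetical sketch: rules indexed by their rarest word, so only rules
# whose trigger word actually appears in the input are ever examined.
from collections import defaultdict

# Toy word-frequency table; a real system would use corpus statistics.
FREQ = {"the": 1000000, "kicked": 2000, "bucket": 500, "spill": 300, "beans": 800}

rules_by_trigger = defaultdict(list)

def add_rule(words, action):
    """Index a rule under its least common word."""
    trigger = min(words, key=lambda w: FREQ.get(w, 0))
    rules_by_trigger[trigger].append((set(words), action))

def run(tokens):
    """Fire only the rules triggered by rare words present in the input."""
    fired = []
    present = set(tokens)
    for tok in tokens:
        for words, action in rules_by_trigger.get(tok, []):
            if words <= present:          # all of the rule's words must appear
                fired.append(action)
    return fired

add_rule(["kicked", "the", "bucket"], "idiom:die")
add_rule(["spill", "the", "beans"], "idiom:reveal")

print(run("he kicked the bucket yesterday".split()))   # ['idiom:die']
print(run("a plain sentence".split()))                 # []
```

Merging in a second language's rules under this scheme adds entries to the index but performs no extra work on sentences that never mention their trigger words, which is the point being argued above.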
However, with my bottom-up order of triggered evaluation, the only potential slowdown would be in disambiguating words that are spelled the same while meaning different things in the two languages. Lacking significant use of such words, the speed of a system that processed both languages together would be unaffected by the additional rules.

I have looked at the effects of recognizing and substituting thousands of idioms, working from an idiom dictionary (to avoid "cherry picking" the cases). Unfortunately there are no frequency-of-use statistics available, and Google's numbers are worthless. About half are simplistic substitutions. The other half have various challenges, e.g.:

1. Simple substitution would break the surrounding grammar.
2. Sometimes it isn't an idiom - they literally mean what they said (which is where some idioms come from), so disambiguation is needed.
3. Some idioms have more than one idiomatic meaning, requiring more advanced disambiguation.

Idiom and much other disambiguation is all about subject identification, often comparing the likelihood of one subject over another. This is typically handled with heuristics like counting rarely used words in various subjects and comparing the counts. Hence, the real-world PERFORMED rule count to disambiguate and consider the expansion of thousands of idioms would be one rule for every subject-specific word (to increment a count), and one for every least-frequently-used word in ANY idiom. I'd guess (you might guess differently) that handling all of the thousands of idioms would end up performing around one rule per input word. Note that the sorts of rules used in most cases are EXTREMELY simple integer operations that you could easily single-step through in a debugger.
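The subject-counting heuristic described above can be sketched in a few lines. This is a hedged illustration only - the word lists, subject names, and example sentence are all invented, and a real system would use per-word rules rather than a dictionary scan - but it shows why the performed cost is roughly one trivial increment per subject-specific word.

```python
# Sketch of subject identification by counting rarely used, subject-specific
# words; the subject with the highest count wins the disambiguation.
from collections import Counter

# Invented mapping of rare words to the subjects they indicate.
SUBJECT_WORDS = {
    "saucepan": "cooking", "simmer": "cooking", "legume": "cooking",
    "divulge": "secrecy", "confidential": "secrecy",
}

def likely_subject(tokens):
    counts = Counter()
    for tok in tokens:                     # one trivial rule per subject word
        subject = SUBJECT_WORDS.get(tok)
        if subject:
            counts[subject] += 1
    return counts.most_common(1)[0][0] if counts else None

# "spill the beans" is literal in a cooking context, idiomatic otherwise:
text = "simmer the legume mixture before you spill the beans into the saucepan"
print(likely_subject(text.split()))   # cooking
```

With the subject resolved to "cooking", a disambiguation rule could then suppress the idiomatic expansion of "spill the beans" and keep the literal reading.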
My one residual concern, which might also be your concern and may require implementation to fully evaluate, is whether a variant of Amdahl's Law will be a problem: now that I have a method of handling the BIG challenge, will the small stuff rise to eat my lunch? For example, rules are required to identify person, tense, phrase separations, etc., and these rules must work with common words, so they would be MUCH more "expensive" than the zillions of rules that work with less common words. These rules wouldn't do much, but given man-years to create countless rules, there might be several per word that must be performed, amortized over all text. I would think that 10 performed rules per word of ANALYZED text would be some sort of an upper limit, even if Amdahl's Law turns out to be a bit of a challenge.

Remember, if a posting doesn't contain a word indicating that it might be ripe for response, then analysis is skipped altogether. If 10% pass this test, then we are looking at performing ~one rule per word of text on the Internet.

NLP projects now typically have a phase where their implementers rip out expensive but unproductive code/features to run fast, and I suspect that similar "tuning" would be done with a production version of my approach. For example, instead of looking for first-person-indicating words like "I" that are SO common, a more efficient approach might be to look for second- and third-person-indicating words, and if none are found, presume first person.

> Can you give me an explanation of how it would work, because I haven't quite figured out what you are talking about. You would use the hashing method dynamically? What is being stored if not the hash?

The hashing is just a step along the way to efficiently converting tokens to ordinals. All NLP is done on the ordinals (e.g. "the" = 1). A part of the conversion from tokens to ordinals requires collision detection, which is the ONLY use made of the hashes stored in the Lexicon.

Thanks for the really good question.
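The token-to-ordinal conversion just described can be sketched as follows. This is an assumed implementation, not the one in the patent specification: the hash only locates a Lexicon slot, the stored hash is used solely to detect collisions, and all downstream NLP sees small integer ordinals. The hash function, table size, and probing scheme here are invented for illustration.

```python
# Hypothetical Lexicon: hash -> slot, stored hash used only for collision
# detection, ordinals handed to all downstream NLP.
TABLE_SIZE = 1 << 16

class Lexicon:
    def __init__(self):
        self.slots = [None] * TABLE_SIZE   # each slot: (hash, token, ordinal)
        self.next_ordinal = 1

    def _hash(self, token):
        h = 0
        for ch in token:
            h = (h * 131 + ord(ch)) & 0xFFFFFFFFFFFFFFFF   # 64-bit rolling hash
        return h

    def ordinal(self, token):
        h = self._hash(token)
        i = h % TABLE_SIZE
        while True:
            slot = self.slots[i]
            if slot is None:                # new token: assign the next ordinal
                self.slots[i] = (h, token, self.next_ordinal)
                self.next_ordinal += 1
                return self.next_ordinal - 1
            stored_h, stored_tok, ordn = slot
            if stored_h == h and stored_tok == token:
                return ordn                 # same token seen before
            i = (i + 1) % TABLE_SIZE        # collision: probe the next slot

lex = Lexicon()
print([lex.ordinal(t) for t in "the cat saw the cat".split()])  # [1, 2, 3, 1, 2]
```

Since the table is rebuilt on every execution (as stated below in the exchange with Matt), nothing about the hash values ever needs to be stable across runs.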
Steve

=================

> On Mon, Mar 18, 2013 at 11:17 PM, Steve Richfield <[email protected]> wrote:
>
>> Matt,
>>
>> You apparently missed some important details, which led you to jump to some understandable confusions...
>>
>> On Mon, Mar 18, 2013 at 6:03 PM, Matt Mahoney <[email protected]> wrote:
>>
>>> On Mon, Mar 18, 2013 at 4:29 PM, Steve Richfield <[email protected]> wrote:
>>>
>>> As to your hash function, I don't see why this should be any faster than integer arithmetic. Your function multiplies the previous hash by pi/4, then adds the next character. To get the final index, you multiply by a large prime number and truncate to an integer. Of course, all of this could be done with 32- or 64-bit integer arithmetic just as fast (or faster), without the problem of rounding errors.
>>
>> First, rounding errors are no problem at all, for reasons explained below.
>>
>> To get enough bits for a one-in-a-million chance of two words having the same hash, 32 bits wouldn't be enough, but 64 bits would work great.
>>
>> The problem with 64-bit integer arithmetic (which has the benefit of a longer "mantissa") is that (as explained in the specification) you must carefully avoid integer overflows, and THAT is what slows down integer approaches.
>>
>>> Specifically, expressions like (a+b+c) and many others will give you different results depending on the order of evaluation, which depends on which compiler you use, which version, and which optimization settings you use.
>>
>> Right, but so what?! So long as hashing the same thing, with the SAME routine, during the SAME execution, produces the SAME result, it works. Hashes would never be stored between executions, if for no other reason than that it would get in the way of software, table, and rule maintenance.
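A minimal sketch of the floating-point hash as Matt describes it above: multiply the running hash by pi/4, add the next character, then scale by a large prime and truncate to get a table index. The particular prime and table size are invented here; the point being argued is only that identical code on identical input yields identical results within one execution, which is all the scheme requires. (Because pi/4 < 1, the running hash stays bounded, so the final truncation cannot overflow.)

```python
# Sketch of the pi/4 floating-point hash; prime and table size are
# illustrative assumptions, not values from the specification.
import math

LARGE_PRIME = 2_147_483_647   # illustrative choice of large prime
TABLE_SIZE = 1 << 20

def fp_hash(token):
    h = 0.0
    for ch in token:
        h = h * (math.pi / 4.0) + ord(ch)   # pi/4 < 1 keeps h bounded
    return int(h * LARGE_PRIME) % TABLE_SIZE

# Same routine, same execution, same input -> same index, every time.
print(fp_hash("the") == fp_hash("the"))   # True
```

Matt's order-of-evaluation objection (x87 vs. SSE, compiler settings) would only matter if hashes were persisted across builds or executions, which they never are in this design.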
>>> It makes a difference whether the computation is done in the x87 or SSE registers (which depends on compiler options), because x87 uses 80-bit temporaries but SSE uses 64 bits. They also handle underflow differently.
>>
>> Again, so long as executing the same code on the same data produces the same results, no problem.
>>
>>> It may work fine when you tested it, then fail when you update your software and it can't read the old hash tables anymore because something else changed in the build process.
>>
>> Old hash tables would NEVER be read. Instead, they would be recreated on every execution. I foresee minutes of execution time to get through initialization, once there are thousands of words and thousands of rules to process before starting to work.
>>
>>> And no, you do not get overflow errors with integer arithmetic. The result is just truncated, which for many hash functions is actually what you want.
>>
>> Sure, I suppose that could be made to work if you can avoid the overflows.
>>
>>> The rest, I guess, is just speculation. Sure, the user could write lots of rules for parsing English in order to trigger rules for sending spam.
>>
>> Russell and I carefully wrung out the fine line between "SPAM" and semi-intelligent human responses during a prior go-round here on the AGI forum. The final conclusion was that if the comments address what the user said, and responding to them reaches a person who approved the original message to continue the conversation, then it is NOT "SPAM", because the conversation is with a computer-assisted human and NOT a machine.
>>
>>> Is your fast hash function (which isn't actually any faster) and looking up rules based on the least common words first really going to solve all of the well known NLP problems like ambiguity and noise
>>
>> No better than the best of other methods.
>> As I clearly state everywhere, the ONLY benefit here is its orders-of-magnitude faster speed.
>>
>>> and just the enormous complexity of natural language?
>>
>> As I explained, the "complexity" problem is more related to speed than to complexity. There probably aren't any problems that a man-decade of linguists creating rules wouldn't overcome. The problem is that, until my proposal, there wasn't any approach fast enough to execute their work product. Without any way to execute or even test it, no one is going to pay for it. Without money, it will never happen. Hence, my parsing method is CRITICAL for further work on an intelligent Internet, whether using your methods or mine.
>>
>> Do YOU see any problems that a man-decade of linguistic work can't overcome, given a system capable of executing it at practical speeds?
>>
>>> Do you really think that your web crawler will run on a single computer
>>
>> No, it would obviously take quite a few computers. As explained in the specification, most of the "crawlers" (of which there would be many) would need smarts that present crawlers don't have - to capture available metadata as they crawl. This means custom ad hoc programming to handle every sort of forum and social media site.
>>
>>> and you are going to tell Google they are doing it all wrong because they need a 100-petabyte index and a big building with cooling towers?
>>
>> You and I aren't looking at doing what Google does. As explained in the specification, they search, WolframAlpha answers questions, and my approach addresses problems. These are all different activities requiring different approaches. It isn't that one is particularly more difficult than another; they are just different.
>>
>>> Do you really think that users will install spyware on their computers to get around Facebook and Twitter blocking web crawlers just so they can receive your spam?
>> Not to receive spam, but rather to pass it along for profit. You failed to note the key differences in Figure 1 between user/prospects and user/representatives. The user/prospects don't do anything to participate, and the user/representatives do what they do for a piece of the action.
>>
>>> Do you really think you are going to intercept email and send spam cleverly disguised to look like personal replies?
>>
>> There is no "interception" involved. Instead, there is a mechanism to eliminate the forwarding path for the benefit of all EXCEPT the companies whose lunch I intend to eat.
>>
>>> Do you really plan to crawl every blog, look for posts that might be relevant to your spam, and post replies?
>>
>> Many blogs don't permit replies. The ones that do are usually called "forums".
>>
>>> How do you plan to get the poster's email address?
>>
>> I don't need it if it is possible to post replies on the forum, though I would prefer it, to hide the response from others and reduce the apparent "footprint" of this approach.
>>
>>> How do you plan to get past the CAPTCHAs,
>>
>> Most forums don't have these, but people have used the Mechanical Turk to do this. Remember, the volume is MUCH less than with "SPAM" and well within the range of paid labor.
>>
>>> logins,
>>
>> THIS is why I use user/representatives' computers.
>>
>>> spam filters,
>>
>> I don't think this will come close to being caught by spam filters. Greater performance can be achieved with <<1% of the messages, and being built on unique postings, they will have MUCH greater variability than traditional spam.
>>
>> This won't be nearly as obnoxious as present-day advertisements, so I don't foresee a big problem.
>>
>>> and various other forms of moderation?
>>
>> The BIG question in my own mind is: How welcome (or unwelcome) will this be?
>> If it has enough knowledge and is careful enough when sending messages, it might be welcomed in some venues. The BIG challenge in achieving this is not "blowing the cred" before it works right - which is why messages are routed through human reviewers.
>>
>> Of course, donating a piece of the action to the forums this responds to would also help its acceptability.
>>
>>> Do you think that bloggers have never dealt with spam before?
>>
>> But they DO willingly deal with advertisers. This would produce fewer/better advertisements and could easily be plugged in in place of their present advertising.
>>
>> So long as everyone gets a piece of the action, I suspect that this will work quite well.
>>
>>> BTW, I do appreciate the reference to my AGI proposal.
>>
>> You are very welcome. While we may argue over which approach is best, you ARE and always will be the first to propose an overall approach. Without having reflected on your proposal, I probably would never have made this patent application.
>>
>> THANKS, both for your original proposal and for your reply here.
>>
>> Steve
>>
>> *AGI* | Archives <https://www.listbox.com/member/archive/303/=now> | Modify Your Subscription <https://www.listbox.com/member/?&>

--
Full employment can be had with the stroke of a pen. Simply institute a six-hour workday. That will easily create enough new jobs to bring back full employment.
