Re: [agi] Fast Parsing in support of the coming Intelligent Internet

Steve Richfield Mon, 18 Mar 2013 20:18:09 -0700

Matt,

You apparently missed some important details, which led you to some
understandable confusion...

On Mon, Mar 18, 2013 at 6:03 PM, Matt Mahoney <[email protected]> wrote:

> On Mon, Mar 18, 2013 at 4:29 PM, Steve Richfield
> <[email protected]> wrote:
>
>  As to your hash function, I don't see why this should be any faster
> than integer arithmetic. Your function multiplies the previous hash by
> pi/4, then adds the next character. To get the final index, you
> multiply by a large prime number and truncate to an integer. Of
> course, all of this could be done with 32 or 64 bit integer arithmetic
> just as fast (or faster), without the problem of rounding errors.
>

First, rounding errors are no problem at all for reasons explained below.
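
For reference, here is a minimal sketch of the hash as Matt summarizes it
above: multiply the running value by pi/4, add the next character, then
multiply by a large prime and truncate. The prime, the table size, and the
final modulo step are assumptions made for illustration; none of them is
given in this thread.

```python
import math

TABLE_SIZE = 1 << 20      # assumed table size, not from the thread
LARGE_PRIME = 1_000_003   # assumed stand-in for the "large prime"

def float_hash(word: str) -> float:
    """Running hash: previous value times pi/4, plus the next character code."""
    h = 0.0
    for ch in word:
        h = h * (math.pi / 4) + ord(ch)
    return h

def table_index(word: str) -> int:
    """Scale by the prime, truncate to an integer, and fold into the table."""
    return int(float_hash(word) * LARGE_PRIME) % TABLE_SIZE  # modulo is an assumption

idx = table_index("parsing")
```

Within one run, the same word always lands on the same index, which is the
only property relied on here.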

To get enough bits for a one-in-a-million chance of two words having the
same hash, 32 bits wouldn't be enough, but 64 bits would work great.
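
A rough check of that claim, using the standard birthday-bound
approximation and assuming uniformly distributed hash values. The
vocabulary size of 10,000 words is an assumed figure chosen for
illustration.

```python
import math

def collision_probability(n_words: int, bits: int) -> float:
    """Approximate chance that any two of n_words share a hash of the given width."""
    pairs = n_words * (n_words - 1) / 2
    # 1 - exp(-pairs / 2^bits), computed with expm1 to keep tiny values accurate
    return -math.expm1(-pairs / 2.0**bits)

N_WORDS = 10_000  # assumed vocabulary size
p32 = collision_probability(N_WORDS, 32)
p64 = collision_probability(N_WORDS, 64)
print(f"32-bit: {p32:.3g}  64-bit: {p64:.3g}")
```

Under these assumptions, 32 bits gives a collision chance of roughly 1%,
far above one in a million, while 64 bits gives about 3 in 10^12.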

The problem with 64-bit integer arithmetic (which has the benefit of having
a longer "mantissa") is that (as explained in the specification) you must
carefully avoid integer overflows, and THAT is what slows down integer
approaches.

> Specifically, expressions like (a+b+c) and many others will give you
> different results depending on the order of evaluation, which depends
> on which compiler you use, which version, and which optimization
> settings you use.

Right, but so what?! So long as hashing the same thing, with the SAME
routine, during the SAME execution, produces the SAME result, it works.
Hashes would never be stored between executions, if for no other reason
than it would get in the way of software, tables, and rules maintenance.
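
A small illustration of both points, using ordinary IEEE 754 doubles: the
grouping of a sum can change the result, yet repeating the same expression
in the same run reproduces it exactly. The values are arbitrary.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c     # one evaluation order
right = a + (b + c)    # the other evaluation order
repeat = (a + b) + c   # same order as `left`, evaluated again

print(left == right)   # the two groupings differ in the last bit
print(left == repeat)  # the same expression repeats exactly
```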

> It makes a difference whether the computation is
> done in the x87 or SSE registers (which depends on compiler options)
> because x87 uses 80 bit temporaries but SSE uses 64 bits. They also
> handle underflow differently.

Again, so long as executing the same code given the same data produces the
same results, no problem.

> It may work fine when you tested it,
> then fail when you update your software and it can't read the old hash
> tables anymore because something else changed in the build process.
>

Old hash tables would NEVER be read. Instead, they would be recreated for
every execution. I foresee minutes of execution time to get through
initialization, once there are thousands of words and thousands of rules to
process before starting to work.
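
A minimal sketch of that rebuild-on-start-up pattern. The function name and
the word list are hypothetical. Python's own string hashing is randomized
per process by default, so its dictionaries already behave this way: the
hash values are only meaningful within one execution.

```python
def build_word_table(words):
    """Rebuild the word -> id table from scratch; nothing is loaded from disk."""
    table = {}
    for word in words:
        # Assign the next free id the first time a word is seen.
        table.setdefault(word, len(table))
    return table

table = build_word_table(["fast", "parsing", "fast", "internet"])
```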

> And no, you do not get overflow errors with integer arithmetic. The
> result is just truncated, which for many hash functions is actually
> what you want.
>

Sure, I suppose that could be made to work if you can avoid the overflows.
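
For comparison, a sketch of the integer approach with deliberate
truncation. This uses FNV-1a, a well-known 64-bit integer hash, which is
not the function discussed in this thread. Python integers never overflow,
so the mask below emulates the wraparound that unsigned 64-bit arithmetic
performs in C.

```python
MASK64 = (1 << 64) - 1            # keep only the low 64 bits
FNV_OFFSET = 0xCBF29CE484222325   # standard FNV-1a 64-bit offset basis
FNV_PRIME = 0x100000001B3         # standard FNV-1a 64-bit prime

def fnv1a_64(word: str) -> int:
    """FNV-1a over the UTF-8 bytes, truncated to 64 bits at every step."""
    h = FNV_OFFSET
    for byte in word.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & MASK64  # high bits are simply discarded
    return h

h = fnv1a_64("parsing")
```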

>
> The rest, I guess, is just speculation. Sure, the user could write
> lots of rules for parsing English in order to trigger rules for
> sending spam.

Russell and I carefully wrung out the fine line between "SPAM" and
semi-intelligent human responses during a prior go-round here on the AGI
forum. The final conclusion was that if the comments address what the user
said, and responding to them gets to a person who approved the original
message to continue the conversation, then it was NOT "SPAM", because the
conversation is with a computer-assisted human and NOT a machine.

> Is your fast hash function (which isn't actually any
> faster) and looking up rules based on the least common words first
> really going to solve all of the well known NLP problems like
> ambiguity and noise

No better than the best of other methods. As I clearly state everywhere,
the ONLY benefit here is its orders-of-magnitude faster speed.
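
One way to read "looking up rules based on the least common words first",
sketched under assumptions: each rule is a tuple of words that must all be
present, and each rule is filed under its rarest word, so a rule is only
examined when that rare word actually occurs. The rule format, the
frequency table, and the function names are hypothetical, not taken from
the specification.

```python
from collections import defaultdict

def build_rule_index(rules, word_freq):
    """File each rule under its least common word."""
    index = defaultdict(list)
    for rule in rules:
        rarest = min(rule, key=lambda w: word_freq.get(w, 0))
        index[rarest].append(rule)
    return index

def matching_rules(text_words, index):
    """Check only the rules whose rarest word occurs in the text."""
    present = set(text_words)
    hits = []
    for word in present:
        for rule in index.get(word, ()):
            if all(w in present for w in rule):
                hits.append(rule)
    return hits

freq = {"the": 1000, "of": 900, "hash": 8, "patent": 5, "collision": 3}
rules = [("the", "patent"), ("the", "of"), ("hash", "collision")]
index = build_rule_index(rules, freq)
hits = matching_rules("the patent was filed".split(), index)
```

Common words such as "the" never trigger a lookup on their own, which is
where the claimed saving would come from.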

> and just the enormous complexity of natural language?

As I explained, the "complexity" problem is more related to speed than
complexity. There probably aren't any problems that a man-decade of
linguists creating rules wouldn't overcome. The problem is that until my
proposal, there hasn't been any approach that was fast enough to execute
their work product. Without any way to execute or even test it, no one is
going to pay for it. Without money, it will never happen. Hence, my parsing
method is CRITICAL for further work on an intelligent Internet, whether
using your methods or mine.

Do YOU see any problems that a man-decade of linguistic work can't
overcome, given a system capable of executing it at practical speeds?

> Do you really think that your web crawler will run on a
> single computer

No, it would obviously take quite a few computers. As explained in the
specification, most of the "crawlers" (of which there would be many) would
have to have smarts that present crawlers don't have - to capture available
metadata as they crawl. This means custom ad hoc programming to handle
every sort of forum and social media site.

> and you are going to tell Google they are doing it all
> wrong because they need a 100 petabyte index and a big building with
> cooling towers?

You and I aren't looking at doing what Google does. As explained in the
specification, they search, WolframAlpha answers questions, and my approach
addresses problems. These are all different activities requiring different
approaches. It isn't that one is particularly more difficult than another,
they are just different.

> Do you really think that users will install spyware on
> their computers to get around Facebook and Twitter blocking web
> crawlers just so they can receive your spam?

Not to receive spam, but rather to pass it for profit. You failed to note
the key differences in Figure 1 between user/prospects and
user/representatives. The user/prospects don't do anything to participate,
and the user/representatives do what they do for a piece of the action.

> Do you really think you
> are going to intercept email and send spam cleverly disguised to look
> like personal replies.

There is no "interception" involved. Instead, there is a mechanism to
eliminate the forwarding path for the benefit of all EXCEPT the companies
whose lunch I intend to eat.

> Do you really plan to crawl every blog, look
> for posts that might be relevant to your spam and post replies?

Many blogs don't permit replies. The ones that do are usually called
"forums".

> How do
> you plan to get the poster's email address?

I don't need it if it is possible to post replies on the forum, though I
would prefer it to hide the response from others, to reduce the apparent
"footprint" of this approach.

> How do you plan to get
> past the CAPTCHAS,

Most forums don't have these, but people have used the Mechanical Turk to
do this. Remember, the volume is MUCH less than with "SPAM" and well within
the range of paid labor.

> logins,

THIS is why I use user/representative's computers.

> spam filters,

I don't think this will come close to being caught by spam filters. Greater
performance can be achieved with <<1% of the messages, and being built on
unique postings, they will have MUCH greater variability than traditional
spam.

This won't be nearly as obnoxious as present-day advertisements, so I don't
foresee a big problem.

> and various other forms of moderation?

The BIG question in my own mind is: How welcome (or unwelcome) will this
be? If it has enough knowledge and is careful enough when sending messages,
it might be welcomed in some venues. The BIG challenge to achieving this is
not "blowing the cred" before it works right, which is why messages are
routed through human reviewers.

Of course, donating a piece of the action to the forums this responds to
would also help its acceptability.

> Do you think that bloggers have never dealt with spam
> before?
>

But, they DO willingly deal with advertisers. This would produce
fewer/better advertisements and could easily be plugged in place of their
present advertising.

So long as everyone gets a piece of the action, I suspect that this will
work quite well.

>
> BTW, I do appreciate the reference to my AGI proposal.
>

You are very welcome. While we may argue over what approach is best, you
ARE and always will be the first to propose an overall approach. Without
having reflected on your proposal, I probably would never have made this
patent application.

THANKS, both for your original proposal, and for your reply here.

Steve

