Matt,

You apparently missed some important details, which led you to some understandable confusion...

On Mon, Mar 18, 2013 at 6:03 PM, Matt Mahoney <[email protected]> wrote:

> On Mon, Mar 18, 2013 at 4:29 PM, Steve Richfield
> <[email protected]> wrote:
>
> As to your hash function, I don't see why this should be any faster
> than integer arithmetic. Your function multiplies the previous hash by
> pi/4, then adds the next character. To get the final index, you
> multiply by a large prime number and truncate to an integer. Of
> course, all of this could be done with 32 or 64 bit integer arithmetic
> just as fast (or faster), without the problem of rounding errors.

First, rounding errors are no problem at all, for reasons explained below. To get enough bits for a one-in-a-million chance of two words having the same hash, 32 bits wouldn't be enough, but 64 bits would work great. The problem with 64-bit integer arithmetic (which has the benefit of having a longer "mantissa") is that (as explained in the specification) you must carefully avoid integer overflows, and THAT is what slows down integer approaches.

> Specifically, expressions like (a+b+c) and many others will give you
> different results depending on the order of evaluation, which depends
> on which compiler you use, which version, and which optimization
> settings you use.

Right, but so what?! So long as hashing the same thing, with the SAME routine, during the SAME execution, produces the SAME result, it works. Hashes would never be stored between executions, if for no other reason than that it would get in the way of software, tables, and rules maintenance.

> It makes a difference whether the computation is
> done in the x87 or SSE registers (which depends on compiler options)
> because x87 uses 80 bit temporaries but SSE uses 64 bits. They also
> handle underflow differently.

Again, so long as executing the same code given the same data produces the same results, no problem.

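For concreteness, here is a minimal Python sketch of the floating-point hash as Matt describes it above: multiply the running hash by pi/4, add the next character, then scale by a large prime and truncate. The prime, the table size, the final modulo, and the function names are assumptions made for illustration; the thread does not give them. The last lines show the order-of-evaluation effect Matt mentions.

```python
import math

PI_OVER_4 = math.pi / 4   # multiplier mentioned in the thread
LARGE_PRIME = 1_000_003   # assumed value; the thread only says "a large prime"
TABLE_SIZE = 1 << 20      # assumed table size, not from the thread


def float_hash(word: str) -> float:
    """Multiply the running hash by pi/4, then add the next character code."""
    h = 0.0
    for ch in word:
        h = h * PI_OVER_4 + ord(ch)
    return h


def table_index(word: str) -> int:
    """Scale by a large prime, truncate to an integer, fold into the table.

    The final modulo is an assumption; the thread stops at "truncate".
    """
    return int(float_hash(word) * LARGE_PRIME) % TABLE_SIZE


# Same routine, same execution, same input: the index is reproducible.
same_index = table_index("parse") == table_index("parse")

# Floating-point addition is not associative, so grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
order_matters = left != right
```

Both `same_index` and `order_matters` come out true: grouping can change the low bits of a floating-point sum, yet a given routine still returns the same index for the same word within one run. Whether two different builds agree with each other is a separate question, and it only matters if hashes are stored between executions.
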
> It may work fine when you tested it,
> then fail when you update your software and it can't read the old hash
> tables anymore because something else changed in the build process.

Old hash tables would NEVER be read. Instead, they would be recreated for every execution. I foresee minutes of execution time to get through initialization, once there are thousands of words and thousands of rules to process before starting to work.

> And no, you do not get overflow errors with integer arithmetic. The
> result is just truncated, which for many hash functions is actually
> what you want.

Sure, I suppose that could be made to work if you can avoid the overflows.

> The rest, I guess, is just speculation. Sure, the user could write
> lots of rules for parsing English in order to trigger rules for
> sending spam.

Russell and I carefully wrung out the fine line between "SPAM" and semi-intelligent human responses during a prior go-round here on the AGI forum. The final conclusion was that if the comments address what the user said, and responding to them gets to a person who approved the original message to continue the conversation, then it was NOT "SPAM", because the conversation is with a computer-assisted human and NOT a machine.

> Is your fast hash function (which isn't actually any
> faster) and looking up rules based on the least common words first
> really going to solve all of the well known NLP problems like
> ambiguity and noise

No better than the best of other methods. As I clearly state everywhere, the ONLY benefit here is its orders-of-magnitude faster speed.

> and just the enormous complexity of natural language?

As I explained, the "complexity" problem is more related to speed than complexity. There probably aren't any problems that a man-decade of linguists creating rules wouldn't overcome. The problem is that until my proposal, there hasn't been any approach that was fast enough to execute their work product.

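On the integer-overflow point above, here is a minimal sketch of an integer hash whose result is simply truncated to 64 bits, with the table rebuilt at startup rather than read back from disk. Python integers do not overflow, so the mask stands in for the wraparound a 64-bit unsigned register would give. The multiplier, the function names, and the word list are assumptions for illustration, not taken from the specification.

```python
MASK64 = (1 << 64) - 1           # keep only the low 64 bits
MULTIPLIER = 1_099_511_628_211   # arbitrary odd constant (assumption)


def int_hash(word: str) -> int:
    """Multiply-and-add hash; overflow is truncated, not treated as an error."""
    h = 0
    for ch in word:
        h = (h * MULTIPLIER + ord(ch)) & MASK64
    return h


def build_table(words):
    """Rebuild the hash -> word table from scratch, as on every program start."""
    return {int_hash(w): w for w in words}


# The table lives only for this execution; nothing is stored between runs.
table = build_table(["parse", "hash", "crawler"])
long_word_hash = int_hash("x" * 100)   # wraps many times, stays within 64 bits
```

Because the table is rebuilt on every run, a change of compiler or hash constant between builds cannot leave a stale table behind.
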
Without any way to execute or even test it, no one is going to pay for it. Without money, it will never happen. Hence, my parsing method is CRITICAL for further work on an intelligent Internet, whether using your methods or mine. Do YOU see any problems that a man-decade of linguistic work can't overcome, given a system capable of executing it at practical speeds?

> Do you really think that your web crawler will run on a
> single computer

No, it would obviously take quite a few computers. As explained in the specification, most of the "crawlers" (of which there would be many) would have to have smarts that present crawlers don't have - to capture available metadata as they crawl. This means custom ad hoc programming to handle every sort of forum and social media site.

> and you are going to tell Google they are doing it all
> wrong because they need a 100 petabyte index and a big building with
> cooling towers?

You and I aren't looking at doing what Google does. As explained in the specification, they search, WolframAlpha answers questions, and my approach addresses problems. These are all different activities requiring different approaches. It isn't that one is particularly more difficult than another; they are just different.

> Do you really think that users will install spyware on
> their computers to get around Facebook and Twitter blocking web
> crawlers just so they can receive your spam?

Not to receive spam, but rather to pass it for profit. You failed to note the key differences in Figure 1 between user/prospects and user/representatives. The user/prospects don't do anything to participate, and the user/representatives do what they do for a piece of the action.

> Do you really think you
> are going to intercept email and send spam cleverly disguised to look
> like personal replies.

There is no "interception" involved. Instead, there is a mechanism to eliminate the forwarding path for the benefit of all EXCEPT the companies whose lunch I intend to eat.

> Do you really plan to crawl every blog, look
> for posts that might be relevant to your spam and post replies?

Many blogs don't permit replies. The ones that do are usually called "forums".

> How do
> you plan to get the poster's email address?

I don't need it if it is possible to post replies on the forum, though I would prefer it to hide the response from others, to reduce the apparent "footprint" of this approach.

> How do you plan to get
> past the CAPTCHAS,

Most forums don't have these, but people have used the Mechanical Turk to do this. Remember, the volume is MUCH less than with "SPAM" and well within the range of paid labor.

> logins,

THIS is why I use user/representatives' computers.

> spam filters,

I don't think this will come close to being caught by spam filters. Greater performance can be achieved with <<1% of the messages, and being built on unique postings, they will have MUCH greater variability than traditional spam. This won't be nearly as obnoxious as present-day advertisements, so I don't foresee a big problem.

> and various other forms of moderation?

The BIG question in my own mind is: How welcome (or unwelcome) will this be? If it has enough knowledge and is careful enough when sending messages, it might be welcomed in some venues. The BIG challenge to achieving this is not "blowing the cred" before it works right - which is why messages are routed through human reviewers. Of course, donating a piece of the action to the forums this responds to would also help its acceptability.

> Do you think that bloggers have never dealt with spam
> before?

But, they DO willingly deal with advertisers. This would produce fewer/better advertisements and could easily be plugged in place of their present advertising. So long as everyone gets a piece of the action, I suspect that this will work quite well.

> BTW, I do appreciate the reference to my AGI proposal.

You are very welcome.
While we may argue over what approach is best, you ARE and always will be the first to propose an overall approach. Without having reflected on your proposal, I probably would never have made this patent application. THANKS, both for your original proposal, and for your reply here.

Steve
