Ben Sgro (ProjectSkyline) wrote:
I'm curious what field types, indexes are best for this type of application. I'm also curious about benchmarks. Google can return a huge number of records in a fraction of a second. Is this a forked process, each doing small amounts of work, or one large beefy server doing the transaction? My DB experiance is mostly mySQL and I would prefer to build using this.
Most popular relational databases have a full text search extension. This includes MySQL, Oracle, MS SQL and Postgres -- unfortunately, these implementations do not correspond to any standard, so the details are different for different databases. MySQL has a full-text index that works quite well, with that caveat that it only works for MyISAM at the moment:

   http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

this means you can have full-text indexes or transactions, not both. Assuming you have a MyISAM table, implementing fulltext search is as simple as creating a FULLTEXT index and running queries with the MATCH operator. You could be benchmarking MySQL with your own data in less than half an hour (unless it takes more than that to rebuild your index!) The fulltext index will be updated automatically when you modify the table, so it's a real "set-it-and-forget-it" proposition.

Postgresql implements FT search with the "tsearch2" extension which involves a shared library and stored procedures. It feels a lot like somebodies science project -- you'll need to write your own code to maintain the index, and dump/restores of your database may be an adventure, but tsearch2 is flexible and lets you do really neat things.

Years ago I spent a weekend writing some perl scripts that create an inverted index in mysql tables. It's an inefficient way to do full text searching, but it lets you do things that other search engines don't do, such as similarity searches. I loaded about 200,000 short documents into it on a cheap PC and found I could get interactive responses (<10 seconds) doing some pretty fancy things. I've used this to support a few little research projcts.

There are plenty of other specialized full-text engines, such as Lucene and Xerces, that do a great job, but would require you to do work to maintain the index.

   So far as Google,  here's what I can tell you:

* Google almost certainly is based on a distributed main-memory database. Google keeps it's index in the RAM of a large number of computers... It's too big to fit in the RAM of one machine, so queries get split between several machines.

* Google's most critical 'secret' isn't pagerank, but the use of implicit phrase searching. Older methods of text retrieval don't consider the ordering of words when scoring -- if you want to get results like Google, you really do need to score words higher when they are in proximity to each other, and you need an algorithm that blends this well with other sources of information.

* Google's other 'secret' is that it's a got a huge amount of text to worth with. The real intelligence is in the data it searches, ~not~ in it's algorithms -- Google's algorithms just bring that intelligence out. You have different problems when you do text retrieval at different scales: if you've got 100 documents and expect 2 to be relevant for a query, it's probably not acceptable to have a recall of 50%... For small numbers of documents, you have to work hard to eliminate false negatives. If you've got 10 billion documents, and 1 million are relevant to a particular topic, you don't really care if you lose 90% of them -- but you do care that a few really excellent documents float to the top.

Academic researchers who tried applying algorithms like PageRank to small data sets (1 million documents) couldn't produce evidence that PageRank works -- because it doesn't for small data sets.
_______________________________________________
New York PHP Community Talk Mailing List
http://lists.nyphp.org/mailman/listinfo/talk

NYPHPCon 2006 Presentations Online
http://www.nyphpcon.com

Show Your Participation in New York PHP
http://www.nyphp.org/show_participation.php

Reply via email to