Re: [nyphp-talk] SQL Full text searcing and storing

Paul Houle Sat, 14 Apr 2007 21:10:12 -0700

Ben Sgro (ProjectSkyline) wrote:

I'm curious what field types and indexes are best for this type of application. I'm also curious about benchmarks. Google can return a huge number of records in a fraction of a second. Is this a forked process, each doing small amounts of work, or one large beefy server doing the transaction? My DB experience is mostly MySQL and I would prefer to build using this.

Most popular relational databases have a full-text search extension. This includes MySQL, Oracle, MS SQL and Postgres -- unfortunately, these implementations do not correspond to any standard, so the details are different for different databases. MySQL has a full-text index that works quite well, with the caveat that it only works for MyISAM at the moment:


   http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

This means you can have full-text indexes or transactions, not both. Assuming you have a MyISAM table, implementing full-text search is as simple as creating a FULLTEXT index and running queries with MATCH ... AGAINST. You could be benchmarking MySQL with your own data in less than half an hour (unless it takes more than that to rebuild your index!). The full-text index will be updated automatically when you modify the table, so it's a real "set-it-and-forget-it" proposition.
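
As a rough sketch of those two steps (create a FULLTEXT index, then query with MATCH ... AGAINST), the snippet below keeps the MySQL statements as Python strings. The table name `articles`, its columns, and the helper `build_search_query` are made up for illustration; only `FULLTEXT`, `MATCH ... AGAINST` and `ENGINE=MyISAM` are MySQL's own syntax. The `%s` placeholders assume a DB-API driver that uses that parameter style.

```python
# Hypothetical schema: the table and column names are assumptions.
CREATE_TABLE_SQL = """
CREATE TABLE articles (
    id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200),
    body  TEXT,
    FULLTEXT (title, body)
) ENGINE=MyISAM
"""

# MATCH appears twice: once to compute the relevance score,
# once to filter out rows that do not match at all.
SEARCH_SQL = """
SELECT id, title, MATCH (title, body) AGAINST (%s) AS score
FROM articles
WHERE MATCH (title, body) AGAINST (%s)
ORDER BY score DESC
LIMIT %s
"""


def build_search_query(terms, limit=10):
    """Return (sql, params) for a natural-language full-text search.

    The search terms travel as parameters rather than being pasted
    into the SQL text, so the driver takes care of quoting.
    """
    return SEARCH_SQL, (terms, terms, limit)


sql, params = build_search_query("full text search")
# With an open DB-API cursor this would run as: cursor.execute(sql, params)
```

The column list inside MATCH has to be the same list the FULLTEXT index was declared on, which is why both use `(title, body)` here.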

PostgreSQL implements FT search with the "tsearch2" extension, which involves a shared library and stored procedures. It feels a lot like somebody's science project -- you'll need to write your own code to maintain the index, and dump/restores of your database may be an adventure, but tsearch2 is flexible and lets you do really neat things.

Years ago I spent a weekend writing some Perl scripts that create an inverted index in MySQL tables. It's an inefficient way to do full-text searching, but it lets you do things that other search engines don't do, such as similarity searches. I loaded about 200,000 short documents into it on a cheap PC and found I could get interactive responses (<10 seconds) doing some pretty fancy things. I've used this to support a few little research projects.
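
To give a feel for the idea, here is a minimal sketch of an inverted index kept in an ordinary SQL table, with a similarity search on top. It is not the original Perl code: it uses Python with an in-memory SQLite database standing in for MySQL, and the table name `postings`, the tokenizer, and the scoring (a dot product of shared term frequencies) are all assumptions chosen for brevity.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Store one (term, doc_id, tf) row per distinct term in each document."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER, tf INTEGER)")
    conn.execute("CREATE INDEX postings_term ON postings (term)")
    for doc_id, text in docs.items():
        counts = {}
        for term in tokenize(text):
            counts[term] = counts.get(term, 0) + 1
        conn.executemany(
            "INSERT INTO postings (term, doc_id, tf) VALUES (?, ?, ?)",
            [(term, doc_id, tf) for term, tf in counts.items()],
        )
    return conn


def similar_docs(conn, doc_id, limit=5):
    """Rank other documents by the dot product of shared term frequencies."""
    rows = conn.execute(
        """
        SELECT b.doc_id, SUM(a.tf * b.tf) AS score
        FROM postings AS a
        JOIN postings AS b ON a.term = b.term
        WHERE a.doc_id = ? AND b.doc_id <> ?
        GROUP BY b.doc_id
        ORDER BY score DESC, b.doc_id ASC
        LIMIT ?
        """,
        (doc_id, doc_id, limit),
    )
    return rows.fetchall()


docs = {
    1: "mysql full text search with a fulltext index",
    2: "full text search in postgresql with tsearch2",
    3: "baking bread at home",
}
conn = build_index(docs)
print(similar_docs(conn, 1))  # prints [(2, 4)]
```

The self-join on `term` is what makes the similarity search possible: any two documents that share a term meet in the join, which a black-box full-text index does not expose.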

There are plenty of other specialized full-text engines, such as Lucene and Xapian, that do a great job, but would require you to do work to maintain the index.


As far as Google goes, here's what I can tell you:

* Google almost certainly is based on a distributed main-memory database. Google keeps its index in the RAM of a large number of computers... It's too big to fit in the RAM of one machine, so queries get split between several machines.

* Google's most critical 'secret' isn't PageRank, but the use of implicit phrase searching. Older methods of text retrieval don't consider the ordering of words when scoring -- if you want to get results like Google, you really do need to score words higher when they are in proximity to each other, and you need an algorithm that blends this well with other sources of information.

* Google's other 'secret' is that it's got a huge amount of text to work with. The real intelligence is in the data it searches, ~not~ in its algorithms -- Google's algorithms just bring that intelligence out. You have different problems when you do text retrieval at different scales: if you've got 100 documents and expect 2 to be relevant for a query, it's probably not acceptable to have a recall of 50%... For small numbers of documents, you have to work hard to eliminate false negatives. If you've got 10 billion documents, and 1 million are relevant to a particular topic, you don't really care if you lose 90% of them -- but you do care that a few really excellent documents float to the top.
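
The proximity point in the second bullet can be illustrated with a toy scorer. This is only a sketch of the general idea, not Google's method: the function name `proximity_score`, the `proximity_weight` parameter, and the 1/gap bonus are assumptions, and for simplicity the bonus looks at the distance between adjacent query terms without checking their order.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def proximity_score(query, document, proximity_weight=1.0):
    """Score a document by term matches plus a bonus for nearby query terms."""
    q_terms = tokenize(query)

    # Record every position at which each word occurs in the document.
    positions = {}
    for pos, word in enumerate(tokenize(document)):
        positions.setdefault(word, []).append(pos)

    # Base score: how many distinct query terms appear at all.
    score = float(sum(1 for term in set(q_terms) if term in positions))

    # Bonus: for each adjacent pair of query terms, reward a small gap.
    for first, second in zip(q_terms, q_terms[1:]):
        if first == second or first not in positions or second not in positions:
            continue
        gap = min(
            abs(p2 - p1) for p1 in positions[first] for p2 in positions[second]
        )
        score += proximity_weight / gap
    return score


near = "the new york php list talks about mysql"
far = "a new list about php started in york"
print(proximity_score("new york", near) > proximity_score("new york", far))
```

Both documents contain both query words, so a scorer that only counts matches would tie them; the proximity bonus is what lifts the first one.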

Academic researchers who tried applying algorithms like PageRank to small data sets (1 million documents) couldn't produce evidence that PageRank works -- because it doesn't for small data sets.
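
For readers who haven't seen it, this is roughly what a PageRank-style computation looks like: the textbook power-iteration form, not Google's production code. The damping factor of 0.85, the fixed iteration count, and the tiny link graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # prints c
```

With only four pages the ranking says little beyond "the page with the most inbound links wins", which fits the point above: the link structure only carries useful signal when there is a lot of it.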

