PHP isn't totally bad for a search engine. Here's my story.

I was in a bit of a predicament when I first started work, because I had to 
develop a search engine for online video objects. My company is essentially 
a video re-purposing venture, where we take reams of analog, tape-based 
videos, encode them into something like MPEG or ASF or whatever, create 
clip indexes (e.g. a 30-minute video is broken up into ten 3-minute clips, 
with each clip described and categorized) and provide a search engine and 
interface to watch clips or videos on the web via a broadband connection. 

(An added bonus is that you can create your own "video" via personal 
playlists -- you can take 10 different clips from 10 different videos and 
run them together into one playlist, all online. In a few months, you'll be 
able to create your own clips if you don't like our predefined ones.)

Anyways, the search engine thing was my deal. I'm the only programmer 
(*period*) on our team, and I basically had to write a search engine, web 
site backend, admin interface and all that jazz for our app alone. I was 
hired March 6, 2001 or so, and I had until, oh, April 15, 2001 to do it. 
Plus there were a few conditions -- like, it should be portable and 
inexpensive.

PHP seemed like a good choice -- it was portable (Win32, *ix, whatever), it 
was cheap ($0) and it isn't too bad for rapid development.

So off I went. I did manage to finish the search engine and back end by 
April 15, but it was a mess. It wasn't exactly a stellar search engine, but 
more of a proof of concept, which was the whole point of the project -- to 
show that we could provide high quality streaming video through a browser 
with a relatively good interface.

After the proof-of-concept project, we started to get serious, and I 
dropped most of the code base and started again from scratch. 

Pretty much the entire search engine now is in PHP, with the sole 
exceptions being the keyword indexer (Perl, as PHP was a lot slower doing 
the indexing) and a few extensions to the PHP engine. 

The search engine itself is fairly fast -- it can do a keyword search on a 
collection of nearly 8,000 video objects in an average of 0.02 to 0.20 
seconds or so, depending on the complexity of the query. Its features 
include:


* "Boolean"-type searches. Okay, not really, as in you can't use AND, OR 
and NOT, but you can use +/- style prefixes like in mnoGoSearch and 
whatnot. Words are automatically OR'ed, +'d words are AND'ed and -'d words 
are NOT'ed.

* Decent search times. On a PIII 500 with 128 MB of RAM, it still averages 
less than 0.20 seconds for a collection of 8,000 video objects and over 
100,000 keywords.

* Filtering. We're mostly an education-based site, so you can filter by 
things like subject (Physics, Chemistry, etc.) and grades (Kindergarten, 
grade 10, college, etc.)

* Spellchecking and somewhat fuzzy searches. Spellchecks work okay, but the 
fuzzy searches are kind of lame (just Porter stemming). I might shoehorn in 
something like Metaphone-type stuff eventually. 

* Search ranking. Yes, keywords are given weights, everything is ranked and 
all that jazz. You know, inverse document frequencies, collection 
distribution, all that stuff. In the end, video objects returned in a 
search are given a ranking of 1 to 4 based on how well they match your 
query. It's not terribly advanced, and could use some tuning, but it's 
surprising how well it works.

* XML-based. The search engine itself runs as its own daemon, either on 
its own server or alongside the web site, and just waits for connections 
via a UNIX domain socket or a TCP socket. When it receives a query, it 
sends back an XML document containing the search results. This is 
especially nice -- you can use it with anything for any purpose, not just a 
web site, e.g. you can build a native app for Windows and still use the 
search engine, and just format the results via XSL or whatever.
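
To make the XML-based design a bit more concrete, here is a rough sketch of the query-in, XML-out exchange. It is written in Python for brevity and is not the engine's actual PHP code. The element and attribute names (results, clip, rank, title), the one-line query format and the function names are all assumptions made up for illustration; the real protocol and schema aren't described in this post.

```python
import socket
import xml.etree.ElementTree as ET


def build_response(query, hits):
    """Serialize search hits as an XML document (hypothetical schema)."""
    root = ET.Element("results", {"query": query, "count": str(len(hits))})
    for hit in hits:
        clip = ET.SubElement(
            root, "clip", {"id": str(hit["id"]), "rank": str(hit["rank"])}
        )
        ET.SubElement(clip, "title").text = hit["title"]
    return ET.tostring(root, encoding="unicode")


def parse_response(xml_text):
    """Client side: turn the XML document back into a list of dicts."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": int(clip.get("id")),
            "rank": int(clip.get("rank")),
            "title": clip.findtext("title"),
        }
        for clip in root.findall("clip")
    ]


def serve_one(conn, search_fn):
    """Daemon side: read one query line, reply with XML, end the reply."""
    query = conn.recv(4096).decode("utf-8").strip()
    conn.sendall(build_response(query, search_fn(query)).encode("utf-8"))
    conn.shutdown(socket.SHUT_WR)  # signal end of response to the client


def read_all(conn):
    """Client side: read until the daemon signals end of response."""
    chunks = []
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks).decode("utf-8")
```

Because the reply is plain XML over a socket, the same parse_response-style client logic could live in a web page, a desktop app or a script; only the presentation layer changes.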


There are a lot of other nifty features, like being able to do remote admin 
via telnet or whatever. But in the end, it's still just a decent search 
engine and definitely not Google or even htdig. It's very focused on our 
specific task, the searching of online educational videos, so using 
something like htdig would have required a lot of hacking to get it 
to where we wanted it.
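
As an illustration of the +/- prefixes and the 1-to-4 ranking from the feature list, here is a minimal sketch in Python (not the actual PHP/Perl code). It assumes a plain inverted index and a simple inverse-document-frequency weight; the real engine's weighting, stemming and spellchecking are not shown, and all function names here are hypothetical.

```python
import math
from collections import defaultdict


def parse_query(text):
    """Split a query into optional, required (+word) and excluded (-word) terms."""
    optional, required, excluded = [], [], []
    for token in text.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return optional, required, excluded


def build_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(text, docs, index):
    """Return (doc_id, rank) pairs, best first, with rank between 1 and 4."""
    optional, required, excluded = parse_query(text)
    positive = optional + required

    candidates = set()
    for term in positive:            # plain and +'d words widen the set (OR)
        candidates |= index.get(term, set())
    for term in required:            # +'d words must all be present (AND)
        candidates &= index.get(term, set())
    for term in excluded:            # -'d words knock documents out (NOT)
        candidates -= index.get(term, set())

    total = len(docs)
    scores = {}
    for doc_id in candidates:
        score = 0.0
        for term in positive:
            postings = index.get(term, set())
            if doc_id in postings:
                # Rarer keywords weigh more (assumed IDF-style weight).
                score += math.log(1 + total / len(postings))
        scores[doc_id] = score
    if not scores:
        return []

    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # Scale each score against the best match into a bucket from 1 to 4.
    return [(doc_id, max(1, math.ceil(4 * s / best))) for doc_id, s in ranked]
```

With this sketch, a query like "newton +physics -apple" keeps only documents that contain "physics", drops any that mention "apple", and ranks whatever is left by how rare the matched keywords are across the collection.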

So the moral I guess is, sure, you can make a half-decent search engine 
out of PHP. Ours gets the job done. But remember, I only had, like, a scant 
few months to write one, plus a web-based app to go around it, and I was 
alone on this one. PHP was great for RAD, and the damn thing even works to 
boot. My search engine could handle a web site easily enough, maybe even a 
group of sites, but it would totally suck ass as a WWW indexer/spider-type 
search engine. 

So there ya go.

J



Greg Schnippel wrote:

> 
>> * On 15-01-02 at 12:09
>> * Yogesh Mahadnac said....
>> 
>>>     Hi all! I want to develop a search engine in PHP for a
>>> portal that I'm working on at the moment, and I'd be glad if
>>> someone could please show me how to do it, or if anyone knows
>>> of a link where i can find a tutorial for that.
>>
>> I don't think PHP is really a very good language for a genuine www
>> search engine. (although it works very well on site-wide basis)
>> I'm sure more knowledgeable people than I can make some alternative
>> suggestions but I'm certain that PHP won't be the best tool
>> for the job.
> 
> I would concur with what everyone else is saying. If you need a search
> engine and you have system-level access on your machine, your best
> bet is to set up either htdig or mnogosearch (open source search
> engine packages) because they already have done the hard work of
> figuring out fuzzy matching and search ranking.
> 
> http://www.htdig.org/
> http://mnogosearch.org/
> 
> Alternatively, if you are using a database you can use some tricky sql
> statements to search your records for the user's search query. Here's
> a good tutorial that should get you started on this route:
> 
> http://www.devshed.com/Server_Side/PHP/Search_Engine/page1.html
> 
> 
> -schnippy


-- 
PHP General Mailing List (http://www.php.net/)