Fulltext indexing libraries (perl/C/C++)

Christian Jaeger Thu, 13 Sep 2001 21:18:45 -0700
Hello

[ It seems the post didn't make it through the first time ]

While programming a journal in perl/axkit I realized that two 
problems, building useful indexes for searching content efficiently 
and parsing user input to create the right SQL queries from it, are 
so common that there *must* be some good library already. :-) So I 
headed over to CPAN, but didn't really find what I was looking for.

It should create indexes that are efficiently searchable in mysql, 
i.e. using only <select ... where .. like "abcd%"> queries, not 
"%abc%". It should allow searching for word parts (e.g. find 
"fulltext" when entering "text") and searching multiple form fields 
at once (e.g. one field for title words, one for author names, etc.). 
Preferably it should support some sort of query rules (AND/NOT/OR or 
something), do some relevance sorting, and allow hooking some numbers 
(link or access counts etc.) into the relevance sorting.
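To illustrate the word-part requirement, here is a minimal sketch in Python, not a finished design. It stores every suffix of each word as a separate index row, so that a prefix-only LIKE query for "text%" also finds "fulltext". SQLite stands in for mysql here only so the example runs on its own; the table name `fragment`, the helper names and the minimum fragment length of 3 are all assumptions made for illustration.

```python
import sqlite3

def suffixes(word, min_len=3):
    """Return every suffix of `word` with at least `min_len` characters."""
    word = word.lower()
    return [word[i:] for i in range(len(word) - min_len + 1)] or [word]

def build_index(conn, docs):
    """Store one row per (suffix, word, document) in a hypothetical table."""
    conn.execute("CREATE TABLE fragment (frag TEXT, word TEXT, doc_id INTEGER)")
    conn.execute("CREATE INDEX fragment_frag ON fragment (frag)")
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            conn.executemany(
                "INSERT INTO fragment VALUES (?, ?, ?)",
                [(frag, word, doc_id) for frag in suffixes(word)],
            )

def search(conn, term):
    """Find documents with a prefix-only LIKE query (never '%abc%')."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = (term.lower().replace("\\", "\\\\")
               .replace("%", "\\%").replace("_", "\\_"))
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM fragment "
        "WHERE frag LIKE ? ESCAPE '\\' ORDER BY doc_id",
        (escaped + "%",),
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
build_index(conn, {
    1: "fulltext indexing",
    2: "plain text search",
    3: "something else",
})
print(search(conn, "text"))  # documents containing "fulltext" or "text"
```

The obvious cost is index size: a word of n characters produces up to n - 2 rows, which is why splitting on real sub-word boundaries would be the nicer solution.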

I think there are 3 tough parts which are needed:
1. creation of sophisticated index structures (inverted indexes)
2. recognition of sub-word boundaries to split words on. Maybe use 
some form of thesaurus? Or syllables? (I suspect the rules should be 
the same as those for hyphenating words at line breaks)
3. user input parser / query creator
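Parts 1 and 3 can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the inverted index is a plain in-memory dictionary rather than mysql tables, the query is evaluated directly instead of being translated to SQL, operators are applied strictly left to right with no precedence or parentheses, and adjacent words are combined with AND by default. The function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def evaluate(query, index, all_ids):
    """Evaluate a query with AND/OR/NOT keywords, left to right."""
    result = None
    op = "AND"       # operator to apply to the next term
    negate = False   # set by a preceding NOT
    for token in query.split():
        upper = token.upper()
        if upper in ("AND", "OR"):
            op = upper
            continue
        if upper == "NOT":
            negate = True
            continue
        matches = index.get(token.lower(), set())
        if negate:
            matches = all_ids - matches
            negate = False
        if result is None:
            result = set(matches)
        elif op == "AND":
            result &= matches
        else:
            result |= matches
        op = "AND"   # fall back to the default operator
    return sorted(result or set())

docs = {
    1: "mysql fulltext index",
    2: "perl inverted index",
    3: "perl mysql bindings",
}
index = build_inverted_index(docs)
print(evaluate("index NOT perl", index, set(docs)))
```

A real query creator would emit SQL against the index tables instead, but the set operations map directly onto joins and `NOT IN` subqueries.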

Why not:

- use mysql's fulltext indexes? Because I think that currently they 
are too limited (see the user comments about them at 
www.mysql.com/doc/). They should be better in mysql-4, I read, but we 
need it in a few weeks already... And they are also not supported in 
InnoDB, which we want to use.

- use indexing robots? Because we work with XML documents, and would 
like both to keep the index up to date immediately and to split the 
XML contents into several parts (e.g. there's a title, byline, etc., 
which should be searchable or weighted differently). We want a 
*library*, not a finished product.
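As a rough sketch of what splitting the XML into differently weighted parts could look like, the following Python indexes each document as soon as it is saved and scores a word by the field it appears in. The element names `title` and `byline` come from the description above; the `body` element, the weight values and all function names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed weights: a hit in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "byline": 2.0, "body": 1.0}

def new_index():
    """word -> {doc_id -> accumulated score}"""
    return defaultdict(lambda: defaultdict(float))

def index_document(index, doc_id, xml_text):
    """Add one XML document to the index, field by field."""
    root = ET.fromstring(xml_text)
    for field, weight in FIELD_WEIGHTS.items():
        for elem in root.iter(field):
            text = "".join(elem.itertext())
            for word in text.lower().split():
                index[word][doc_id] += weight

def ranked(index, word):
    """Return document ids for `word`, highest score first."""
    scores = index.get(word.lower(), {})
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

index = new_index()
index_document(index, 1,
    "<article><title>Fulltext search</title><byline>Jaeger</byline>"
    "<body>about indexes</body></article>")
index_document(index, 2,
    "<article><title>Indexes</title><byline>Someone</byline>"
    "<body>fulltext fulltext notes</body></article>")
print(ranked(index, "fulltext"))
```

External numbers such as link or access counts could be folded into the same score, for example by adding a per-document bonus before sorting.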

There's Lucene (www.lucene.com) in Java that I think does exactly 
what I want. Would anyone help me port that to perl or to C(++) with 
perl bindings (-; ? (It should be ready in a few weeks, and it's 
about 500k of source code :-().

(Something in C/C++ that would be loaded as a UDF or so would be nice 
too, but as I understand it (from the recent discussion about stored 
procedures) that's not possible, since these UDFs would have to start 
other queries (e.g. to insert each word fragment into an index 
table).)

As Daniel Gardner has pointed out to me, one could maybe use 
Search::InvertedIndex as a basis and complement it with Lingua::Stem 
(English only) or Text::German (German) (both seem to be quite 
imperfect, though) or with some word list processing. (I don't 
understand Search::InvertedIndex well enough yet.) I think it would 
still be a lot of work.


Has anyone already built something like this? Does anyone have more 
info about mysql 4?

Thx
Christian.
