Hello
[ It seems the post didn't make it through the first time ]
While programming a journal in Perl/AxKit I realized that the two
problems of creating indexes for efficient content search, and of
parsing user input into the right SQL queries, are so common that
there *must* be a good library already. :-) So I headed over to CPAN,
but didn't really find what I was looking for.
The library should:
- create indexes that are efficiently searchable in MySQL, i.e. using
  only <select ... where .. like "abcd%"> queries, not "%abc%";
- allow searching for word parts (e.g. find "fulltext" when entering
  "text");
- allow searching multiple form fields at once (e.g. one field for
  title words, one for author names, etc.);
- preferably allow some sort of query rules (AND/NOT/OR or something);
- preferably do some relevance sorting;
- preferably allow hooking some numbers (link or access counts etc.)
  into the relevance sorting.
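One possible way to reconcile "prefix-only LIKE" with "find word
parts" is to store every suffix of each word in the index, so that a
word part is always the prefix of some stored fragment. The following
is only a rough sketch of that idea, not an existing library: it uses
Python with SQLite as a stand-in for MySQL, and the table name
`fragments`, its columns, and the helper names are all made up for
illustration.

```python
import re
import sqlite3


def word_suffixes(word, min_len=3):
    """Return every suffix of `word` with at least `min_len` characters.

    Words shorter than `min_len` are returned whole.
    """
    word = word.lower()
    return [word[i:] for i in range(len(word) - min_len + 1)] or [word]


def build_index(conn, docs):
    """Fill a (hypothetical) `fragments` table from {doc_id: {field: text}}."""
    conn.execute(
        "CREATE TABLE fragments (fragment TEXT, field TEXT, doc_id INTEGER)"
    )
    # The index on `fragment` is what lets the database answer
    # LIKE 'abc%' without scanning the whole table.
    conn.execute("CREATE INDEX fragments_idx ON fragments (fragment, field)")
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for word in re.findall(r"\w+", text):
                conn.executemany(
                    "INSERT INTO fragments VALUES (?, ?, ?)",
                    [(s, field, doc_id) for s in word_suffixes(word)],
                )


def search(conn, field, part):
    """Return ids of documents whose `field` contains a word including `part`.

    Only a prefix pattern ('part%') is used. Escaping of '%' and '_'
    in the user input is omitted in this sketch.
    """
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM fragments "
        "WHERE field = ? AND fragment LIKE ? ORDER BY doc_id",
        (field, part.lower() + "%"),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn, {1: {"title": "Fulltext search", "author": "Christian"}})
    print(search(conn, "title", "text"))
```

The price is index size: a word of n letters produces up to n - 2
rows with the default `min_len`, so this trades disk space for the
ability to use an ordinary index.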
I think three tough parts are needed:
1. creating sophisticated index structures (inverted indexes)
2. somehow recognizing sub-word boundaries to split words on. Maybe
use some form of thesaurus? Or syllables? (I suspect the rules should
be the same as for hyphenating words at line ends.)
3. a user input parser / query creator
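To make part 3 a bit more concrete, here is a minimal sketch of a
query creator in Python. Everything in it is an assumption made for
illustration: the grammar (whitespace-separated terms, the keywords
AND/OR/NOT, implicit AND, no parentheses), the hypothetical tables
`documents(id)` and `fragments(fragment, field, doc_id)`, and the
function name. User input only ever ends up in bound parameters,
never in the SQL text itself.

```python
# One sub-select per search term; {neg} becomes "NOT " for negated terms.
TERM_SQL = (
    "id {neg}IN (SELECT doc_id FROM fragments "
    "WHERE field = ? AND fragment LIKE ?)"
)


def build_query(field, user_input):
    """Translate e.g. 'text index OR search NOT java' into (sql, params).

    AND binds tighter than OR; NOT applies to the next term only.
    Returns (None, []) if the input contains no search terms.
    """
    or_groups = [[]]  # each inner list is AND-ed, the groups are OR-ed
    params = []
    negate = False
    for token in user_input.split():
        keyword = token.upper()
        if keyword == "AND":
            continue  # AND is already the default between terms
        if keyword == "OR":
            if or_groups[-1]:
                or_groups.append([])
            continue
        if keyword == "NOT":
            negate = True
            continue
        or_groups[-1].append(TERM_SQL.format(neg="NOT " if negate else ""))
        params += [field, token.lower() + "%"]  # prefix match only
        negate = False
    groups = [" AND ".join(group) for group in or_groups if group]
    if not groups:
        return None, []
    where = " OR ".join("(" + group + ")" for group in groups)
    return "SELECT id FROM documents WHERE " + where, params


if __name__ == "__main__":
    sql, params = build_query("title", "text NOT java")
    print(sql)
    print(params)
```

Relevance sorting is deliberately left out here; it would need an
extra score column or a GROUP BY over the matched fragments.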
Why not:
- use MySQL's fulltext indexes? Because I think they are currently
too limited (see e.g. the user comments about them at
www.mysql.com/doc/). They should be better in MySQL 4, I read, but we
need this in a few weeks already. They are also not supported in
InnoDB, which we want to use.
- use indexing robots? Because we work with XML documents, and would
like both to keep the index up to date immediately and to split the
XML contents into several parts (i.e. there's a title, byline, etc.,
which should be searchable or weighted differently). We want a
*library*, not a finished product.
There's Lucene (www.lucene.com) in Java, which I think does exactly
what I want. Would anyone help me port it to Perl, or to C(++) with
Perl bindings (-; ? (It should be ready in a few weeks, and it's
about 500k of source code :-().
(Something in C/C++ that would be loaded as a UDF or so would be nice
too, but as I understand it (from the recent discussion about stored
procedures) that's not possible, since such UDFs would have to start
other queries (e.g. to insert each word fragment into an index
table).)
As Daniel Gardner has pointed out to me, one could maybe use
Search::InvertedIndex as a basis and complement it with Lingua::Stem
(English only) or Text::German (German) (both seem to be quite
imperfect, though) or with some word list processing. (I don't
understand Search::InvertedIndex well enough yet.) I think it would
still be a lot of work.
Has anyone already built something like this? Any more info about
MySQL 4?
Thx
Christian.