Re: Fulltext indexing libraries (perl/C/C++)

2001-09-14 Thread ryc

Your original message got through the first time, but your email bounced.

I think what you are looking for is called mifluz; it is the indexing
library that htdig uses. The link is http://www.gnu.org/software/mifluz/.

If you develop any kind of bindings that use mifluz to index a MySQL database,
let me know; I would definitely be interested.

ryan

- Original Message - 
From: Christian Jaeger [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, September 14, 2001 12:43 AM
Subject: Fulltext indexing libraries (perl/C/C++)




-
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/   (the list archive)

To request this thread, e-mail [EMAIL PROTECTED]
To unsubscribe, e-mail [EMAIL PROTECTED]
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php




Fulltext indexing libraries (perl/C/C++)

2001-09-13 Thread Christian Jaeger

Hello

[ It seems the post didn't make it through the first time ]

While programming a journal in Perl/AxKit I realized that the problems 
of both creating useful indexes for searching content efficiently and 
parsing user input to create the right SQL queries from it are sooo 
common that there *must* be some good library already. :-) So I 
headed over to CPAN, but didn't really find what I was looking for.

It should create indexes that are efficiently searchable in MySQL, 
i.e. only select ... where ... like 'abcd%' queries, not '%abc%'. 
It should allow searching for word parts (e.g. find "fulltext" when 
entering "text") and searching multiple form fields (e.g. one field for 
title words, one for author names, etc.) at once. Preferably it should 
allow some sort of query rules (AND/NOT/OR or something), do some 
relevance sorting, and allow hooking some numbers (link or access 
counts etc.) into the relevance sorting.
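
One way to get word-part search out of prefix-only LIKE queries is to store every suffix of every word, so that "text" is found as a prefix of the stored fragment "text" from "fulltext". The following is only a minimal sketch of that idea, not a tested design: it uses Python with an in-memory SQLite database standing in for MySQL, and the table name word_fragment, the helper names and the min_len cutoff are all assumptions made for illustration.

```python
import sqlite3

def suffixes(word, min_len=3):
    """Return every suffix of `word` that is at least `min_len` characters long."""
    word = word.lower()
    return [word[i:] for i in range(len(word) - min_len + 1)]

def build_index(conn, docs):
    """Create a (fragment, doc_id) table; `docs` maps doc_id -> text."""
    conn.execute("CREATE TABLE word_fragment (fragment TEXT, doc_id INTEGER)")
    conn.execute("CREATE INDEX idx_fragment ON word_fragment (fragment)")
    for doc_id, text in docs.items():
        for word in set(text.split()):
            conn.executemany(
                "INSERT INTO word_fragment VALUES (?, ?)",
                [(fragment, doc_id) for fragment in suffixes(word)],
            )

def search(conn, term):
    """Find documents containing `term` inside any word, using a prefix LIKE only."""
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM word_fragment "
        "WHERE fragment LIKE ? ORDER BY doc_id",
        (term.lower() + "%",),
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
build_index(conn, {1: "fulltext indexing", 2: "plain text search", 3: "journal"})
print(search(conn, "text"))  # prints [1, 2]
```

The price is index size: a word of n letters produces up to n - min_len + 1 rows, and words shorter than min_len are not indexed at all in this sketch.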

I think there are 3 tough parts:
1. creation of sophisticated index structures (inverted indexes)
2. somehow recognizing sub-word boundaries to split words on. Maybe use 
some form of thesaurus? Or syllables? (I suspect it should be the 
same rules as for hyphenating words at line breaks)
3. user input parser / query creator
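
Parts 1 and 3 can be illustrated together with a small in-memory sketch in Python. It is not how Lucene, mifluz or Search::InvertedIndex work internally; the class and method names are made up, and the query syntax (flat, evaluated left to right, uppercase AND/OR/NOT, no parentheses) is a simplifying assumption.

```python
from collections import defaultdict

class InvertedIndex:
    """Maps (field, word) -> set of ids of the documents containing that word."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_ids = set()

    def add(self, doc_id, fields):
        """Index one document; `fields` maps field name -> text."""
        self.doc_ids.add(doc_id)
        for field, text in fields.items():
            for word in text.lower().split():
                self.postings[(field, word)].add(doc_id)

    def lookup(self, field, word):
        return self.postings.get((field, word.lower()), set())

    def query(self, field, expression):
        """Evaluate a flat query such as 'mysql AND index NOT java' left to right.

        Operators must be uppercase; terms without an operator are ANDed.
        """
        result = None
        operator = "AND"
        for token in expression.split():
            if token in ("AND", "OR", "NOT"):
                operator = token
                continue
            hits = self.lookup(field, token)
            if result is None:
                # First term: a leading NOT means "all documents except these".
                result = self.doc_ids - hits if operator == "NOT" else set(hits)
            elif operator == "AND":
                result &= hits
            elif operator == "OR":
                result |= hits
            else:  # NOT
                result -= hits
            operator = "AND"
        return sorted(result or [])

index = InvertedIndex()
index.add(1, {"title": "MySQL fulltext index", "author": "alice"})
index.add(2, {"title": "Lucene index in Java", "author": "bob"})
index.add(3, {"title": "MySQL manual", "author": "alice"})
print(index.query("title", "mysql AND index"))  # prints [1]
print(index.query("title", "index NOT java"))   # prints [1]
print(index.query("author", "alice OR bob"))    # prints [1, 2, 3]
```

Keeping the field name in the posting key is what makes separate form fields (title words, author names) searchable independently.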

Why not:

- use MySQL's fulltext indexes? Because I think that currently they 
are too limited (see the user comments about them at 
www.mysql.com/doc/). They should be better in MySQL 4, I read, but we 
need it in a few weeks already... And they are also not supported in 
InnoDB, which we want to use.

- use indexing robots? Because we work with XML documents, and would 
like to both keep the index up to date immediately and split 
the XML contents into several parts (e.g. there's a title, byline, 
etc., which should be searchable or weighted differently). We want a 
*library*, not a finished product.
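
To make the "split into parts and weight differently" requirement concrete, here is a rough Python sketch. The element names (article, title, byline, body), the weights and the logarithmic access-count boost are all assumptions chosen for illustration; nothing here is taken from an existing library's scoring.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical per-field weights: a hit in the title counts three times a body hit.
FIELD_WEIGHTS = {"title": 3.0, "byline": 2.0, "body": 1.0}

def extract_fields(xml_text):
    """Split one XML document into lowercase word lists, one list per field."""
    root = ET.fromstring(xml_text)
    return {
        name: (root.findtext(name) or "").lower().split()
        for name in FIELD_WEIGHTS
    }

def score(fields, term, access_count=0):
    """Weighted term frequency, boosted by an external number such as an access count."""
    term = term.lower()
    text_score = sum(
        FIELD_WEIGHTS[name] * words.count(term) for name, words in fields.items()
    )
    if text_score == 0:
        return 0.0
    # log1p keeps very popular documents from drowning out the text match.
    return text_score * (1.0 + math.log1p(access_count))

doc = (
    "<article><title>Fulltext search</title><byline>Christian</byline>"
    "<body>Search with an inverted index</body></article>"
)
fields = extract_fields(doc)
print(score(fields, "search"))  # prints 4.0 (3.0 for the title + 1.0 for the body)
print(score(fields, "index"))   # prints 1.0
```

Because extraction and scoring work on one document at a time, the index entry for a document can be rebuilt right after it is saved, without waiting for a robot to crawl it.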

There's Lucene (www.lucene.com) in Java that I think does exactly 
what I want. Anyone willing to help me port it to Perl or 
C(++) with Perl bindings (-; ? (It should be ready in a few weeks, and 
it's about 500k of source code :-().

(Something in C/C++ that would be loaded as a UDF or so would be nice 
too, but as I understand (from the recent discussion about stored 
procedures) it's not possible, since these UDFs would have to start 
other queries (e.g. to insert each word fragment into an index 
table).)

As Daniel Gardner has pointed out to me, one could maybe use 
Search::InvertedIndex as a basis and complement it with Lingua::Stem 
(English only) or Text::German (German) (both seem to be quite 
imperfect though) or with some word list processing. (I don't 
understand Search::InvertedIndex well enough yet.) I think it would still 
be a lot of work.
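
The "word list processing" idea could look roughly like the following Python sketch, which splits a compound word into parts found in a known vocabulary. It is a naive illustration, not the algorithm of Lingua::Stem or Text::German; the function name, the min_part cutoff and the tiny vocabularies are assumptions.

```python
def split_compound(word, vocabulary, min_part=3):
    """Split `word` into parts from `vocabulary`; return [word] if no full split exists."""
    word = word.lower()

    def helper(rest):
        if not rest:
            return []
        # Try the longest candidate head first, then backtrack to shorter ones.
        for end in range(len(rest), min_part - 1, -1):
            head = rest[:end]
            if head in vocabulary:
                tail = helper(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    parts = helper(word)
    return parts if parts else [word]

print(split_compound("fulltext", {"full", "text"}))
# prints ['full', 'text']
print(split_compound("Volltextsuche", {"voll", "text", "suche"}))
# prints ['voll', 'text', 'suche']
```

Each part could then be indexed next to the whole word, so a search for "text" finds "fulltext" without a '%text%' query. The quality depends entirely on the word list, and the backtracking can get slow for long words with many overlapping entries.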


Has anyone finished something like this? Any more info about MySQL 4?

Thx
Christian.
