I am working on a mysql-based search facility. In order to search for connected phrases, it would be very convenient to have a "subtable" structure to express a set of rows in which only one small column changes.
If I didn't have to worry about space, I might have a table like this: create table windex ( wordnum int unsigned not null, docnum int unsigned not null, pos smallint unsigned not null, [various keys]); Then to search for a phrase like "pork and beans", I would have a query like: select docnum from windex as a,windex as b,windex as c where a.wordnum = 7 and b.wordnum = 10 and c.wordnum = 20 and a.docnum = b.docnum and b.docnum = c.docnum and c.pos = a.pos+2 and c.pos = b.pos+1 (This is assuming that "pork" = 7, "and" = 10, and "beans" = 20.) Note that this select statement illustrates a nuisance of the syntax for multiple joins. AFAIK, I have to say x = y and y = z, when I would like to say x = y = z. Or if it is not possible to interpret such a phrase Python-style, I would like to have a function like equal(x,y,z). Anyway, the syntax of this query is not the main point. The real problem is that the above table requires ten bytes for each occurrence of each word in each document, plus significant extra B-tree structure. Not only would it require more disk space than I will probably have, it would also make searches highly disk-intensive. So instead I plan to have a table like this: create table windex ( wordnum int unsigned not null, docnum int unsigned not null, positions blob not null, [various keys]); Now I can encode the positions of each word in each document as two bytes per occurrence. The drawback is that to search for "pork and beans", I have to retrieve all occurrences of pork AND and AND beans, and then inspect each document for a position match. This is not ideal, but it looks like the best option at the moment. The ideal solution would be an intelligent record structure that omits wordnum and docnum when they repeat. I realize that this looks a lot like a MyISAM compressed table. But I can't use that because my search index will change incrementally every day. I also see that MySQL has its own full-text search function. Well, my project has some special requirements in regard to both search semantics and scalability. I don't think that I could make do with just a simple wrapper for MySQL's full-text search. Maybe I could exploit the full-text search function for the narrow problem of finding a connected phrase. It would be nice to know the data structures and algorithms for the full-text search facility. -- /\ Greg Kuperberg (UC Davis) / \ \ / Visit the Math ArXiv Front at http://front.math.ucdavis.edu/ \/ * All the math that's fit to e-print * -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]