Space efficiency for values repeated across rows

Greg Kuperberg Fri, 24 Oct 2003 10:31:19 -0700

I am working on a mysql-based search facility.  In order to search for
connected phrases, it would be very convenient to have a "subtable"
structure to express a set of rows in which only one small column changes.


If I didn't have to worry about space, I might have a table like this:

create table windex (
    wordnum int unsigned not null,
    docnum int unsigned not null,
    pos smallint unsigned not null,
    [various keys]);

Then to search for a phrase like "pork and beans", I would have a
query like:

select docnum from windex as a,windex as b,windex as c
    where a.wordnum = 7 and b.wordnum = 10 and c.wordnum = 20
    and a.docnum = b.docnum and b.docnum = c.docnum
    and c.pos = a.pos+2 and c.pos = b.pos+1

(This is assuming that "pork" = 7, "and" = 10, and "beans" = 20.) Note
that this select statement illustrates a nuisance of the syntax for
multiple joins.  AFAIK, I have to say x = y and y = z, when I would like
to say x = y = z.  Or if it is not possible to interpret such a phrase
Python-style, I would like to have a function like equal(x,y,z).

Anyway, the syntax of this query is not the main point.  The real
problem is that the above table requires ten bytes for each occurrence
of each word in each document, plus significant extra B-tree structure.
Not only would it require more disk space than I will probably have,
it would also make searches highly disk-intensive. So instead I plan to
have a table like this:

create table windex (
    wordnum int unsigned not null,
    docnum int unsigned not null,
    positions blob not null,
    [various keys]);

Now I can encode the positions of each word in each document as two
bytes per occurrence.  The drawback is that to search for "pork and
beans", I have to retrieve all occurrences of pork AND and AND beans,
and then inspect each document for a position match.  This is not ideal,
but it looks like the best option at the moment.

The ideal solution would be an intelligent record structure that omits
wordnum and docnum when they repeat.  I realize that this looks a lot
like a MyISAM compressed table.  But I can't use that because my search
index will change incrementally every day.

I also see that MySQL has its own full-text search function. Well, my
project has some special requirements in regard to both search semantics
and scalability. I don't think that I could make do with just a simple
wrapper for MySQL's full-text search. Maybe I could exploit the full-text
search function for the narrow problem of finding a connected phrase.
It would be nice to know the data structures and algorithms for
the full-text search facility.

-- 
  /\  Greg Kuperberg (UC Davis)
 /  \
 \  / Visit the Math ArXiv Front at http://front.math.ucdavis.edu/
  \/  * All the math that's fit to e-print *

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Space efficiency for values repeated across rows

Reply via email to