Re: Slow results with simple, well-indexed query

Jesse Sheidlower Fri, 22 Aug 2003 11:42:36 -0700

On Fri, Aug 22, 2003 at 09:03:55AM -0700, Steven Roussey wrote:
> > All the indexes were single indexes, partly because I haven't
> > yet made the effort to understand composite index. I guess it's
> > time ;-).
> 
> Oh.
> 
> There are better places to start than this list. ;) The manual can be a
> great starting place, and several people on this list have written books
> about MySQL which are great for getting started. You can look at
> Amazon.com, etc.


No, I have them all, I just...I guess haven't gotten around to
them yet. I mistakenly figured that individually indexing
everything in sight would do the trick.

> > mysql> EXPLAIN SELECT cg.cw FROM cg,q,cit,sref
> >     -> WHERE cg.q_id = q.id AND q.cit_id = cit.id AND cit.sref_id =
> sref.id
> >     -> AND cg.cw LIKE 't%'
> >     -> AND sref.cd > DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
> >     -> ORDER BY cg.cw
> >     -> LIMIT 1000,10;
> 
> You do a range on the fist and last table in the chain. :( 
> 
> It is best to normalize your table structure. I should have noticed that
> upfront. The 'simple, well-indexed query' through me off. Normalizing is
> a great thing to learn, and probably the first thing to understand after
> how to do a SELECT and composite indexes.

I wonder if anyone else is still reading :-/.

While I don't claim to be an expert by any means, and while it may not
look like it, this database _is_ highly normalized, or at least I think
it is; I put a lot of effort into making it so. Let me just explain what
it all does.

The database holds lexical data for use in linguistic analysis. For the
tables we're looking at now, it works like this:

sref    contains the bibliographic details for a single physical book
        or other text, along with some metadata about when the text
        was keyed, who read it, etc.

cit     contains the bibliographic details for a quotation taken from
        a text; there might be only one cit per sref if these details
        don't change (e.g. a novel), or there might be many cits per
        sref if they do (e.g. a collection of essays, each one written
        by a different author at a different date).

q       contains a single quotation.

cg      contains information about specific words.

I believe this is normalized because each element appears only once and
is linked to others as necessary, so that, for example, a single quotation
reading "This is a foobarred, conformiferous structiform" is stored once
in the database, and "foobarred", "conformiferous", and "structiform"
are each stored independently, joined to the q table on the relevant
id fields. So if you had the word "structiform" and wanted to see the
quotation, you could retrive it by the q_id that's stored in cg. It's
the same for the other elements; there's only one sref section per 
text, so if you're at any other point and want to get the date the
text was read, you have to join your way up to sref to retrieve 
sref.cd.

Though I haven't shown them yet, there are also some other tables,
holding authors at the sref level, subjects at the sref level, and
authors at the cit level; for any of these, there can be any number
of elements (i.e. no subjects, one subject, ten subjects etc.), so
I normalized them by putting them in to separate tables linked
to sref.id or cit.id etc. as appropriate.

So for the queries we've been discussing, I'm looking for all the
words beginning with "t" that have been keyed in the last six months;
the reason I'm getting from the first and last table in the chain 
should now be obvious. In reality, I wouldn't just be retrieving
the word (the cg.cw), I'd be retrieving both the quote and bibliographic
info in cit, and the (unshown) author related to cit. Other queries
can involve any combination of these factors--searching based on the
fulltext content of a q, on an author, on a work title (perhaps not
shown), on subjects, and so on and so on. I have created (individual)
indexes on all the id fields, and on any value used in a search. In
almost, if not every case, a search will involve a particular value
used for the search as well as the id fields of relevant tables; what
that means for multiple queries, I'm now not sure.

I'll stop there and try to answer the suggestions in your other
message. Thanks again for all the time you've been spending.

Best,

Jesse

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: Slow results with simple, well-indexed query

Reply via email to