>-----Original Message-----
>From: Andy [mailto:listan...@gmail.com]
>Sent: Friday, June 11, 2010 8:09 AM
>To: mysql@lists.mysql.com
>Subject: Re: MySQL For Huge Collections
>
>Hello all,
>
>Thanks much for your replies.
>
>OK, so I realized that I may not have explained the problem clearly enough.
>I will try to do it now.
>
>I am a researcher in computational linguistics, and I am trying to research
>language usage and writing styles across different genres of books over the
>years. The system I am developing is not just to serve up e-book content
>(that will happen later possibly) but to help me analyze at micro-level the
>different constituent elements of a book ( say at chapter level or paragraph
>level). As part of this work, I need to break-up, store and repeatedly run
>queries across multiple e-books. Here are several additional sample queries:
>
>* give me books that use the word "ABC"
>* give me the first 10 pages of e-book "XYZ"
>* give me chapter 1 of all e-books
>
[JS] You pose an interesting challenge. Normally, my choice is to store "big 
things" as normal files and maintain the index (with accompanying descriptive 
information) in the database. You've probably seen systems like this, where 
you assign "tags" to pictures. That would certainly handle the second two 
cases (with some ancillary programming, of course).

Your first example is a bigger challenge. MySQL can do full text searches, but 
from what I've read they can get painfully slow. I never encountered that 
problem, but my databases are rather small (~100000 rows). For this technique, 
you would want to store all of your text in LONGTEXT columns.

I've also read that there are plug-ins that do the same thing, only faster.

I'm not sure how you would define a "page" of an e-book, and I suspect you 
would also deal with individual paragraphs or lines. My suggestion for that 
would be to have a "book" table, with such things as the title and author and 
perhaps ISBN; a "page" table identifying which paragraphs are on which page 
(for a given book); a "paragraph" table identifying which lines are in which 
paragraph; and then a "lines" table that contains the actual text of each 
line.

[book1, title, ...] <-> [book1, para1] <-> [para1, line1, linetext]
[book2, title, ...]     [book1, para2]     [para1, line2, linetext]
[book3, title, ...]     [book1, para3]     [para1, line3, linetext]
...                     [book1, para4]     [para1, line4, linetext]
                        ...                [para1, line5, linetext]
                                           ...

This would let you have a full text index on the titles, and another on the 
linetext, with a number of ways to limit your searches. Because the linetext 
field would be relatively short, the search should be relatively fast even 
though there might be a relatively large number of records returned if you 
wanted to search entire books.

NOTE: Small test cases might yield surprising results because of the way full 
text searches determine relevancy! This has bitten me more than once.

This was fun, I hope my suggestions make sense.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql?unsub=arch...@jab.org

Reply via email to