>-----Original Message----- >From: Andy [mailto:listan...@gmail.com] >Sent: Friday, June 11, 2010 8:09 AM >To: mysql@lists.mysql.com >Subject: Re: MySQL For Huge Collections > >Hello all, > >Thanks much for your replies. > >OK, so I realized that I may not have explained the problem clearly enough. >I will try to do it now. > >I am a researcher in computational linguistics, and I am trying to research >language usage and writing styles across different genres of books over the >years. The system I am developing is not just to serve up e-book content >(that will happen later possibly) but to help me analyze at micro-level the >different constituent elements of a book ( say at chapter level or paragraph >level). As part of this work, I need to break-up, store and repeatedly run >queries across multiple e-books. Here are several additional sample queries: > >* give me books that use the word "ABC" >* give me the first 10 pages of e-book "XYZ" >* give me chapter 1 of all e-books > [JS] You pose an interesting challenge. Normally, my choice is to store "big things" as normal files and maintain the index (with accompanying descriptive information) in the database. You've probably seen systems like this, where you assign "tags" to pictures. That would certainly handle the second two cases (with some ancillary programming, of course).
Your first example is a bigger challenge. MySQL can do full text searches, but from what I've read they can get painfully slow. I never encountered that problem, but my databases are rather small (~100000 rows). For this technique, you would want to store all of your text in LONGTEXT columns. I've also read that there are plug-ins that do the same thing, only faster. I'm not sure how you would define a "page" of an e-book, and I suspect you would also deal with individual paragraphs or lines. My suggestion for that would be to have a "book" table, with such things as the title and author and perhaps ISBN; a "page" table identifying which paragraphs are on which page (for a given book); a "paragraph" table identifying which lines are in which paragraph; and then a "lines" table that contains the actual text of each line. [book1, title, ...] <-> [book1, para1] <-> [para1, line1, linetext] [book2, title, ...] [book1, para2] [para1, line2, linetext] [book3, title, ...] [book1, para3] [para1, line3, linetext] ... [book1, para4] [para1, line4, linetext] ... [para1, line5, linetext] ... This would let you have a full text index on the titles, and another on the linetext, with a number of ways to limit your searches. Because the linetext field would be relatively short, the search should be relatively fast even though there might be a relatively large number of records returned if you wanted to search entire books. NOTE: Small test cases might yield surprising results because of the way full text searches determine relevancy! This has bitten me more than once. This was fun, I hope my suggestions make sense. Regards, Jerry Schwartz Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 www.the-infoshop.com -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/mysql?unsub=arch...@jab.org