From: Vikram A [mailto:vikkiatb...@yahoo.in] 
Sent: Wednesday, June 16, 2010 2:58 AM
To: je...@gii.co.jp; Andy; mysql@lists.mysql.com
Subject: Re: MySQL For Huge Collections

 

Hi All,

In this case, how would the images of a book be stored? A chapter may contain 
any number of images of different sizes.

Or does it deal only with text?

[JS] I was only thinking about text, but you can extend the idea to handle 
images by adding another table. Let’s assume that you want to associate each 
image with a line. Just add a table with a blob field in each record. Put the 
image in the blob and link it to the nearest line. A line record could link to 
any number of images, from zero to infinity.

The image table would have to have a lot more information than just the blob 
and the line number, of course. You’d need all kinds of page layout information 
for presentation purposes: is the image in line with the text, on the left, on 
the right, in the middle, below the text, etc. This is getting very complicated.
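
A minimal sketch of the image-table idea, using Python's built-in sqlite3 
module as a stand-in for MySQL so that it runs without a server. The table and 
column names (line, image, placement, data) are assumptions made for 
illustration and do not come from the discussion itself; in MySQL the blob 
column would more likely be a MEDIUMBLOB or LONGBLOB, since a plain BLOB tops 
out at 64 KB.

```python
import sqlite3


def build_image_schema(conn):
    """Create a line table and an image table that points back at it."""
    conn.executescript("""
        CREATE TABLE line (
            line_id   INTEGER PRIMARY KEY,
            line_text TEXT NOT NULL
        );
        CREATE TABLE image (
            image_id  INTEGER PRIMARY KEY,
            line_id   INTEGER NOT NULL REFERENCES line(line_id),
            placement TEXT NOT NULL,   -- hypothetical layout hint: 'inline', 'left', 'below', ...
            data      BLOB NOT NULL    -- the image bytes themselves
        );
    """)


def images_for_line(conn, line_id):
    """Return (image_id, placement, size_in_bytes) for every image on a line."""
    rows = conn.execute(
        "SELECT image_id, placement, length(data) "
        "FROM image WHERE line_id = ? ORDER BY image_id",
        (line_id,),
    )
    return rows.fetchall()


# Demo with made-up data: line 1 has two images, line 2 has none.
conn = sqlite3.connect(":memory:")
build_image_schema(conn)
conn.execute("INSERT INTO line VALUES (1, 'See the figure below.')")
conn.execute("INSERT INTO line VALUES (2, 'A line with no images.')")
conn.execute(
    "INSERT INTO image (line_id, placement, data) VALUES (?, ?, ?)",
    (1, "below", b"\x89PNG fake bytes"),
)
conn.execute(
    "INSERT INTO image (line_id, placement, data) VALUES (?, ?, ?)",
    (1, "left", b"GIF89a fake"),
)
print(images_for_line(conn, 1))
print(images_for_line(conn, 2))
```

A single placement column is only a placeholder for the layout information 
mentioned above; a real presentation layer would need considerably more.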

If you’re going to have images, then you can’t be starting with plain text. 
Depending upon the format of the original data, you might consider storing 
everything as HTML. That would make it somewhat more complicated to detect line 
boundaries, but it would preserve the layout for eventual presentation. You’ve 
just complicated the whole process enormously.

 

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com

 

 

Thanks.

Vikram A

  _____  

From: Jerry Schwartz <je...@gii.co.jp>
To: Andy <listan...@gmail.com>; mysql@lists.mysql.com
Sent: Fri, 11 June, 2010 9:05:26 PM
Subject: RE: MySQL For Huge Collections

>-----Original Message-----
>From: Andy [mailto:listan...@gmail.com]
>Sent: Friday, June 11, 2010 8:09 AM
>To: mysql@lists.mysql.com
>Subject: Re: MySQL For Huge Collections
>
>Hello all,
>
>Thanks much for your replies.
>
>OK, so I realized that I may not have explained the problem clearly enough.
>I will try to do it now.
>
>I am a researcher in computational linguistics, and I am trying to research
>language usage and writing styles across different genres of books over the
>years. The system I am developing is not just to serve up e-book content
>(that will happen later, possibly) but to help me analyze at micro-level the
>different constituent elements of a book (say, at chapter level or paragraph
>level). As part of this work, I need to break up, store, and repeatedly run
>queries across multiple e-books. Here are several additional sample queries:
>
>* give me books that use the word "ABC"
>* give me the first 10 pages of e-book "XYZ"
>* give me chapter 1 of all e-books
>
[JS] You pose an interesting challenge. Normally, my choice is to store "big 
things" as normal files and maintain the index (with accompanying descriptive 
information) in the database. You've probably seen systems like this, where 
you assign "tags" to pictures. That would certainly handle your second and 
third examples (with some ancillary programming, of course).
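
The files-plus-index approach could be sketched roughly as follows, again with 
Python's built-in sqlite3 module standing in for MySQL. Everything named here 
is an assumption for illustration: the chapter_file table, its columns, and the 
file paths are hypothetical, and the sketch assumes each chapter has already 
been split into its own file on disk.

```python
import sqlite3


def build_file_index(conn):
    """Create an index table; the chapter text itself stays in ordinary files."""
    conn.execute("""
        CREATE TABLE chapter_file (
            book_title TEXT    NOT NULL,
            chapter_no INTEGER NOT NULL,
            path       TEXT    NOT NULL,   -- where the chapter's file lives on disk
            PRIMARY KEY (book_title, chapter_no)
        )
    """)


def chapter_paths(conn, chapter_no):
    """Return (book_title, path) for the given chapter of every indexed book."""
    rows = conn.execute(
        "SELECT book_title, path FROM chapter_file "
        "WHERE chapter_no = ? ORDER BY book_title",
        (chapter_no,),
    )
    return rows.fetchall()


# Demo with made-up titles and paths.
conn = sqlite3.connect(":memory:")
build_file_index(conn)
conn.executemany(
    "INSERT INTO chapter_file VALUES (?, ?, ?)",
    [
        ("XYZ", 1, "books/xyz/ch01.txt"),
        ("XYZ", 2, "books/xyz/ch02.txt"),
        ("Other", 1, "books/other/ch01.txt"),
    ],
)
# "Give me chapter 1 of all e-books" becomes a lookup of file paths.
print(chapter_paths(conn, 1))
```

The "ancillary programming" is then whatever opens the returned paths and 
serves or analyzes their contents.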

Your first example is a bigger challenge. MySQL can do full-text searches, but 
from what I've read they can get painfully slow. I never encountered that 
problem, but my databases are rather small (~100,000 rows). For this technique, 
you would want to store all of your text in LONGTEXT columns.

I've also read that there are plug-ins that do the same thing, only faster.

I'm not sure how you would define a "page" of an e-book, and I suspect you 
would also deal with individual paragraphs or lines. My suggestion for that 
would be to have a "book" table, with such things as the title and author and 
perhaps ISBN; a "page" table identifying which paragraphs are on which page 
(for a given book); a "paragraph" table identifying which lines are in which 
paragraph; and then a "lines" table that contains the actual text of each 
line.

[book1, title, ...] <-> [book1, para1] <-> [para1, line1, linetext]
[book2, title, ...]     [book1, para2]     [para1, line2, linetext]
[book3, title, ...]     [book1, para3]     [para1, line3, linetext]
...                     [book1, para4]     [para1, line4, linetext]
                        ...                [para1, line5, linetext]
                                           ...

This would let you have a full-text index on the titles, and another on the 
linetext, with a number of ways to limit your searches. Because the linetext 
field would be short, each search should be fast, even though a search across 
entire books might return a large number of records.
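
One way the book / page / paragraph / lines layout described above might look, 
sketched with Python's built-in sqlite3 module as a stand-in for MySQL. The 
column names and sample rows are assumptions for illustration. Note also that 
the word search below uses a plain LIKE substring match in place of a MySQL 
full-text index, so it behaves differently (no relevancy ranking, and it 
matches partial words).

```python
import sqlite3

SCHEMA = """
CREATE TABLE book (
    book_id INTEGER PRIMARY KEY,
    title   TEXT NOT NULL,
    author  TEXT
);
CREATE TABLE lines (         -- the actual text, one short row per line
    line_id  INTEGER PRIMARY KEY,
    linetext TEXT NOT NULL
);
CREATE TABLE paragraph (     -- which lines are in which paragraph
    para_id INTEGER NOT NULL,
    line_no INTEGER NOT NULL,
    line_id INTEGER NOT NULL REFERENCES lines(line_id)
);
CREATE TABLE page (          -- which paragraphs are on which page of a book
    book_id INTEGER NOT NULL REFERENCES book(book_id),
    page_no INTEGER NOT NULL,
    para_id INTEGER NOT NULL
);
"""

# Shared join path: book -> page -> paragraph -> lines.
JOINS = """
FROM book b
JOIN page p      ON p.book_id = b.book_id
JOIN paragraph g ON g.para_id = p.para_id
JOIN lines l     ON l.line_id = g.line_id
"""


def books_using_word(conn, word):
    """Titles of books with at least one line containing `word` (substring match)."""
    sql = ("SELECT DISTINCT b.title " + JOINS +
           "WHERE l.linetext LIKE '%' || ? || '%' ORDER BY b.title")
    return [row[0] for row in conn.execute(sql, (word,))]


def first_pages(conn, title, n_pages):
    """(page_no, linetext) rows for the first `n_pages` pages of one book."""
    sql = ("SELECT p.page_no, l.linetext " + JOINS +
           "WHERE b.title = ? AND p.page_no <= ? "
           "ORDER BY p.page_no, g.para_id, g.line_no")
    return conn.execute(sql, (title, n_pages)).fetchall()


# Demo with made-up data: book "XYZ" has two pages, book "Other" has one.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO book (book_id, title) VALUES (?, ?)",
                 [(1, "XYZ"), (2, "Other")])
conn.executemany("INSERT INTO lines VALUES (?, ?)",
                 [(1, "The ABC of style."), (2, "Second line."),
                  (3, "Nothing here."), (4, "Page two text.")])
conn.executemany("INSERT INTO paragraph VALUES (?, ?, ?)",
                 [(1, 1, 1), (1, 2, 2), (2, 1, 4), (3, 1, 3)])
conn.executemany("INSERT INTO page VALUES (?, ?, ?)",
                 [(1, 1, 1), (1, 2, 2), (2, 1, 3)])

print(books_using_word(conn, "ABC"))
print(first_pages(conn, "XYZ", 1))
```

In this sketch a paragraph that spans two pages would simply be listed under 
both pages; whether that is acceptable depends on how a "page" ends up being 
defined.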

NOTE: Small test cases might yield surprising results because of the way full 
text searches determine relevancy! This has bitten me more than once.

This was fun; I hope my suggestions make sense.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com





 
