RE: Appropriate Design

Geert-Jan Brits Wed, 17 Sep 2008 09:50:31 -0700

Hi Jeffrey, 

David already gave a lot of valid points. 
Table-per-documenttype seems the way to go here.

As to the 'best' db-scheme for your task given your description you have to
ask yourself a couple of questions:
Please note that with a document-type I mean a type like book, html page,
etc. 

1. with variable fields per document I gather you mean that fields can
differ from document-type to document-type right? 
But at the same time for any given document-type you can infer beforehand
what fields this document-type should contain right? 
This would be ideal for table-per-documenttype. You can add a book-table
once you start integrating books, and afterwards create a table for say
word-documents with different fields altogether. 

2. do you want your application to be able to search over all documents of
different document-types at once?
The answer is probably yes. 
This means you will need to have 1 central table which contains foreign keys
to each row in each of the table-per-documenttype tables. 
Call this table 'documents' or something (see below). 
Given that this is a requirement (I'm guessing here), you HAVE to do joins
anyway when you want to search over ALL documents of all different types.

3. given if 2 is true it would be best in my opinion to put shared fields in
the documents-table. Fields like author,title, google-like summary,
publication_date, rating, filesize, content-type, fetch-date,
Modification-date etc come to mind. 

4. normalization vs. Speed. 
Normalization is a good thing... Having said that, when you often only want
to query 1 document-type at a time, doing joins between 1
table-per-dcoumenttype table and the documents-table is a bit 
Waste of time. Especially with large number of rows (when I hear talk about
html-pages I think alot of rows ;-) .   For this you could consider keeping
the shared fields (author, publication-date, etc)
Redundant in each of the table-per-documenttype tables. 

Of course this means that you have to keep these redundant pieces of
information in a consistent state, otherwise madness lies ahead. 
For the most part your import-application should take care of this, although
perhaps other people know how to do this (partially) with MySQL.
This requires some work / testing, but it would surely be worth it. 

What you're left with is this:   

-----------------------
| documents
-----------------------
|id                     k
|documentid             fk
|documenttypeid fk
|author
|summary
|content-type
|publication-date
|fetch-date
|...

Field ID is globally unique over all documents of all different types
Field Documentid is locally unique to all documents of type documenttypeid
Field documenttypeid points to a table documenttypes where all different
documenttypes are described (i.e: 1:Book, 2:HTML, 3:WORD-DOC,etc)

----------------------
|documenttypes
--------------------------
|documenttypeid         k
|documenttype_name      

And N different table-per-documenttype tables

For instance 

------------------------------------------
| Books
------------------------------------------
| documentid                            k
| (redundant shared fields)
| book specific fields

------------------------------------------
| HTML
------------------------------------------
| documentid                            k
| (redundant shared fields)
| html specific fields (rawhtml , cleanedhtml, stripped_html,etc)

THis enables you to:
- use the documenttypeid to filter based on doucmenttypes (only books and
pdf's no problem)
- use the documenttypeid to determine with which table-per-documenttype
table to join. 
                (define a map-structure in your application which holds the
relation between documenttypeid and the table with which to join or  you
could add the table-name as
        A column to the Documenttypes-table but the lattter option would
require joining with the documenttypes-table which is otherwise not needed.

- NOTE: that you only have to join between documents-table and any of the
table-per-documenttypes tables when you require specific fields from any of
the table-per-documenttypes tables!
- use the shared fields in the documents-table to filter/sort over ALL
documents.
- query directly on 1 of the table-per-documenttype tables when you know
that there's only 1 type to query a-priori and you need the
document-specific fields.
 (and use the shared fields and specific fields from that table without
requiring a join)

All in all this approach works very well if you happen to have a
master-detail application (which is usually the case): master page shows N
results which can be clicked through to get to the actual result (which is
the detail page). 

If you think beforehand what data you want to show in the master-page
(title, author,  summary, publication date?) it may become possible to make
these fields shared (so they exist in 
The documents-table). This in turn would mean that: 
1. for any query which produces a master-page you only need to query the
documents-table. (given that you don't have to filter / sort on any of the
document-type specific fields)
2. clicking through to a detail-page requires only 1 query to the
document-type table of which the entity clicked on happens to belong
(remember I entity could only belong to 1 type)
3. If your application enables the user to specify to search in a specific
document-type, that document-type table could be used to produce the
master-page without requiring a join.

Master-detail without requiring joins.

Hope this makes sense,
Cheers,
Geert-Jan

-----Oorspronkelijk bericht-----
Van: Jeffrey Santos [mailto:[EMAIL PROTECTED] 
Verzonden: woensdag 17 september 2008 17:39
Aan: David Ashley
CC: mysql@lists.mysql.com
Onderwerp: Re: Appropriate Design

Hi David,

My project involves storing information that could have been gathered from
wildly different sources.  You can think of this part of the database as a
sort of bibliography for the sources of that information.  If I gathered
information from a book, for instance, there are some generally required
fields such as "Author," "Title," ... etc.  A website, on the other hand,
would also require fields such as "URL" and "Date Visited" and so on and so
forth.

I hope that's more clear!

~Jeffrey Santos

On Wed, Sep 17, 2008 at 9:57 AM, David Ashley <[EMAIL PROTECTED]> wrote:

> On Wed, Sep 17, 2008 at 5:46 AM, Jeffrey Santos
<[EMAIL PROTECTED]>wrote:
>
>> I'm not sure if this is the right list to ask such a question; if not,
any
>> direction as to the appropriate forum would be much appreciated!
>>
>> This may be a simple question, but I'm still on the early stages of my
>> learning about databases.  I'm developing an application that will
require
>> information from various sources.  Since what might be considered
required
>> information about those sources will vary (occasionally dramatically),
I'm
>> unsure as to the correct design for the database tables.  In other words
>> I'm
>> left with (I believe) two choices:
>>
>> 1.   Create a source table that will contain, among other things, every
>> possible field for a source type and sort out what is considered required
>> information at the application level.
>>
>> 2.   Create a few source tables detailing the required information about
>> each source type.  The only way I can think of doing this is something
>> along
>> the lines of:
>>
>> sourceFields: fieldId, fieldName, ....
>> sourceRequireds: reqId, typeId, fieldId, ....
>> sourceInfo: infoId, fieldId, sourceId, ....
>>
>> and then pull only the appropriate data from the database at query time
>> using JOINs.
>>
>> Which of these options (or a third I have no idea about!) would be
>> appropriate here?
>>
>> If possible, maybe a general "conventional wisdom" statement would
greatly
>> help my education on these matters!
>>
>> Thank you,
>>
>> Jeffrey Santos
>
>
> The second solution (multiple tables) is generally the only acceptable
> solution.
>
> The general rule (for one table or many) is whether the fields in the
> single table are fixed in quantity or perhaps are repetitive enough to be
> tedious or repetitive to manipulate.
>
> For example, assume you want a database of people in your state and the
> cars they own.  Clearly, since people vary widely in the number of cars
they
> own and since each car has similar data, puttling fields like car1vin,
> car2vin, car3vin in the people table is the wrong solution.
>
> Your problem description is rather vague, but the way I'm reading it is
> that your sources vary widely in the fields that need to be recorded.  In
> that case, you'd generally want a table of sources (each instance of a
> source), a table of fields (each field that MAY be associated with a
> source), and a table of sourcefields (the actual data for a given field
with
> a given source).
>
> But you really need to solidify the description of your problem or have a
> simplest example so people can really help you.  I don't fully understand
> your description.
>
> Dave.
>

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

RE: Appropriate Design

Reply via email to