Hi Jeffrey, David already gave a lot of valid points. Table-per-documenttype seems the way to go here.
As to the 'best' db-scheme for your task given your description you have to ask yourself a couple of questions: Please note that with a document-type I mean a type like book, html page, etc. 1. with variable fields per document I gather you mean that fields can differ from document-type to document-type right? But at the same time for any given document-type you can infer beforehand what fields this document-type should contain right? This would be ideal for table-per-documenttype. You can add a book-table once you start integrating books, and afterwards create a table for say word-documents with different fields altogether. 2. do you want your application to be able to search over all documents of different document-types at once? The answer is probably yes. This means you will need to have 1 central table which contains foreign keys to each row in each of the table-per-documenttype tables. Call this table 'documents' or something (see below). Given that this is a requirement (I'm guessing here), you HAVE to do joins anyway when you want to search over ALL documents of all different types. 3. given if 2 is true it would be best in my opinion to put shared fields in the documents-table. Fields like author,title, google-like summary, publication_date, rating, filesize, content-type, fetch-date, Modification-date etc come to mind. 4. normalization vs. Speed. Normalization is a good thing... Having said that, when you often only want to query 1 document-type at a time, doing joins between 1 table-per-dcoumenttype table and the documents-table is a bit Waste of time. Especially with large number of rows (when I hear talk about html-pages I think alot of rows ;-) . For this you could consider keeping the shared fields (author, publication-date, etc) Redundant in each of the table-per-documenttype tables. Of course this means that you have to keep these redundant pieces of information in a consistent state, otherwise madness lies ahead. For the most part your import-application should take care of this, although perhaps other people know how to do this (partially) with MySQL. This requires some work / testing, but it would surely be worth it. What you're left with is this: ----------------------- | documents ----------------------- |id k |documentid fk |documenttypeid fk |author |summary |content-type |publication-date |fetch-date |... Field ID is globally unique over all documents of all different types Field Documentid is locally unique to all documents of type documenttypeid Field documenttypeid points to a table documenttypes where all different documenttypes are described (i.e: 1:Book, 2:HTML, 3:WORD-DOC,etc) ---------------------- |documenttypes -------------------------- |documenttypeid k |documenttype_name And N different table-per-documenttype tables For instance ------------------------------------------ | Books ------------------------------------------ | documentid k | (redundant shared fields) | book specific fields ------------------------------------------ | HTML ------------------------------------------ | documentid k | (redundant shared fields) | html specific fields (rawhtml , cleanedhtml, stripped_html,etc) THis enables you to: - use the documenttypeid to filter based on doucmenttypes (only books and pdf's no problem) - use the documenttypeid to determine with which table-per-documenttype table to join. (define a map-structure in your application which holds the relation between documenttypeid and the table with which to join or you could add the table-name as A column to the Documenttypes-table but the lattter option would require joining with the documenttypes-table which is otherwise not needed. - NOTE: that you only have to join between documents-table and any of the table-per-documenttypes tables when you require specific fields from any of the table-per-documenttypes tables! - use the shared fields in the documents-table to filter/sort over ALL documents. - query directly on 1 of the table-per-documenttype tables when you know that there's only 1 type to query a-priori and you need the document-specific fields. (and use the shared fields and specific fields from that table without requiring a join) All in all this approach works very well if you happen to have a master-detail application (which is usually the case): master page shows N results which can be clicked through to get to the actual result (which is the detail page). If you think beforehand what data you want to show in the master-page (title, author, summary, publication date?) it may become possible to make these fields shared (so they exist in The documents-table). This in turn would mean that: 1. for any query which produces a master-page you only need to query the documents-table. (given that you don't have to filter / sort on any of the document-type specific fields) 2. clicking through to a detail-page requires only 1 query to the document-type table of which the entity clicked on happens to belong (remember I entity could only belong to 1 type) 3. If your application enables the user to specify to search in a specific document-type, that document-type table could be used to produce the master-page without requiring a join. Master-detail without requiring joins. Hope this makes sense, Cheers, Geert-Jan -----Oorspronkelijk bericht----- Van: Jeffrey Santos [mailto:[EMAIL PROTECTED] Verzonden: woensdag 17 september 2008 17:39 Aan: David Ashley CC: mysql@lists.mysql.com Onderwerp: Re: Appropriate Design Hi David, My project involves storing information that could have been gathered from wildly different sources. You can think of this part of the database as a sort of bibliography for the sources of that information. If I gathered information from a book, for instance, there are some generally required fields such as "Author," "Title," ... etc. A website, on the other hand, would also require fields such as "URL" and "Date Visited" and so on and so forth. I hope that's more clear! ~Jeffrey Santos On Wed, Sep 17, 2008 at 9:57 AM, David Ashley <[EMAIL PROTECTED]> wrote: > On Wed, Sep 17, 2008 at 5:46 AM, Jeffrey Santos <[EMAIL PROTECTED]>wrote: > >> I'm not sure if this is the right list to ask such a question; if not, any >> direction as to the appropriate forum would be much appreciated! >> >> This may be a simple question, but I'm still on the early stages of my >> learning about databases. I'm developing an application that will require >> information from various sources. Since what might be considered required >> information about those sources will vary (occasionally dramatically), I'm >> unsure as to the correct design for the database tables. In other words >> I'm >> left with (I believe) two choices: >> >> 1. Create a source table that will contain, among other things, every >> possible field for a source type and sort out what is considered required >> information at the application level. >> >> 2. Create a few source tables detailing the required information about >> each source type. The only way I can think of doing this is something >> along >> the lines of: >> >> sourceFields: fieldId, fieldName, .... >> sourceRequireds: reqId, typeId, fieldId, .... >> sourceInfo: infoId, fieldId, sourceId, .... >> >> and then pull only the appropriate data from the database at query time >> using JOINs. >> >> Which of these options (or a third I have no idea about!) would be >> appropriate here? >> >> If possible, maybe a general "conventional wisdom" statement would greatly >> help my education on these matters! >> >> Thank you, >> >> Jeffrey Santos > > > The second solution (multiple tables) is generally the only acceptable > solution. > > The general rule (for one table or many) is whether the fields in the > single table are fixed in quantity or perhaps are repetitive enough to be > tedious or repetitive to manipulate. > > For example, assume you want a database of people in your state and the > cars they own. Clearly, since people vary widely in the number of cars they > own and since each car has similar data, puttling fields like car1vin, > car2vin, car3vin in the people table is the wrong solution. > > Your problem description is rather vague, but the way I'm reading it is > that your sources vary widely in the fields that need to be recorded. In > that case, you'd generally want a table of sources (each instance of a > source), a table of fields (each field that MAY be associated with a > source), and a table of sourcefields (the actual data for a given field with > a given source). > > But you really need to solidify the description of your problem or have a > simplest example so people can really help you. I don't fully understand > your description. > > Dave. > -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]