Tito Ciuro wrote:


I have no idea why you're so angry. Anyway, there are so many things I can think of saying, I'll just make it brief and to the point.

This thing won't scale. I'd like to see it when you have the 4.5 million records my database contains,
and that is still tiny, for all intents and purposes.


For the type of application I'm building, it doesn't need to scale. At most I'll have 25K records. Even with the 100K-record database I've used for testing, it simply flies.

I'm not building anything for the outside world. Just something that serves me well.

I'm storing variable-length data, where each record has a very different set of attributes: some may have one attribute, others tens of them, perhaps even a hundred. Using a column per attribute is not a good idea. A few days ago I asked this question and Dr. Hipp mentioned:

The more tables you have, the slower the first query will run
and the more memory SQLite will use.  For long-running applications
where the startup time is not a significant factor, 100s or
1000s of tables is fine.  For a CGI script that starts itself
up anew several times per second, then you should try to keep
the number of tables below a 100, I think.  Less than that if
you can. You should also try and keep down the number of tables
in low-memory embedded applications, in order to save on memory
usages.  Each table takes a few hundred bytes of memory - depending
on the number of columns and features.


Having two columns (one for the key and the other for the data itself) seems like a good balance between speed and ease of use. I don't care whether it scales, because the intended deployment is 25K records at most, as I said earlier. Even data sets 4x that size work fine.

There is absolutely no reason this could not be properly designed and STILL fit any possible need for that ugly packed record.


Tito,

While the tone of MGC's post may have been a little over the top, many of his points were valid.

All he was really suggesting is that a set of normalized tables would do the same thing, and probably perform better, especially when scaled to larger tables.

In your application you have 25K files each with 1 to 100 attributes. You can still store your data in two normalized tables.

The main idea is to use more than one row in the value table for each file. These values are all tied to the file name by a common file id value. This table schema is shown below.

   -- basic two table attribute setup

   create table File (
       Id          integer primary key,
       Name        text                         -- name of file including path
   );

   create table Attribute (
       FileId      integer references File(Id), -- foreign key
       Name        text,                        -- name of attribute
       Value       text,                        -- value of attribute
       primary key (FileId, Name)
   );

   create index AttributeValue on Attribute(Name, Value);

With these tables you will have 25K rows in the File table, one per file, and 250K rows in the Attribute table assuming an average of 10 attributes per file (your example had 14).

An index on the attribute name and value pairs will greatly speed up lookups of a particular attribute.
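
For example, a single file with a handful of attributes might be stored something like this (the file name and attribute names here are made up, just to show the one-row-per-attribute layout and the indexed lookup):

   -- hypothetical sample data: one Attribute row per attribute of the file
   insert into File (Id, Name) values (1, '/Users/tito/notes.txt');

   insert into Attribute (FileId, Name, Value) values (1, 'kSize', '1024');
   insert into Attribute (FileId, Name, Value) values (1, 'kModDate', '2006-06-15');
   insert into Attribute (FileId, Name, Value) values (1, 'kPermissions', 'rw-r--r--');

   -- the (Name, Value) index makes a lookup like this fast
   select FileId from Attribute where Name = 'kSize' and Value = '1024';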

There is still some redundancy in this set of tables since the names of the attributes are being repeated over and over in the Attribute table. This may not be a big problem, but you really only have 100 attributes, so there is no sense storing 25K copies of the name of a common attribute that applies to all files. This information can be factored out into a third table as shown below.


   -- Factor out the attribute definitions into a separate table

   create table File (
       Id          integer primary key,
       Name        text                    -- name of file including path
   );

   create table AttributeDefinition (
       Id          integer primary key,
       Name        text unique,            -- name of the attribute
       Type        text                    -- type of the attribute
   );

   create table Attribute (
       FileId      integer references File(Id),
       AttId       integer references AttributeDefinition(Id),
       Value       text,                   -- value of attribute
       primary key (FileId, AttId)
   );

   create index AttributeValue on Attribute(AttId, Value);

This will give you three tables (which will not cause a noticeable difference in the database open time): one with 25K rows of file names, one with 100 rows of attribute definition data (including the attribute names), and one with 250K rows of attribute value data.

The third table assigns each attribute an id number which will fit in a single byte, and that id is used to replace the 25K copies of the attribute name. This will be a sizable reduction in stored data for all but the shortest names. It also allows other definition data to be stored with each attribute name, such as the type column in the table above.
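
To make the id substitution concrete, here is a rough sketch of how the rows might be populated (the attribute names, types, and values are made up for illustration):

   -- hypothetical attribute definitions, created once up front
   insert into AttributeDefinition (Id, Name, Type) values (1, 'kSize', 'integer');
   insert into AttributeDefinition (Id, Name, Type) values (2, 'kModDate', 'date');

   -- each value row stores the small integer id instead of the attribute name
   insert into Attribute (FileId, AttId, Value)
   select 1, Id, '1024' from AttributeDefinition where Name = 'kSize';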

There are also several indexes (the primary keys, the unique constraint on AttributeDefinition.Name, and the AttributeValue index) which will speed up queries and enforce uniqueness constraints on the data.

Now you can quickly locate all the attributes of a file, or get all the files whose value for a certain attribute falls within a range of values, using SQL to join the tables.

   -- get all attributes of a file
   select AttributeDefinition.Name, Attribute.Value
   from File
   join Attribute on Attribute.FileId = File.Id
   join AttributeDefinition on Attribute.AttId = AttributeDefinition.Id
   where File.Name = :file;

   -- get all files with an attribute in a range of values
   select File.Name
   from AttributeDefinition
   join Attribute on Attribute.AttId = AttributeDefinition.Id
   join File on Attribute.FileId = File.Id
   where AttributeDefinition.Name = :attribute
   and Attribute.Value between :min and :max;

You now have the full power of SQL available to extract information from the data. You can find the min and max values of a certain attribute, for example the modification date. You can find all the files with a particular value of an attribute, such as a permission setting. You can count the files that have one attribute but not another, etc.
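
For instance, sketches of such queries against the three-table schema might look like this (again using the made-up attribute names from above):

   -- min and max values of one attribute across all files
   -- (Value is text, so ISO-style dates compare correctly)
   select min(Attribute.Value), max(Attribute.Value)
   from Attribute
   join AttributeDefinition on Attribute.AttId = AttributeDefinition.Id
   where AttributeDefinition.Name = 'kModDate';

   -- count files that have one attribute but not another
   select count(*)
   from File
   where Id in (select FileId from Attribute
                join AttributeDefinition on Attribute.AttId = AttributeDefinition.Id
                where AttributeDefinition.Name = 'kSize')
     and Id not in (select FileId from Attribute
                    join AttributeDefinition on Attribute.AttId = AttributeDefinition.Id
                    where AttributeDefinition.Name = 'kModDate');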

If you give this a try, you might be surprised at how flexible it is.

Just because something works doesn't mean that it can't be improved.

Dennis Cote
