Tito Ciuro wrote:
I have no idea why you're so angry. Anyway, there are so many things I could say, but I'll just keep it brief and to the point.
This thing won't scale. I'd like to see it when you have the 4.5 million records my database contains, and that is still tiny for all intents and purposes.
For the type of application I'm building, it doesn't need to scale.
At most I'll have 25K records. Even with the 100K database I've used
for testing, it simply flies.
I'm not building anything for the outside world. Just something that
serves me well.
I'm storing variable-length data with very different sets of attributes. Some records may have one attribute, others tens of them, perhaps even a hundred. Using a column per attribute is not a good idea. A few days ago I asked about this and Dr. Hipp mentioned:
The more tables you have, the slower the first query will run
and the more memory SQLite will use. For long-running applications
where the startup time is not a significant factor, 100s or
1000s of tables is fine. For a CGI script that starts itself
up anew several times per second, you should try to keep
the number of tables below 100, I think. Less than that if
you can. You should also try to keep down the number of tables
in low-memory embedded applications, in order to save on memory
usage. Each table takes a few hundred bytes of memory, depending
on the number of columns and features.
Having two columns (one for the key and the other for the data
itself) seems like a good balance between speed and ease of use. I
don't care if it doesn't scale because the intended deployment is 25K
records at most, as I said earlier. Even data sets 4x that size work fine.
There is absolutely no reason this could not be properly designed
and STILL fit any possible need for that ugly packed record.
Tito,
While the tone of MGC's post may have been a little over the top, many
of his points were valid.
All he was really suggesting is that a set of normalized tables would do
the same thing, and probably perform better, especially when scaled to
larger tables.
In your application you have 25K files, each with 1 to 100 attributes.
You can still store your data in two normalized tables.
The main idea is to use more than one row in the value table for each
file. These values are all tied to the file name by a common file id
value. This table schema is shown below.
-- basic two table attribute setup
create table File (
    Id   integer primary key,
    Name text                               -- name of file, including path
);

create table Attribute (
    FileId integer references File(Id),     -- foreign key to File
    Name   text,                            -- name of attribute
    Value  text,                            -- value of attribute
    primary key (FileId, Name)
);

create index AttributeValue on Attribute(Name, Value);
With these tables you will have 25K rows in the File table, one per
file, and 250K rows in the Attribute table assuming an average of 10
attributes per file (your example had 14).
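To make that concrete, here is a rough sketch of how one file and a few of
its attributes would be stored as rows (the file path, attribute names, and
values are just made-up examples):

-- hypothetical sample data: one file and three of its attributes
insert into File (Id, Name) values (1, '/home/tito/docs/report.txt');

insert into Attribute (FileId, Name, Value) values (1, 'size', '18324');
insert into Attribute (FileId, Name, Value) values (1, 'kind', 'Plain Text');
insert into Attribute (FileId, Name, Value) values (1, 'modification_date', '2005-11-01 10:15:00');

A file with more attributes simply gets more rows in the Attribute table,
and a file with fewer gets fewer, so nothing is wasted on attributes a
particular file doesn't have.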
An index on the attribute name and value pairs will greatly speed up
lookups of a particular attribute.
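For example, a lookup like the following (the attribute name is hypothetical)
can be answered almost entirely from that index:

-- find all files whose 'size' attribute falls in a given range
select File.Name
from Attribute
join File on File.Id = Attribute.FileId
where Attribute.Name = 'size'
  and Attribute.Value between :min and :max;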
There is still some redundancy in this set of tables since the names of
the attributes are being repeated over and over in the Attribute table.
This may not be a big problem, but you really only have 100 attributes,
so there is no sense storing 25K copies of the name of a common
attribute that applies to all files. This information can be factored
out into a third table as shown below.
-- Factor out the attribute definitions into a separate table
create table File (
    Id   integer primary key,
    Name text                               -- name of file, including path
);

create table AttributeDefinition (
    Id   integer primary key,
    Name text unique,                       -- name of the attribute
    Type text                               -- type of the attribute
);

create table Attribute (
    FileId integer references File(Id),
    AttId  integer references AttributeDefinition(Id),
    Value  text,                            -- value of attribute
    primary key (FileId, AttId)
);

create index AttributeValue on Attribute(AttId, Value);
This will give you three tables (which will not cause a noticeable
difference in the database open time): one with 25K rows of file names,
one with 100 rows of attribute definition data (including the attribute
names), and one with 250K rows of attribute value data.
The new AttributeDefinition table assigns each attribute an id number, which
will fit in a single byte, and that id is used in place of the 25K copies of
the attribute name. This is a sizable reduction in stored data for all
but the shortest names. It also allows other definition data to be
stored with each attribute name, such as the Type column in the table above.
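As a sketch (again with made-up attribute names, and assuming the File row
with Id 1 from the earlier example), the definitions are inserted once and
the value rows then refer to them by id:

-- define each attribute once
insert into AttributeDefinition (Id, Name, Type) values (1, 'size', 'integer');
insert into AttributeDefinition (Id, Name, Type) values (2, 'modification_date', 'date');

-- value rows carry the small integer id instead of repeating the name
insert into Attribute (FileId, AttId, Value) values (1, 1, '18324');
insert into Attribute (FileId, AttId, Value) values (1, 2, '2005-11-01 10:15:00');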
There are also several indexes which will speed up queries and enforce
uniqueness constraints on the data.
Now you can quickly locate all the attributes of a file, or get all the
files whose value for a certain attribute falls within a range of values,
by using SQL to join the tables.
-- get all attributes of a file
select AttributeDefinition.Name, Attribute.Value
from File
join Attribute on Attribute.FileId = File.Id
join AttributeDefinition on Attribute.AttId = AttributeDefinition.Id
where File.Name = :file;
-- get all files with an attribute in a range of values
select File.Name
from AttributeDefinition
join Attribute on Attribute.AttId = AttributeDefinition.Id
join File on Attribute.FileId = File.Id
where AttributeDefinition.Name = :attribute
and Attribute.Value between :min and :max;
You now have the full power of SQL available to extract information from
the data. You can find the min and max values of a certain attribute,
for example the modification date. You can find all the files with a
particular value of an attribute, such as a permission setting. You can
count the files that have one attribute but not another, and so on.
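For instance, two of these could be written roughly as follows (the
attribute names are again only examples):

-- min and max modification date over all files
select min(Attribute.Value), max(Attribute.Value)
from Attribute
join AttributeDefinition on Attribute.AttId = AttributeDefinition.Id
where AttributeDefinition.Name = 'modification_date';

-- count files that have a 'permissions' attribute but no 'owner' attribute
select count(*)
from File
where exists (select 1 from Attribute
              join AttributeDefinition on Attribute.AttId = AttributeDefinition.Id
              where Attribute.FileId = File.Id
                and AttributeDefinition.Name = 'permissions')
  and not exists (select 1 from Attribute
                  join AttributeDefinition on Attribute.AttId = AttributeDefinition.Id
                  where Attribute.FileId = File.Id
                    and AttributeDefinition.Name = 'owner');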
If you give this a try, you might be surprised at how flexible it is.
Just because something works doesn't mean that it can't be improved.
Dennis Cote