Hey all,

I've been experimenting with the best way to store a large amount (~100GB) of data for 
retrieval. Essentially, I'm storing 9 variables for approximately 1,000,000 locations 
per day for the last ten years. That works out to around 4MB per variable per day - though 
not all variables are always present (some can be null).
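
(Rough arithmetic, for scale: 1,000,000 locations x 4 bytes per float is the ~4MB per 
variable per day, and 4MB x 9 variables x ~3,650 days comes to roughly 130GB if every 
variable were always present - the nulls are what bring it down to around 100GB.)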

(If you don't care about the details, I'll summarize at the end of this email).


Inserting into and retrieving from a MyISAM table was shaping up to be impossible. (Not 
totally surprising.) I originally had a table along the lines of:

create table basic_daily_report(
date DATE NOT NULL,
location_id MEDIUMINT UNSIGNED NOT NULL,
variable1 FLOAT,
variable2 FLOAT,
...
variable9 FLOAT,
primary key (date, location_id)
);

(Just a summary of the actual table)

With this I had a maximum table size of around 100GB - just barely enough to fit everything. 
I expected I would end up segmenting by year, or something similarly ugly. I also tested 
InnoDB as an alternative, but we'll get to that in a second. Basically, inserting a day's 
worth of data (a million rows) took ages, and pretty much required running ANALYZE TABLE 
for a couple of hours every morning. Selects were getting pretty slow as well. Eventually, 
I hit on the idea of storing just one row per day:


create table basic_daily_grid(
date DATE NOT NULL PRIMARY KEY,
variable1 MEDIUMBLOB,
variable2 MEDIUMBLOB,
...
variable9 MEDIUMBLOB
);
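
Each variableN column holds the entire day's grid as one packed array of 4-byte floats 
indexed by location_id, so each blob is the ~4MB mentioned above. In rough outline, the 
loader just does something like this before the buffer goes in as the MEDIUMBLOB value 
(a sketch - NUM_LOCATIONS and pack_variable are made-up names for illustration):

#include <string.h>

#define NUM_LOCATIONS 1000000                        /* fixed grid size */
#define GRID_BYTES (NUM_LOCATIONS * sizeof(float))   /* ~4MB per variable per day */

/* Pack one day's values for a single variable into the buffer that gets
 * stored as that variable's MEDIUMBLOB.  values[] is indexed by
 * location_id, so a later lookup is just an offset into the blob. */
void pack_variable(const float *values, unsigned char *blob)
{
    memcpy(blob, values, GRID_BYTES);
}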

I then wrote a UDF: you pass it a location and one of the variable columns, and it returns 
the value at that location. This works well because every day has a constant number of 
locations in a grid format, so it's simply an array lookup. So, 

select grid_point(location_id, variable1) from basic_daily_grid where date=20030101

would return the right value for that location_id. It turns out this is almost (95%) as fast 
as the first version for selects, with the added bonus that inserts now take only 
~5 seconds per day!
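
For the curious, the UDF is nothing fancy. In rough outline it boils down to this 
(a simplified sketch using the standard MySQL UDF interface; it gets loaded with the 
usual CREATE FUNCTION grid_point RETURNS REAL SONAME ... dance):

#include <string.h>
#include <mysql.h>

/* Sketch of grid_point(location_id, variable_blob): each variableN blob is
 * a packed array of 4-byte floats, one per location, indexed by location_id. */

my_bool grid_point_init(UDF_INIT *initid, UDF_ARGS *args, char *message)
{
    if (args->arg_count != 2) {
        strcpy(message, "usage: grid_point(location_id, variable_blob)");
        return 1;
    }
    args->arg_type[0] = INT_RESULT;      /* location_id */
    args->arg_type[1] = STRING_RESULT;   /* the packed float array */
    initid->maybe_null = 1;              /* a variable can be null */
    return 0;
}

double grid_point(UDF_INIT *initid, UDF_ARGS *args, char *is_null, char *error)
{
    long long loc;
    float value;

    if (args->args[0] == NULL || args->args[1] == NULL) {
        *is_null = 1;                    /* null location or null blob */
        return 0.0;
    }
    loc = *((long long *) args->args[0]);
    if (loc < 0 || (unsigned long) (loc + 1) * sizeof(float) > args->lengths[1]) {
        *is_null = 1;                    /* location outside the grid */
        return 0.0;
    }
    memcpy(&value, args->args[1] + loc * sizeof(float), sizeof(float));
    return (double) value;
}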




Alas, after inserting 260 days (less than a year) I hit the MyISAM table size limit of 
4.2GB - the limit that applies here because a BLOB makes the rows variable-length.


-----------------------------------------------------------------


I mention all of the above in case someone has an alternative solution I'm overlooking. 
The possible solutions I've found are: a) use InnoDB instead, b) work with the source to 
create a new constant-width BLOBARRAY field type instead of a dynamic one, or c) work with 
the source to somehow overcome the 4.2GB limit on a dynamic-row table.


c) Seems unlikely - if the actual developers can't do it, I probably can't.
b) Seems possible; I assume no one has seen a need for a constant-width 4MB column, so 
hopefully it's not too difficult.
a) Was my first try. Inserting takes about twice as long as MyISAM... sure, I can deal 
with that. Selecting a specific date is in the same ballpark as well, so little 
problem there. What I'm having severe performance issues with are queries that group or 
do a count(*). For example:


mysql> explain select date from basic_daily_grid_innodb;
+-------------------------+-------+---------------+---------+---------+------+------+-------------+
| table                   | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+-------------------------+-------+---------------+---------+---------+------+------+-------------+
| basic_daily_grid_innodb | index | NULL          | PRIMARY | 3       | NULL |  301 | Using index |
+-------------------------+-------+---------------+---------+---------+------+------+-------------+




mysql> select date from basic_daily_grid_innodb;
...
317 rows in set (0.00 sec)

mysql> explain select date, count(*) from basic_daily_grid_innodb group by date;
+-------------------------+-------+---------------+---------+---------+------+------+-------------+
| table                   | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+-------------------------+-------+---------------+---------+---------+------+------+-------------+
| basic_daily_grid_innodb | index | NULL          | PRIMARY | 3       | NULL |  301 | Using index |
+-------------------------+-------+---------------+---------+---------+------+------+-------------+


mysql> select date, count(*) from basic_daily_grid_innodb group by date;
...
317 rows in set (2 min 54.95 sec)


I assume this is due to versioning or some other transactional feature. Or is it a bug, or 
am I doing something wrong? I don't quite see why grouping rows that are all unique should 
be that much slower than not grouping at all. I need InnoDB for the unlimited table size, 
but I don't (really) need transactions, commit/rollback, or checkpoints.


Any suggestions on clearing this last hurdle? It's entirely likely I'll need to group by 
year and average the results, or something similar - and at 3 minutes a pop, that's a 
little high for my liking. Perhaps I'm expecting too much?


Thanks,
Nick Elliott
