Re: Most efficient way of handling a large dataset

Mark Goodge Sat, 25 Oct 2008 08:45:46 -0700

Joerg Bruehe wrote:

Hi Mark, all!



Mark Goodge wrote:

I'd appreciate some advice on how best to handle a biggish dataset
consisting of around 5 million lines. At the moment, I have a single
table consisting of four fields and one primary key:

partcode varchar(20)
region varchar(10)
location varchar(50)
qty int(11)
PRIMARY KEY (partcode, region, location)

The biggest variable is partcode, with around 80,000 distinct values.
For statistical purposes, I need to be able to select a sum(qty) based
on the other three fields (eg, "select sum(qty) from mytable where
partcode ='x' and region = 'y' and location = 'z'") as well as
generating a list of partcodes and total quantities in each region and
location (eg, "select sum(qty), partcode from mytable where region = 'y'
and location = 'z' group by partcode").


Sorry, I don't get it - I think there is a contradiction in your mail:

Your table has four fields, three of which form the primary key.
This means that for any combination of values for those three fields
(partcode, region, location) there will be at most one row, and so only
one qty value. Why do you want to sum over one value?

Sorry, my mistake. I mean that I need to be able to sum any combinationof two from three of the first three fields.

And in your second query there is also no need for "sum(qty)", a plain
"qty" will do because for each group there will be only one row (region
and location are set to fixed values in your statement).

IMO, the main question is whether all your statements use fixed values
for region and location (like your second statement), but only some do
for partcode (your first statement).


No; any of the three can be fixed or not, as the case may be.

If that holds true for your statements, then your primary key is defined
in the wrong order: it should have partcode as the last field (= the
least significant one). Then, all your statement could use the primary
key, and you need no separate index.

[[...]]

Does anyone have any suggestions? My initial thought is to replace the
region and location varchar fields with int fields keyed to a separate
list of region and location names. Would that help, or is there a better
way?


Well, if your data are integer values, then using integer as column type
should speed up your operations considerably:
Operations (including comparisons) on integers are faster than on
character strings, and reduced data size means shorter (= faster)
transfers and more elements in caches (assuming same cache size).

At the moment, only the qty is an integer value. The others are strings(a variable-length alphanumeric for partcode, and county/town names forregion and location). For presentation reasons, I need to display thehuman-readable names of the region and location, but this could be doneby means of having the names in a separate table with integer ids that'sjoined to the main table when querying.


Mark
--
http://mark.goodge.co.uk - my pointless blog
http://www.good-stuff.co.uk - my less pointless stuff

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: Most efficient way of handling a large dataset

Reply via email to