"Rob McDonald" <[EMAIL PROTECTED]> wrote on 06/24/2005 10:37:39 AM:

> I'm running MySQL 4.0.18-nt & accessing the database primarily through 
Java
> with MySQL/Connector J (similar vintage).  I am administering this 
myself,
> and can upgrade as needed.  I know I'm a bit behind the times.

> I apologize for a bit of background, skip to the end for the 
question....

> I am using a database to track large numbers of computer analysis runs. 
The
> analyses are treated abstractly as black boxes (tasks) with n-inputs and
> m-outputs.  Inputs and outputs are both treated as quantities.  At the 
time
> of the database design, we have no idea how many inputs or outputs any 
task
> will have.  An output of one task may be an input to another task.

> So, there are tables for tasks (T) and quantities (Q), and then input 
T/Q
> and output T/Q tables to create the many-to-many pairing required.

> As cases are run, records are entered into a case (C) table, and the
> settings of the inputs and outputs from the run are stored in pairs in a
> Case/Quantity table.

> The general idea is to keep a record of many computer analysis runs and
> interpolate new results instead of running the code again.  So, if the
> computer analysis is something like...

> f = sin(x)*cos(y)

> And we'll pretend that this is a very expensive analysis to perform, we 
want
> to get as much out of the runs we've already made as possible.  Of 
course,
> in the real world, there may be a dozen or twenty input quantities and a
> similar number of outputs...

> I need to be able to perform nearest-neighbor type searches based on the
> Case/Quantity values for the inputs.  My plan was to just request _all_ 
the
> data from the database (making the query easy), storing the data in a 
K-D
> Tree in my application, performing the more detailed searches there. 
Then,
> I read a random (ancient) blog entry about R-Trees in MySQL 4.1
> (http://jeremy.zawodny.com/blog/archives/000418.html).  Which got me
> thinking....  Is there some way to more effectively store this type of 
data
> in MySQL.  I don't need R-Trees (though they'll work), my data are all
> points, not rectangles...

> So, is there some way to tell MySQL that I'll be doing these indirect
> searches on the values of the quantities, so that MySQL can do this much
> more efficiently?  I'd rather not change my database schema, but will
> consider it if required.

> Recall that we don't know the input dimensionality of the problem at the
> time the database is created.

> Thanks in advance,

> Rob

If I remember right, the R-TREES are associated with the GIS extensions to 
MySQL. I could be wrong but that's how I remember it (and I had a hard 
time finding a reference to them in the official online manual. Can anyone 
help?)

I guess as long as your _results_ are points of no more than two 
dimensions, you could try using the spatial extensions to store them. 
However, you make it sound as though you have an N-dimensional input space 
and an M-dimensional output space. I don't think the GIS extensions to 
MySQL cover the cases where M or N are greater than 2.

from http://dev.mysql.com/doc/mysql/en/gis-class-geometry.html
>>>>>>>>>>>>>>> (Describing the Geometry class)
...

All calculations are done assuming Euclidean (planar) geometry.

# Its coordinates in its Spatial Reference System, represented as 
double-precision (eight-byte) numbers. All non-empty geometries include at 
least one pair of (X,Y) coordinates. Empty geometries contain no 
coordinates.

...

A geometry can have a dimension of ?1, 0, 1, or 2
...
<<<<<<<<<<<<<<<

If I am right, you are asking for help to determine all of the points in 
an X-dimensional space that reside within a certain radius of a particular 
point, or the nearest C points to a particular location in X-dimensional 
space. 

One way to minimize your search targets T(0...n) is to define an 
X-dimensional hypercube around your target point P(0) by taking your 
original coordinates and looking +/- some value for each dimension. Then, 
you would compute the distance from your P(0) to each T(x), sort the 
results and pick the "nearest" by limiting to however many you wanted. 

In my opinion, you can store the data two different ways: flat or 
normalized. In the flat model, you create one column for each dimension. 
This is sometimes faster to work with but takes up more room. The 
down-side to this arrangement is: should you ever need to increase the 
number of dimensions you are storing, you will have to change the table 
structure to do it. The normalized model creates list-pairs (I think this 
is what you are doing) and can consume as much room as the flat model, 
depending on the density of your high dimensional data. 

What I mean is that if you create a "results" table to store up to 25 
dimensions, and most of your data only had 8, you have 17 "blank" 
dimensions being stored for each row of data. This kind of situation would 
be optimized by using the Normalized data method.  However, if you 
frequently store 20 dimensions or more for each data point then the flat 
table is more space efficient because you eliminate the duplication of the 
"parent_id" field for each row.

As I said, it's all dependent on how flexible you need your results to be 
and the nature of your data that will determine which method is more 
flexible.

The flat table model has another drawback, there is a limit to how many 
columns you can index on any one table. With the normalized data, you 
shouldn't run into that problem.

Am I on the right track or have I lost my way?

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Reply via email to