"Rob McDonald" <[EMAIL PROTECTED]> wrote on 06/24/2005 10:37:39 AM:
> I'm running MySQL 4.0.18-nt & accessing the database primarily through Java > with MySQL/Connector J (similar vintage). I am administering this myself, > and can upgrade as needed. I know I'm a bit behind the times. > I apologize for a bit of background, skip to the end for the question.... > I am using a database to track large numbers of computer analysis runs. The > analyses are treated abstractly as black boxes (tasks) with n-inputs and > m-outputs. Inputs and outputs are both treated as quantities. At the time > of the database design, we have no idea how many inputs or outputs any task > will have. An output of one task may be an input to another task. > So, there are tables for tasks (T) and quantities (Q), and then input T/Q > and output T/Q tables to create the many-to-many pairing required. > As cases are run, records are entered into a case (C) table, and the > settings of the inputs and outputs from the run are stored in pairs in a > Case/Quantity table. > The general idea is to keep a record of many computer analysis runs and > interpolate new results instead of running the code again. So, if the > computer analysis is something like... > f = sin(x)*cos(y) > And we'll pretend that this is a very expensive analysis to perform, we want > to get as much out of the runs we've already made as possible. Of course, > in the real world, there may be a dozen or twenty input quantities and a > similar number of outputs... > I need to be able to perform nearest-neighbor type searches based on the > Case/Quantity values for the inputs. My plan was to just request _all_ the > data from the database (making the query easy), storing the data in a K-D > Tree in my application, performing the more detailed searches there. Then, > I read a random (ancient) blog entry about R-Trees in MySQL 4.1 > (http://jeremy.zawodny.com/blog/archives/000418.html). Which got me > thinking.... Is there some way to more effectively store this type of data > in MySQL. I don't need R-Trees (though they'll work), my data are all > points, not rectangles... > So, is there some way to tell MySQL that I'll be doing these indirect > searches on the values of the quantities, so that MySQL can do this much > more efficiently? I'd rather not change my database schema, but will > consider it if required. > Recall that we don't know the input dimensionality of the problem at the > time the database is created. > Thanks in advance, > Rob If I remember right, the R-TREES are associated with the GIS extensions to MySQL. I could be wrong but that's how I remember it (and I had a hard time finding a reference to them in the official online manual. Can anyone help?) I guess as long as your _results_ are points of no more than two dimensions, you could try using the spatial extensions to store them. However, you make it sound as though you have an N-dimensional input space and an M-dimensional output space. I don't think the GIS extensions to MySQL cover the cases where M or N are greater than 2. from http://dev.mysql.com/doc/mysql/en/gis-class-geometry.html >>>>>>>>>>>>>>> (Describing the Geometry class) ... All calculations are done assuming Euclidean (planar) geometry. # Its coordinates in its Spatial Reference System, represented as double-precision (eight-byte) numbers. All non-empty geometries include at least one pair of (X,Y) coordinates. Empty geometries contain no coordinates. ... A geometry can have a dimension of ?1, 0, 1, or 2 ... <<<<<<<<<<<<<<< If I am right, you are asking for help to determine all of the points in an X-dimensional space that reside within a certain radius of a particular point, or the nearest C points to a particular location in X-dimensional space. One way to minimize your search targets T(0...n) is to define an X-dimensional hypercube around your target point P(0) by taking your original coordinates and looking +/- some value for each dimension. Then, you would compute the distance from your P(0) to each T(x), sort the results and pick the "nearest" by limiting to however many you wanted. In my opinion, you can store the data two different ways: flat or normalized. In the flat model, you create one column for each dimension. This is sometimes faster to work with but takes up more room. The down-side to this arrangement is: should you ever need to increase the number of dimensions you are storing, you will have to change the table structure to do it. The normalized model creates list-pairs (I think this is what you are doing) and can consume as much room as the flat model, depending on the density of your high dimensional data. What I mean is that if you create a "results" table to store up to 25 dimensions, and most of your data only had 8, you have 17 "blank" dimensions being stored for each row of data. This kind of situation would be optimized by using the Normalized data method. However, if you frequently store 20 dimensions or more for each data point then the flat table is more space efficient because you eliminate the duplication of the "parent_id" field for each row. As I said, it's all dependent on how flexible you need your results to be and the nature of your data that will determine which method is more flexible. The flat table model has another drawback, there is a limit to how many columns you can index on any one table. With the normalized data, you shouldn't run into that problem. Am I on the right track or have I lost my way? Shawn Green Database Administrator Unimin Corporation - Spruce Pine