"Jason E. Stewart" wrote: > > "Flemming Frandsen" <[EMAIL PROTECTED]> writes: > > > "Jason E. Stewart" wrote: > > > > > True. I just wanted to have a way to do it in C so that it would be > > > fast. I want to access arrays of 6 floats in tables with a million > > > rows, doing that using perl's split() and join() is going to be slow. > > > > Why use arrays at all? > > > > Why not normalize your data (put the n floats in another table) or add > > the 6 floats to the rows you are dealing with? > > Hey Flemming, > > I gave a ridiculously simple example just as an idea. I would *love* > to be able to normalize my data, but I'm having real trouble figuring > it out. Here's the problem I'm trying to solve. > > I'm building a tool that will enable scientists to load there > experimental data into a database. That data will come as spreadsheets > of data from scientists. Each group of scientists will use slightly > different technology to generate the data so those spreadsheets are > very likely to have different numbers of columns, and they will > certainly have different data types in the various columns, however > from one group, all the data should have the same format (or small set > of formats). > > So whatever solution I come up with, needs to be flexible and store > data no matter how many columns it has or what the data types for the > fields are. One complication is that there are likely to be millions > of rows from the spreadsheets, so I want it to be reasonable efficient > (no joins if possible). > > Variable length arrays seemed the obvious way to solve this. > > I just wanted to avoid having to create a new table for each > spreadsheet configuration. A small finite number of tables would be > fine, but I couldn't come up with a way. > > If you have other ideas, I'd love to hear them.
5 tables: 2 to describe the format:

  format
    id
    name

  field
    id
    format_id
    name

and 3 to hold the data:

  dataset
    id
    format_id

  sample
    id
    dataset_id

  value
    sample_id
    field_id
    value

*yes* this means that you will need to join value with sample to get
the rows (aka samples) of a dataset.

This may seem like a lot of work and you may think it's going to be
slow, but *do it right* first, then optimize if it turns out to be too
slow. If you need to optimize, you will have done your software design
so well that you can optimize the database use without impacting the
rest of the design.

If your database cannot scale to many millions of tuples in the value
table, then throw it out and use a database that works. I'd recommend
SAP DB; it's under the GPL and it's big enough to handle running SAP
R/3 on it (which has many gigs of data in 16000+ tables in a fresh
install).

In short: Do your design right, then fuck it up later.

-- 
Regards Flemming Frandsen aka. Dion/Swamp
http://dion.swamp.dk
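A rough, untested SQL sketch of the five tables above (PostgreSQL-flavoured
DDL; the text type for value and the literal dataset id 42 in the query are
placeholders, not anything specified in the thread):

  CREATE TABLE format (
      id     serial PRIMARY KEY,
      name   text   NOT NULL
  );

  CREATE TABLE field (
      id        serial  PRIMARY KEY,
      format_id integer NOT NULL REFERENCES format(id),
      name      text    NOT NULL
  );

  CREATE TABLE dataset (
      id        serial  PRIMARY KEY,
      format_id integer NOT NULL REFERENCES format(id)
  );

  CREATE TABLE sample (
      id         serial  PRIMARY KEY,
      dataset_id integer NOT NULL REFERENCES dataset(id)
  );

  -- value holds one cell of the original spreadsheet per row; text is
  -- used here only because the columns may have mixed types, a float8
  -- or numeric column works just as well for purely numeric data.
  -- Note: "value" is a reserved word in some SQL dialects and may need
  -- quoting or renaming.
  CREATE TABLE value (
      sample_id integer NOT NULL REFERENCES sample(id),
      field_id  integer NOT NULL REFERENCES field(id),
      value     text,
      PRIMARY KEY (sample_id, field_id)
  );

  -- reassemble the rows (samples) of one dataset, one output row per cell
  SELECT s.id AS sample_id, f.name AS field_name, v.value
    FROM sample s
    JOIN value  v ON v.sample_id = s.id
    JOIN field  f ON f.id = v.field_id
   WHERE s.dataset_id = 42
   ORDER BY s.id, f.id;

Each spreadsheet row becomes one sample plus one value row per column, so a
million-row, six-column sheet ends up as roughly six million value tuples;
that is the scale the "use a database that works" remark is about.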
