"Jason E. Stewart" wrote: > > "Flemming Frandsen" <[EMAIL PROTECTED]> writes: > > > "Jason E. Stewart" wrote: > > > > > True. I just wanted to have a way to do it in C so that it would be > > > fast. I want to access arrays of 6 floats in tables with a million > > > rows, doing that using perl's split() and join() is going to be slow. > > > > Why use arrays at all? > > > > Why not normalize your data (put the n floats in another table) or add > > the 6 floats to the rows you are dealing with? > > Hey Flemming, > > I gave a ridiculously simple example just as an idea. I would *love* > to be able to normalize my data, but I'm having real trouble figuring > it out. Here's the problem I'm trying to solve. > > I'm building a tool that will enable scientists to load there > experimental data into a database. That data will come as spreadsheets > of data from scientists. Each group of scientists will use slightly > different technology to generate the data so those spreadsheets are > very likely to have different numbers of columns, and they will > certainly have different data types in the various columns, however > from one group, all the data should have the same format (or small set > of formats). > > So whatever solution I come up with, needs to be flexible and store > data no matter how many columns it has or what the data types for the > fields are. One complication is that there are likely to be millions > of rows from the spreadsheets, so I want it to be reasonable efficient > (no joins if possible). > > Variable length arrays seemed the obvious way to solve this. > > I just wanted to avoid having to create a new table for each > spreadsheet configuration. A small finite number of tables would be > fine, but I couldn't come up with a way. > > If you have other ideas, I'd love to hear them.
5 tables: 2 to describe the format:

  format
    id
    name

  field
    id
    format_id
    name

and 3 to hold the data:

  dataset
    id
    format_id

  sample
    id
    dataset_id

  value
    sample_id
    field_id
    value

*yes* this means that you will need to join value with sample to get
the rows (aka samples) of a dataset.

This may seem like a lot of work and you may think it's going to be
slow, but *do it right* first, then optimize if it turns out to be too
slow. If you need to optimize, you will have done your software design
so well that you can optimize the database use without impacting the
rest of the design.

If your database cannot scale to many millions of tuples in the value
table, then throw it out and use a database that works. I'd recommend
SAP DB; it's under the GPL and it's big enough to handle running SAP
R/3 on it (which has many gigs of data in 16000+ tables in a fresh
install).

In short: Do your design right, then fuck it up later.

-- 
Regards Flemming Frandsen aka. Dion/Swamp
http://dion.swamp.dk
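A rough, untested SQL sketch of the five tables above (PostgreSQL-flavoured
DDL; the text type for value and the literal dataset id 42 in the query are
placeholders, not anything specified in the thread):

  CREATE TABLE format (
      id     serial PRIMARY KEY,
      name   text   NOT NULL
  );

  CREATE TABLE field (
      id        serial  PRIMARY KEY,
      format_id integer NOT NULL REFERENCES format(id),
      name      text    NOT NULL
  );

  CREATE TABLE dataset (
      id        serial  PRIMARY KEY,
      format_id integer NOT NULL REFERENCES format(id)
  );

  CREATE TABLE sample (
      id         serial  PRIMARY KEY,
      dataset_id integer NOT NULL REFERENCES dataset(id)
  );

  -- value holds one cell of the original spreadsheet per row; text is
  -- used here only because the columns may have mixed types, a float8
  -- or numeric column works just as well for purely numeric data.
  -- Note: "value" is a reserved word in some SQL dialects and may need
  -- quoting or renaming.
  CREATE TABLE value (
      sample_id integer NOT NULL REFERENCES sample(id),
      field_id  integer NOT NULL REFERENCES field(id),
      value     text,
      PRIMARY KEY (sample_id, field_id)
  );

  -- reassemble the rows (samples) of one dataset, one output row per cell
  SELECT s.id AS sample_id, f.name AS field_name, v.value
    FROM sample s
    JOIN value  v ON v.sample_id = s.id
    JOIN field  f ON f.id = v.field_id
   WHERE s.dataset_id = 42
   ORDER BY s.id, f.id;

Each spreadsheet row becomes one sample plus one value row per column, so a
million-row, six-column sheet ends up as roughly six million value tuples;
that is the scale the "use a database that works" remark is about.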
