On Jan 6, 2015 3:12 PM, "Michael Heaney" <mhea...@jcvi.org> wrote: > > I'm fairly new to Postgres, and have a design issue for which an array of integers might be a good solution. But I'd like to hear from the experts before proceeding down this path. > > Essentially, I'm trying to model the relationship between a group of biological samples and their genes. Each sample (there are ~10K of them now, with more coming) will have about 30,000 genes. Conversely, a particular gene may be present in almost all samples. > > So I've created the following table to handle the many-to-many relationship: > > create table sample_gene (id serial, sample_id int, gene_id int); > > which looks like this when populated: > > sample_id | gene_id > --------------------------------------- > 1 | 1 > 1 | 2 > ... > 1 | 30475 > 2 | 1 > 2 | 2 > ... > 2 | 29973 > 3 | 1 > etc. > > The table now contains hundreds of millions of rows (with many, many more to come). Join performance between samples and genes is quite slow, even with indexes on sample_id and gene_id. > > So it occurred to me: why not eliminate all the duplicate sample_id values by storing the gene_id's in an array, like so: > > create table sample_gene_array (id serial, sample_id int, gene_id int [] ); > > So now the table data looks like this: > > sample_id | gene_id [] > --------------------------------------- > 1 | [1:30475] > 2 | [1:29973] > etc. > > The new table is significantly smaller, and performance (using ANY[] ) is quite good. Nevertheless, I'm uneasy. I come from a Sybase ASE background, and so have no experience with arrays as datatypes. Is it okay to store 30K+ gene values in an array in the linking table (or maybe even in the sample table itself, thus eliminating the linking table)? Should I unnest the gene_id's first, before using them to join to the gene table? > > TIA for any guidance you can provide. Again, I'm a Postgres neophyte - but I'm in awe of the power and flexibility of this database, and wish that I'd started using it sooner. > > ------ > Michael Heaney > JCVI > > > > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general
Please provide sample queries so we can understand how you query the data.