subject:"Re\: \[GENERAL\] Best way to handle multi\-billion row read\-only table\?"

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-10 Thread Asher


John R Pierce wrote:
how do you plan on accessing this monster data?  do you expect to be 
looking up single values or small set of values at a specific time? 
seems to me like this is the sort of data thats more often processed in 
the aggregate, like running a fourier analysis of sliding windows, and 
that sort of data processing may well be more efficiently done with 
fixed block binary files rather than relational databases, as there's no 
real relationships in this data.


The data will initially be accessed via a simple GUI which will allow 
browsing over a subset of the data (subsampled down to 1 
sample/minute/hour, etc. during the data load phase and so massively 
smaller datasets) and then once something interesting has been found 
manually (fully automatic detection of transients has to wait until 
we've formally described what we mean by transient :-)) the start and 
end times can be handed over to our automatic processing code to go 
through the full dataset.


I did consider just sticking the data into a series of big dumb files 
but by putting them in a DB I can both maintain automatic links between 
the full and subsampled data sets and between each data point and the 
equipment that measured it and, possibly more importantly, I can provide 
a simpler interface to the other people on my project to access the 
data. I'm a computer scientist but I'm doing my PhD in the Civil 
Engineering dept and all of my colleagues are civil engineers - all 
quite happy using Matlab's database plugin but less happy writing 
traditional code to crunch through raw files. I'm aware that I'm taking 
a, possibly quite large, performance hit by using a database but I'm 
hoping that the advantages will outweigh this.


Many thanks for all the replies to my query. I'm going to go with a 
partitioned table design and start uploading some data. I'll post how it 
performs once I've got some real size data in it.



Asher.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-10 Thread Greg Stark

On Wed, Feb 10, 2010 at 2:32 PM, Asher as...@piceur.co.uk wrote:
 The data will initially be accessed via a simple GUI which will allow
 browsing over a subset of the data (subsampled down to 1 sample/minute/hour,
 etc.

It sounds like you could use a tool like rrd that keeps various levels
of aggregation and intelligently chooses the right level for the given
query. I think there are such tools though I'm not sure there are any
free ones.

-- 
greg

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-10 Thread Vincenzo Romano

2010/2/10 Greg Stark gsst...@mit.edu:
 On Wed, Feb 10, 2010 at 2:32 PM, Asher as...@piceur.co.uk wrote:
 The data will initially be accessed via a simple GUI which will allow
 browsing over a subset of the data (subsampled down to 1 sample/minute/hour,
 etc.

 It sounds like you could use a tool like rrd that keeps various levels
 of aggregation and intelligently chooses the right level for the given
 query. I think there are such tools though I'm not sure there are any
 free ones.

Use as much memory as possible to fit indexes as well as portions of
the table space itself in RAM.
Of course, poor indexing can kill any effort.

-- 
Vincenzo Romano
NotOrAnd Information Technologies
NON QVIETIS MARIBVS NAVTA PERITVS

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread Justin Graf

On 2/9/2010 12:47 PM, Asher wrote:
 Hello.

 I'm putting together a database to store the readings from various 
 measurement devices for later processing. Since these things (water 
 pressure monitors attached to very large water pipes) take readings at 
 200Hz and are typically deployed over multiple sites for several 
 months at a time I've got many billions of rows of data, each (at the 
 moment) with the following simple format:

 value REAL NOT NULL,
 sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
 channel INTEGER REFERENCES channel(id) NOT NULL

 (Where the channel table contains metadata to identify the 
 particular sensor, data logger, etc. used to obtain the data and the 
 combination of channel and sample_time is unique.)



Well first is that 200hz  meaning 200 samples per channel per second.   
That is very fast sampling for pressure sensor,  I would be surprised if 
the meters  are actually giving real results at that rate.  I would look 
at reducing that down to what the meter is actual capable of sending  
What kind of AD card is being used as this effects what makes sense to 
record.

I would look into table partitioning
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
http://wiki.postgresql.org/wiki/Table_partitioning

A one big index for such a small record will not be a big win because 
the index are going to be the same size as table.
Look into limiting the number of records each index covers.
http://www.postgresql.org/docs/8.4/static/sql-createindex.html

All legitimate Magwerks Corporation quotations are sent in a .PDF file 
attachment with a unique ID number generated by our proprietary quotation 
system. Quotations received via any other form of communication will not be 
honored.

CONFIDENTIALITY NOTICE: This e-mail, including attachments, may contain legally 
privileged, confidential or other information proprietary to Magwerks 
Corporation and is intended solely for the use of the individual to whom it 
addresses. If the reader of this e-mail is not the intended recipient or 
authorized agent, the reader is hereby notified that any unauthorized viewing, 
dissemination, distribution or copying of this e-mail is strictly prohibited. 
If you have received this e-mail in error, please notify the sender by replying 
to this message and destroy all occurrences of this e-mail immediately.
Thank you.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread Alex Thurlow

I've only gone up to about a billion rows, but table partitioning seems 
to be the way to go to me.  I did per-day partitioning, and just had the 
job that inserts the daily data add the table automatically.  With the 
partitioning, it only has to pull up the tables for the specific days, 
and is therefore a kind of date index already.


Alex Thurlow
Blastro Networks

http://www.blastro.com
http://www.roxwel.com
http://www.yallwire.com


On 2/9/2010 11:47 AM, Asher wrote:

Hello.

I'm putting together a database to store the readings from various 
measurement devices for later processing. Since these things (water 
pressure monitors attached to very large water pipes) take readings at 
200Hz and are typically deployed over multiple sites for several 
months at a time I've got many billions of rows of data, each (at the 
moment) with the following simple format:


value REAL NOT NULL,
sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
channel INTEGER REFERENCES channel(id) NOT NULL

(Where the channel table contains metadata to identify the 
particular sensor, data logger, etc. used to obtain the data and the 
combination of channel and sample_time is unique.)


Once loaded into the database the data will never be deleted or 
modified and will typically be accessed over a particular date range 
for a particular channel (e.g. sample_time = X AND sample_time = Y 
AND channel=Z). A typical query won't return more than a few million 
rows and speed is not desperately important (as long as the time is 
measured in minutes rather than hours).


Are there any recommended ways to organise this? Should I partition my 
big table into multiple smaller ones which will always fit in memory 
(this would result in several hundreds or thousands of sub-tables)? 
Are there any ways to keep the index size to a minimum? At the moment 
I have a few weeks of data, about 180GB, loaded into a single table 
and indexed on sample_time and channel and the index takes up 180GB too.


Since this is all for a typically budget-restricted PhD project the 
hardware is just a high-end desktop workstation with (at the moment) 
2*2TB drives organised into a single 4TB partition using FreeBSD's 
vinum system.



Many thanks for any help,
Asher.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread Asher Hoskins


Justin Graf wrote:
Well first is that 200hz  meaning 200 samples per channel per second.   
That is very fast sampling for pressure sensor,  I would be surprised if 
the meters  are actually giving real results at that rate.  I would look 
at reducing that down to what the meter is actual capable of sending  
What kind of AD card is being used as this effects what makes sense to 
record.


Yes, we really are measuring at 200 samples per second. We're trying to 
capture high resolution images of pressure transients as they move along 
water distribution pipelines (the backbones of the water network, 
typically 4'-6' in diameter, carrying 500-1000 litres/second) to 
understand how they travel and what stress they put upon the pipe. We're 
using custom data loggers at the moment based around Intel iMote2 Linux 
systems with a high-speed QuickFilter ADC (and the sensors we're using 
can cope at 200Hz).



I would look into table partitioning
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
http://wiki.postgresql.org/wiki/Table_partitioning


Thanks for that, it looks like partitioning is the way to go. I'm 
assuming that I should try and keep my total_relation_sizes less than 
the memory size of the machine?


A one big index for such a small record will not be a big win because 
the index are going to be the same size as table.

Look into limiting the number of records each index covers.
http://www.postgresql.org/docs/8.4/static/sql-createindex.html


If I partition so that each partition holds data for a single channel 
(and set a CHECK constraint for this) then I can presumably remove the 
channel from the index since constraint exclusion will mean that only 
partitions holding the channel I'm interested in will be searched in a 
query. Given that within a partition all of my sample_time's will be 
different do you know if there's a more efficient way to index these?



Many thanks,
Asher


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread Scott Marlowe

On Tue, Feb 9, 2010 at 10:47 AM, Asher as...@piceur.co.uk wrote:
 Hello.

 I'm putting together a database to store the readings from various
 measurement devices for later processing. Since these things (water pressure
 monitors attached to very large water pipes) take readings at 200Hz and are
 typically deployed over multiple sites for several months at a time I've got
 many billions of rows of data, each (at the moment) with the following
 simple format:

        value REAL NOT NULL,
        sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
        channel INTEGER REFERENCES channel(id) NOT NULL

 (Where the channel table contains metadata to identify the particular
 sensor, data logger, etc. used to obtain the data and the combination of
 channel and sample_time is unique.)

 Once loaded into the database the data will never be deleted or modified and
 will typically be accessed over a particular date range for a particular
 channel (e.g. sample_time = X AND sample_time = Y AND channel=Z). A
 typical query won't return more than a few million rows and speed is not
 desperately important (as long as the time is measured in minutes rather
 than hours).

 Are there any recommended ways to organise this? Should I partition my big
 table into multiple smaller ones which will always fit in memory (this would
 result in several hundreds or thousands of sub-tables)?

Partitioning is the standard way.  Note that you can partition on 1
axis, or 2, or more.  In this case partitioning on time and channel
might make the most sense.

 Are there any ways
 to keep the index size to a minimum? At the moment I have a few weeks of
 data, about 180GB, loaded into a single table and indexed on sample_time and
 channel and the index takes up 180GB too.

It may be that with small enough partitions indexes aren't really
needed.  That's been the case for a lot of data I've worked with in
the past.

 Since this is all for a typically budget-restricted PhD project the hardware
 is just a high-end desktop workstation with (at the moment) 2*2TB drives
 organised into a single 4TB partition using FreeBSD's vinum system.

Partitioning should definitely help.  You might want to go with RAID-1
instead of RAID-0 since the read performance is similar under most
modern OSes.  I know linux now aggregates the two drives together to
read, I'd assume BSD does too.  That way you've got better reliability
and about the same performance.  Load times will be about double, but
that's a one time thing, right?

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread John R Pierce


Asher Hoskins wrote:
If I partition so that each partition holds data for a single channel 
(and set a CHECK constraint for this) then I can presumably remove the 
channel from the index since constraint exclusion will mean that only 
partitions holding the channel I'm interested in will be searched in a 
query. Given that within a partition all of my sample_time's will be 
different do you know if there's a more efficient way to index these?


how do you plan on accessing this monster data?  do you expect to be 
looking up single values or small set of values at a specific time?  

seems to me like this is the sort of data thats more often processed in 
the aggregate, like running a fourier analysis of sliding windows, and 
that sort of data processing may well be more efficiently done with 
fixed block binary files rather than relational databases, as there's no 
real relationships in this data.


for instance, a directory for each sensor, with a directory for each 
week, and a file for each hour, containing the hours worth of samples in 
fixed binary blocks after a file header identifying it.  you can random 
access a specific time sample by using fseek 
(sampletime-starttimeofblock) * blocksize + headersize or whatever.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread Justin Graf

On 2/9/2010 4:41 PM, Asher Hoskins wrote:

 Thanks for that, it looks like partitioning is the way to go. I'm 
 assuming that I should try and keep my total_relation_sizes less than 
 the memory size of the machine?
This depends on what the quires look like.  As other have stated when 
partitioning you have to consider how the data is quired.



 If I partition so that each partition holds data for a single channel 
 (and set a CHECK constraint for this) then I can presumably remove the 
 channel from the index since constraint exclusion will mean that only 
 partitions holding the channel I'm interested in will be searched in a 
 query. Given that within a partition all of my sample_time's will be 
 different do you know if there's a more efficient way to index these?
Given the timestamp will most likely  be the where clause,  NO  on the 
plus side its only 8 bytes



All legitimate Magwerks Corporation quotations are sent in a .PDF file 
attachment with a unique ID number generated by our proprietary quotation 
system. Quotations received via any other form of communication will not be 
honored.

CONFIDENTIALITY NOTICE: This e-mail, including attachments, may contain legally 
privileged, confidential or other information proprietary to Magwerks 
Corporation and is intended solely for the use of the individual to whom it 
addresses. If the reader of this e-mail is not the intended recipient or 
authorized agent, the reader is hereby notified that any unauthorized viewing, 
dissemination, distribution or copying of this e-mail is strictly prohibited. 
If you have received this e-mail in error, please notify the sender by replying 
to this message and destroy all occurrences of this e-mail immediately.
Thank you.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread BillR

Is it also possible to denormalize by putting the 'channel' data in the
first table (especially if it isn't very much)? Maintaining a foreign key
constraint can impact performance significantly in most RDBMS's, even when
deferring checking. I could be wrong, but I suspect PostgreSQL is no
different. Or keep the data normalized and  remove the constraint
altogether. Also remove any primary key constraint so that it doesn't have
to check uniqueness, and avoid as many indexes as you can. 

You have to take a leap of faith that you created your program well enough
to not get out of sync. 

I would be interested to hear comments on this. These are some of the things
we did on systems I have worked on running Oracle that handled even higher
volumes (tens to hundreds of thousands of transactions per second or
higher... sustained throughout the day at least on the lower volume).
Granted we had real heavy hardware but the DBAs forbade us to create
constraints and indexes etc. for this reason; except on less active tables.
Everyone has already talked about partitioning, but load balancing across
machines if you can afford a couple or few more could help too. Not sure
what facility Postgres has for this though (I would be interested to hear
comments on this too! :-)

BillR

-Original Message-
From: pgsql-general-ow...@postgresql.org
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Asher
Sent: February-09-10 12:47 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] Best way to handle multi-billion row read-only table?

Hello.

I'm putting together a database to store the readings from various 
measurement devices for later processing. Since these things (water 
pressure monitors attached to very large water pipes) take readings at 
200Hz and are typically deployed over multiple sites for several months 
at a time I've got many billions of rows of data, each (at the moment) 
with the following simple format:

value REAL NOT NULL,
sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
channel INTEGER REFERENCES channel(id) NOT NULL

(Where the channel table contains metadata to identify the particular 
sensor, data logger, etc. used to obtain the data and the combination of 
channel and sample_time is unique.)

Once loaded into the database the data will never be deleted or modified 
and will typically be accessed over a particular date range for a 
particular channel (e.g. sample_time = X AND sample_time = Y AND 
channel=Z). A typical query won't return more than a few million rows 
and speed is not desperately important (as long as the time is measured 
in minutes rather than hours).

Are there any recommended ways to organise this? Should I partition my 
big table into multiple smaller ones which will always fit in memory 
(this would result in several hundreds or thousands of sub-tables)? Are 
there any ways to keep the index size to a minimum? At the moment I have 
a few weeks of data, about 180GB, loaded into a single table and indexed 
on sample_time and channel and the index takes up 180GB too.

Since this is all for a typically budget-restricted PhD project the 
hardware is just a high-end desktop workstation with (at the moment) 
2*2TB drives organised into a single 4TB partition using FreeBSD's vinum 
system.


Many thanks for any help,
Asher.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Spam/Virus scanning by CanIt Pro

For more information see
http://www.kgbinternet.com/SpamFilter.htm

To control your spam filter, log in at
http://filter.kgbinternet.com


-- 
BEGIN-ANTISPAM-VOTING-LINKS
--

Teach CanIt if this mail (ID 80270060) is spam:
Spam:
http://filter.kgbinternet.com/canit/b.php?i=80270060m=5d99840e72f9t=201002
09c=s
Not spam:
http://filter.kgbinternet.com/canit/b.php?i=80270060m=5d99840e72f9t=201002
09c=n
Forget vote:
http://filter.kgbinternet.com/canit/b.php?i=80270060m=5d99840e72f9t=201002
09c=f
--
END-ANTISPAM-VOTING-LINKS

__ Information from ESET Smart Security, version of virus signature
database 4852 (20100209) __

The message was checked by ESET Smart Security.

http://www.eset.com




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread Scott Marlowe

On Tue, Feb 9, 2010 at 8:06 PM, BillR iamb...@williamrosmus.com wrote:
 Is it also possible to denormalize by putting the 'channel' data in the
 first table (especially if it isn't very much)? Maintaining a foreign key
 constraint can impact performance significantly in most RDBMS's, even when
 deferring checking. I could be wrong, but I suspect PostgreSQL is no
 different. Or keep the data normalized and  remove the constraint
 altogether. Also remove any primary key constraint so that it doesn't have
 to check uniqueness, and avoid as many indexes as you can.

 You have to take a leap of faith that you created your program well enough
 to not get out of sync.

 I would be interested to hear comments on this. These are some of the things
 we did on systems I have worked on running Oracle that handled even higher
 volumes (tens to hundreds of thousands of transactions per second or
 higher... sustained throughout the day at least on the lower volume).
 Granted we had real heavy hardware but the DBAs forbade us to create
 constraints and indexes etc. for this reason; except on less active tables.
 Everyone has already talked about partitioning, but load balancing across
 machines if you can afford a couple or few more could help too. Not sure
 what facility Postgres has for this though (I would be interested to hear

I was under the impression the data was being gathered elsewhere and
then imported, so the insert performance isn't as critical as if it
was being done real time.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread Brent Wood

If you will be selecting sets of data within a time range, it should also 
improve performance if you can build a clustered index on the sample_time. It 
may also be worth looking at whether partitioning by timestamp  channel offers 
any advantages.

Brent Wood


Brent Wood
DBA/GIS consultant
NIWA, Wellington
New Zealand
 Justin Graf  02/10/10 3:07 PM 
On 2/9/2010 4:41 PM, Asher Hoskins wrote:

 Thanks for that, it looks like partitioning is the way to go. I'm 
 assuming that I should try and keep my total_relation_sizes less than 
 the memory size of the machine?
This depends on what the quires look like.  As other have stated when 
partitioning you have to consider how the data is quired.



 If I partition so that each partition holds data for a single channel 
 (and set a CHECK constraint for this) then I can presumably remove the 
 channel from the index since constraint exclusion will mean that only 
 partitions holding the channel I'm interested in will be searched in a 
 query. Given that within a partition all of my sample_time's will be 
 different do you know if there's a more efficient way to index these?
Given the timestamp will most likely  be the where clause,  NO  on the 
plus side its only 8 bytes



All legitimate Magwerks Corporation quotations are sent in a .PDF file 
attachment with a unique ID number generated by our proprietary quotation 
system. Quotations received via any other form of communication will not be 
honored.

CONFIDENTIALITY NOTICE: This e-mail, including attachments, may contain legally 
privileged, confidential or other information proprietary to Magwerks 
Corporation and is intended solely for the use of the individual to whom it 
addresses. If the reader of this e-mail is not the intended recipient or 
authorized agent, the reader is hereby notified that any unauthorized viewing, 
dissemination, distribution or copying of this e-mail is strictly prohibited. 
If you have received this e-mail in error, please notify the sender by replying 
to this message and destroy all occurrences of this e-mail immediately.
Thank you.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

NIWA is the trading name of the National Institute of Water  Atmospheric 
Research Ltd.

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread Greg Smith


Asher wrote:
Once loaded into the database the data will never be deleted or 
modified and will typically be accessed over a particular date range 
for a particular channel (e.g. sample_time = X AND sample_time = Y 
AND channel=Z). A typical query won't return more than a few million 
rows and speed is not desperately important (as long as the time is 
measured in minutes rather than hours).


Are there any recommended ways to organise this? Should I partition my 
big table into multiple smaller ones which will always fit in memory 
(this would result in several hundreds or thousands of sub-tables)? 
Are there any ways to keep the index size to a minimum? At the moment 
I have a few weeks of data, about 180GB, loaded into a single table 
and indexed on sample_time and channel and the index takes up 180GB too.


One approach to consider is partitioning by sample_time and not even 
including the channel number in the index.  You've got tiny records; 
there's going to be hundreds of channels of data on each data page 
pulled in, right?  Why not minimize physical I/O by reducing the index 
size, just read that whole section of time in to memory (they should be 
pretty closely clustered and therefore mostly sequential I/O), and then 
push the filtering by channel onto the CPU instead.  If you've got 
billions of rows, you're going to end up disk bound anyway; minimizing 
physical I/O and random seeking around at the expense of CPU time could 
be a big win.


--
Greg Smith2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

2010-02-09 Thread Scott Marlowe

On Tue, Feb 9, 2010 at 11:51 PM, Greg Smith g...@2ndquadrant.com wrote:
 Asher wrote:

 Once loaded into the database the data will never be deleted or modified
 and will typically be accessed over a particular date range for a particular
 channel (e.g. sample_time = X AND sample_time = Y AND channel=Z). A
 typical query won't return more than a few million rows and speed is not
 desperately important (as long as the time is measured in minutes rather
 than hours).

 Are there any recommended ways to organise this? Should I partition my big
 table into multiple smaller ones which will always fit in memory (this would
 result in several hundreds or thousands of sub-tables)? Are there any ways
 to keep the index size to a minimum? At the moment I have a few weeks of
 data, about 180GB, loaded into a single table and indexed on sample_time and
 channel and the index takes up 180GB too.

 One approach to consider is partitioning by sample_time and not even
 including the channel number in the index.  You've got tiny records; there's
 going to be hundreds of channels of data on each data page pulled in, right?
  Why not minimize physical I/O by reducing the index size, just read that
 whole section of time in to memory (they should be pretty closely clustered
 and therefore mostly sequential I/O), and then push the filtering by channel
 onto the CPU instead.  If you've got billions of rows, you're going to end
 up disk bound anyway; minimizing physical I/O and random seeking around at
 the expense of CPU time could be a big win.

If they're put in in a one time load, load them in channel order into
the partitions, and the stats should see the perfect ordering and know
to seq scan the right part of the table.  I think.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

Re: [GENERAL] Best way to handle multi-billion row read-only table?

14 matches

Site Navigation

Mail list logo

Footer information