Re: [GENERAL] Best way to handle multi-billion row read-only table?
John R Pierce wrote: how do you plan on accessing this monster data? do you expect to be looking up single values or small set of values at a specific time? seems to me like this is the sort of data thats more often processed in the aggregate, like running a fourier analysis of sliding windows, and that sort of data processing may well be more efficiently done with fixed block binary files rather than relational databases, as there's no real relationships in this data. The data will initially be accessed via a simple GUI which will allow browsing over a subset of the data (subsampled down to 1 sample/minute/hour, etc. during the data load phase and so massively smaller datasets) and then once something interesting has been found manually (fully automatic detection of transients has to wait until we've formally described what we mean by transient :-)) the start and end times can be handed over to our automatic processing code to go through the full dataset. I did consider just sticking the data into a series of big dumb files but by putting them in a DB I can both maintain automatic links between the full and subsampled data sets and between each data point and the equipment that measured it and, possibly more importantly, I can provide a simpler interface to the other people on my project to access the data. I'm a computer scientist but I'm doing my PhD in the Civil Engineering dept and all of my colleagues are civil engineers - all quite happy using Matlab's database plugin but less happy writing traditional code to crunch through raw files. I'm aware that I'm taking a, possibly quite large, performance hit by using a database but I'm hoping that the advantages will outweigh this. Many thanks for all the replies to my query. I'm going to go with a partitioned table design and start uploading some data. I'll post how it performs once I've got some real size data in it. Asher. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
On Wed, Feb 10, 2010 at 2:32 PM, Asher as...@piceur.co.uk wrote: The data will initially be accessed via a simple GUI which will allow browsing over a subset of the data (subsampled down to 1 sample/minute/hour, etc. It sounds like you could use a tool like rrd that keeps various levels of aggregation and intelligently chooses the right level for the given query. I think there are such tools though I'm not sure there are any free ones. -- greg -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
2010/2/10 Greg Stark gsst...@mit.edu: On Wed, Feb 10, 2010 at 2:32 PM, Asher as...@piceur.co.uk wrote: The data will initially be accessed via a simple GUI which will allow browsing over a subset of the data (subsampled down to 1 sample/minute/hour, etc. It sounds like you could use a tool like rrd that keeps various levels of aggregation and intelligently chooses the right level for the given query. I think there are such tools though I'm not sure there are any free ones. Use as much memory as possible to fit indexes as well as portions of the table space itself in RAM. Of course, poor indexing can kill any effort. -- Vincenzo Romano NotOrAnd Information Technologies NON QVIETIS MARIBVS NAVTA PERITVS -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
On 2/9/2010 12:47 PM, Asher wrote: Hello. I'm putting together a database to store the readings from various measurement devices for later processing. Since these things (water pressure monitors attached to very large water pipes) take readings at 200Hz and are typically deployed over multiple sites for several months at a time I've got many billions of rows of data, each (at the moment) with the following simple format: value REAL NOT NULL, sample_time TIMESTAMP WITH TIME ZONE NOT NULL, channel INTEGER REFERENCES channel(id) NOT NULL (Where the channel table contains metadata to identify the particular sensor, data logger, etc. used to obtain the data and the combination of channel and sample_time is unique.) Well first is that 200hz meaning 200 samples per channel per second. That is very fast sampling for pressure sensor, I would be surprised if the meters are actually giving real results at that rate. I would look at reducing that down to what the meter is actual capable of sending What kind of AD card is being used as this effects what makes sense to record. I would look into table partitioning http://www.postgresql.org/docs/current/static/ddl-partitioning.html http://wiki.postgresql.org/wiki/Table_partitioning A one big index for such a small record will not be a big win because the index are going to be the same size as table. Look into limiting the number of records each index covers. http://www.postgresql.org/docs/8.4/static/sql-createindex.html All legitimate Magwerks Corporation quotations are sent in a .PDF file attachment with a unique ID number generated by our proprietary quotation system. Quotations received via any other form of communication will not be honored. CONFIDENTIALITY NOTICE: This e-mail, including attachments, may contain legally privileged, confidential or other information proprietary to Magwerks Corporation and is intended solely for the use of the individual to whom it addresses. If the reader of this e-mail is not the intended recipient or authorized agent, the reader is hereby notified that any unauthorized viewing, dissemination, distribution or copying of this e-mail is strictly prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and destroy all occurrences of this e-mail immediately. Thank you. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
I've only gone up to about a billion rows, but table partitioning seems to be the way to go to me. I did per-day partitioning, and just had the job that inserts the daily data add the table automatically. With the partitioning, it only has to pull up the tables for the specific days, and is therefore a kind of date index already. Alex Thurlow Blastro Networks http://www.blastro.com http://www.roxwel.com http://www.yallwire.com On 2/9/2010 11:47 AM, Asher wrote: Hello. I'm putting together a database to store the readings from various measurement devices for later processing. Since these things (water pressure monitors attached to very large water pipes) take readings at 200Hz and are typically deployed over multiple sites for several months at a time I've got many billions of rows of data, each (at the moment) with the following simple format: value REAL NOT NULL, sample_time TIMESTAMP WITH TIME ZONE NOT NULL, channel INTEGER REFERENCES channel(id) NOT NULL (Where the channel table contains metadata to identify the particular sensor, data logger, etc. used to obtain the data and the combination of channel and sample_time is unique.) Once loaded into the database the data will never be deleted or modified and will typically be accessed over a particular date range for a particular channel (e.g. sample_time = X AND sample_time = Y AND channel=Z). A typical query won't return more than a few million rows and speed is not desperately important (as long as the time is measured in minutes rather than hours). Are there any recommended ways to organise this? Should I partition my big table into multiple smaller ones which will always fit in memory (this would result in several hundreds or thousands of sub-tables)? Are there any ways to keep the index size to a minimum? At the moment I have a few weeks of data, about 180GB, loaded into a single table and indexed on sample_time and channel and the index takes up 180GB too. Since this is all for a typically budget-restricted PhD project the hardware is just a high-end desktop workstation with (at the moment) 2*2TB drives organised into a single 4TB partition using FreeBSD's vinum system. Many thanks for any help, Asher. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
Justin Graf wrote: Well first is that 200hz meaning 200 samples per channel per second. That is very fast sampling for pressure sensor, I would be surprised if the meters are actually giving real results at that rate. I would look at reducing that down to what the meter is actual capable of sending What kind of AD card is being used as this effects what makes sense to record. Yes, we really are measuring at 200 samples per second. We're trying to capture high resolution images of pressure transients as they move along water distribution pipelines (the backbones of the water network, typically 4'-6' in diameter, carrying 500-1000 litres/second) to understand how they travel and what stress they put upon the pipe. We're using custom data loggers at the moment based around Intel iMote2 Linux systems with a high-speed QuickFilter ADC (and the sensors we're using can cope at 200Hz). I would look into table partitioning http://www.postgresql.org/docs/current/static/ddl-partitioning.html http://wiki.postgresql.org/wiki/Table_partitioning Thanks for that, it looks like partitioning is the way to go. I'm assuming that I should try and keep my total_relation_sizes less than the memory size of the machine? A one big index for such a small record will not be a big win because the index are going to be the same size as table. Look into limiting the number of records each index covers. http://www.postgresql.org/docs/8.4/static/sql-createindex.html If I partition so that each partition holds data for a single channel (and set a CHECK constraint for this) then I can presumably remove the channel from the index since constraint exclusion will mean that only partitions holding the channel I'm interested in will be searched in a query. Given that within a partition all of my sample_time's will be different do you know if there's a more efficient way to index these? Many thanks, Asher -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
On Tue, Feb 9, 2010 at 10:47 AM, Asher as...@piceur.co.uk wrote: Hello. I'm putting together a database to store the readings from various measurement devices for later processing. Since these things (water pressure monitors attached to very large water pipes) take readings at 200Hz and are typically deployed over multiple sites for several months at a time I've got many billions of rows of data, each (at the moment) with the following simple format: value REAL NOT NULL, sample_time TIMESTAMP WITH TIME ZONE NOT NULL, channel INTEGER REFERENCES channel(id) NOT NULL (Where the channel table contains metadata to identify the particular sensor, data logger, etc. used to obtain the data and the combination of channel and sample_time is unique.) Once loaded into the database the data will never be deleted or modified and will typically be accessed over a particular date range for a particular channel (e.g. sample_time = X AND sample_time = Y AND channel=Z). A typical query won't return more than a few million rows and speed is not desperately important (as long as the time is measured in minutes rather than hours). Are there any recommended ways to organise this? Should I partition my big table into multiple smaller ones which will always fit in memory (this would result in several hundreds or thousands of sub-tables)? Partitioning is the standard way. Note that you can partition on 1 axis, or 2, or more. In this case partitioning on time and channel might make the most sense. Are there any ways to keep the index size to a minimum? At the moment I have a few weeks of data, about 180GB, loaded into a single table and indexed on sample_time and channel and the index takes up 180GB too. It may be that with small enough partitions indexes aren't really needed. That's been the case for a lot of data I've worked with in the past. Since this is all for a typically budget-restricted PhD project the hardware is just a high-end desktop workstation with (at the moment) 2*2TB drives organised into a single 4TB partition using FreeBSD's vinum system. Partitioning should definitely help. You might want to go with RAID-1 instead of RAID-0 since the read performance is similar under most modern OSes. I know linux now aggregates the two drives together to read, I'd assume BSD does too. That way you've got better reliability and about the same performance. Load times will be about double, but that's a one time thing, right? -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
Asher Hoskins wrote: If I partition so that each partition holds data for a single channel (and set a CHECK constraint for this) then I can presumably remove the channel from the index since constraint exclusion will mean that only partitions holding the channel I'm interested in will be searched in a query. Given that within a partition all of my sample_time's will be different do you know if there's a more efficient way to index these? how do you plan on accessing this monster data? do you expect to be looking up single values or small set of values at a specific time? seems to me like this is the sort of data thats more often processed in the aggregate, like running a fourier analysis of sliding windows, and that sort of data processing may well be more efficiently done with fixed block binary files rather than relational databases, as there's no real relationships in this data. for instance, a directory for each sensor, with a directory for each week, and a file for each hour, containing the hours worth of samples in fixed binary blocks after a file header identifying it. you can random access a specific time sample by using fseek (sampletime-starttimeofblock) * blocksize + headersize or whatever. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
On 2/9/2010 4:41 PM, Asher Hoskins wrote: Thanks for that, it looks like partitioning is the way to go. I'm assuming that I should try and keep my total_relation_sizes less than the memory size of the machine? This depends on what the quires look like. As other have stated when partitioning you have to consider how the data is quired. If I partition so that each partition holds data for a single channel (and set a CHECK constraint for this) then I can presumably remove the channel from the index since constraint exclusion will mean that only partitions holding the channel I'm interested in will be searched in a query. Given that within a partition all of my sample_time's will be different do you know if there's a more efficient way to index these? Given the timestamp will most likely be the where clause, NO on the plus side its only 8 bytes All legitimate Magwerks Corporation quotations are sent in a .PDF file attachment with a unique ID number generated by our proprietary quotation system. Quotations received via any other form of communication will not be honored. CONFIDENTIALITY NOTICE: This e-mail, including attachments, may contain legally privileged, confidential or other information proprietary to Magwerks Corporation and is intended solely for the use of the individual to whom it addresses. If the reader of this e-mail is not the intended recipient or authorized agent, the reader is hereby notified that any unauthorized viewing, dissemination, distribution or copying of this e-mail is strictly prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and destroy all occurrences of this e-mail immediately. Thank you. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
Is it also possible to denormalize by putting the 'channel' data in the first table (especially if it isn't very much)? Maintaining a foreign key constraint can impact performance significantly in most RDBMS's, even when deferring checking. I could be wrong, but I suspect PostgreSQL is no different. Or keep the data normalized and remove the constraint altogether. Also remove any primary key constraint so that it doesn't have to check uniqueness, and avoid as many indexes as you can. You have to take a leap of faith that you created your program well enough to not get out of sync. I would be interested to hear comments on this. These are some of the things we did on systems I have worked on running Oracle that handled even higher volumes (tens to hundreds of thousands of transactions per second or higher... sustained throughout the day at least on the lower volume). Granted we had real heavy hardware but the DBAs forbade us to create constraints and indexes etc. for this reason; except on less active tables. Everyone has already talked about partitioning, but load balancing across machines if you can afford a couple or few more could help too. Not sure what facility Postgres has for this though (I would be interested to hear comments on this too! :-) BillR -Original Message- From: pgsql-general-ow...@postgresql.org [mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Asher Sent: February-09-10 12:47 PM To: pgsql-general@postgresql.org Subject: [GENERAL] Best way to handle multi-billion row read-only table? Hello. I'm putting together a database to store the readings from various measurement devices for later processing. Since these things (water pressure monitors attached to very large water pipes) take readings at 200Hz and are typically deployed over multiple sites for several months at a time I've got many billions of rows of data, each (at the moment) with the following simple format: value REAL NOT NULL, sample_time TIMESTAMP WITH TIME ZONE NOT NULL, channel INTEGER REFERENCES channel(id) NOT NULL (Where the channel table contains metadata to identify the particular sensor, data logger, etc. used to obtain the data and the combination of channel and sample_time is unique.) Once loaded into the database the data will never be deleted or modified and will typically be accessed over a particular date range for a particular channel (e.g. sample_time = X AND sample_time = Y AND channel=Z). A typical query won't return more than a few million rows and speed is not desperately important (as long as the time is measured in minutes rather than hours). Are there any recommended ways to organise this? Should I partition my big table into multiple smaller ones which will always fit in memory (this would result in several hundreds or thousands of sub-tables)? Are there any ways to keep the index size to a minimum? At the moment I have a few weeks of data, about 180GB, loaded into a single table and indexed on sample_time and channel and the index takes up 180GB too. Since this is all for a typically budget-restricted PhD project the hardware is just a high-end desktop workstation with (at the moment) 2*2TB drives organised into a single 4TB partition using FreeBSD's vinum system. Many thanks for any help, Asher. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general Spam/Virus scanning by CanIt Pro For more information see http://www.kgbinternet.com/SpamFilter.htm To control your spam filter, log in at http://filter.kgbinternet.com -- BEGIN-ANTISPAM-VOTING-LINKS -- Teach CanIt if this mail (ID 80270060) is spam: Spam: http://filter.kgbinternet.com/canit/b.php?i=80270060m=5d99840e72f9t=201002 09c=s Not spam: http://filter.kgbinternet.com/canit/b.php?i=80270060m=5d99840e72f9t=201002 09c=n Forget vote: http://filter.kgbinternet.com/canit/b.php?i=80270060m=5d99840e72f9t=201002 09c=f -- END-ANTISPAM-VOTING-LINKS __ Information from ESET Smart Security, version of virus signature database 4852 (20100209) __ The message was checked by ESET Smart Security. http://www.eset.com -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
On Tue, Feb 9, 2010 at 8:06 PM, BillR iamb...@williamrosmus.com wrote: Is it also possible to denormalize by putting the 'channel' data in the first table (especially if it isn't very much)? Maintaining a foreign key constraint can impact performance significantly in most RDBMS's, even when deferring checking. I could be wrong, but I suspect PostgreSQL is no different. Or keep the data normalized and remove the constraint altogether. Also remove any primary key constraint so that it doesn't have to check uniqueness, and avoid as many indexes as you can. You have to take a leap of faith that you created your program well enough to not get out of sync. I would be interested to hear comments on this. These are some of the things we did on systems I have worked on running Oracle that handled even higher volumes (tens to hundreds of thousands of transactions per second or higher... sustained throughout the day at least on the lower volume). Granted we had real heavy hardware but the DBAs forbade us to create constraints and indexes etc. for this reason; except on less active tables. Everyone has already talked about partitioning, but load balancing across machines if you can afford a couple or few more could help too. Not sure what facility Postgres has for this though (I would be interested to hear I was under the impression the data was being gathered elsewhere and then imported, so the insert performance isn't as critical as if it was being done real time. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
If you will be selecting sets of data within a time range, it should also improve performance if you can build a clustered index on the sample_time. It may also be worth looking at whether partitioning by timestamp channel offers any advantages. Brent Wood Brent Wood DBA/GIS consultant NIWA, Wellington New Zealand Justin Graf 02/10/10 3:07 PM On 2/9/2010 4:41 PM, Asher Hoskins wrote: Thanks for that, it looks like partitioning is the way to go. I'm assuming that I should try and keep my total_relation_sizes less than the memory size of the machine? This depends on what the quires look like. As other have stated when partitioning you have to consider how the data is quired. If I partition so that each partition holds data for a single channel (and set a CHECK constraint for this) then I can presumably remove the channel from the index since constraint exclusion will mean that only partitions holding the channel I'm interested in will be searched in a query. Given that within a partition all of my sample_time's will be different do you know if there's a more efficient way to index these? Given the timestamp will most likely be the where clause, NO on the plus side its only 8 bytes All legitimate Magwerks Corporation quotations are sent in a .PDF file attachment with a unique ID number generated by our proprietary quotation system. Quotations received via any other form of communication will not be honored. CONFIDENTIALITY NOTICE: This e-mail, including attachments, may contain legally privileged, confidential or other information proprietary to Magwerks Corporation and is intended solely for the use of the individual to whom it addresses. If the reader of this e-mail is not the intended recipient or authorized agent, the reader is hereby notified that any unauthorized viewing, dissemination, distribution or copying of this e-mail is strictly prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and destroy all occurrences of this e-mail immediately. Thank you. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general NIWA is the trading name of the National Institute of Water Atmospheric Research Ltd.
Re: [GENERAL] Best way to handle multi-billion row read-only table?
Asher wrote: Once loaded into the database the data will never be deleted or modified and will typically be accessed over a particular date range for a particular channel (e.g. sample_time = X AND sample_time = Y AND channel=Z). A typical query won't return more than a few million rows and speed is not desperately important (as long as the time is measured in minutes rather than hours). Are there any recommended ways to organise this? Should I partition my big table into multiple smaller ones which will always fit in memory (this would result in several hundreds or thousands of sub-tables)? Are there any ways to keep the index size to a minimum? At the moment I have a few weeks of data, about 180GB, loaded into a single table and indexed on sample_time and channel and the index takes up 180GB too. One approach to consider is partitioning by sample_time and not even including the channel number in the index. You've got tiny records; there's going to be hundreds of channels of data on each data page pulled in, right? Why not minimize physical I/O by reducing the index size, just read that whole section of time in to memory (they should be pretty closely clustered and therefore mostly sequential I/O), and then push the filtering by channel onto the CPU instead. If you've got billions of rows, you're going to end up disk bound anyway; minimizing physical I/O and random seeking around at the expense of CPU time could be a big win. -- Greg Smith2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.com -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Best way to handle multi-billion row read-only table?
On Tue, Feb 9, 2010 at 11:51 PM, Greg Smith g...@2ndquadrant.com wrote: Asher wrote: Once loaded into the database the data will never be deleted or modified and will typically be accessed over a particular date range for a particular channel (e.g. sample_time = X AND sample_time = Y AND channel=Z). A typical query won't return more than a few million rows and speed is not desperately important (as long as the time is measured in minutes rather than hours). Are there any recommended ways to organise this? Should I partition my big table into multiple smaller ones which will always fit in memory (this would result in several hundreds or thousands of sub-tables)? Are there any ways to keep the index size to a minimum? At the moment I have a few weeks of data, about 180GB, loaded into a single table and indexed on sample_time and channel and the index takes up 180GB too. One approach to consider is partitioning by sample_time and not even including the channel number in the index. You've got tiny records; there's going to be hundreds of channels of data on each data page pulled in, right? Why not minimize physical I/O by reducing the index size, just read that whole section of time in to memory (they should be pretty closely clustered and therefore mostly sequential I/O), and then push the filtering by channel onto the CPU instead. If you've got billions of rows, you're going to end up disk bound anyway; minimizing physical I/O and random seeking around at the expense of CPU time could be a big win. If they're put in in a one time load, load them in channel order into the partitions, and the stats should see the perfect ordering and know to seq scan the right part of the table. I think. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general