Re: [PERFORM] splitting data into multiple tables
On Mon, 25 Jan 2010, Viji V Nair wrote:
> I think this won't help that much if you have a single machine. Partition the table and keep the data in different nodes. Have a look at the tools like pgpool-II.

So, partitioning. You have three choices:

1. Use a single table
2. Partition the table on the same server
3. Partition the data across multiple servers

This is in increasing order of complexity.

There will probably be no problem at all with option 1. The only problem arises if you run a query that performs a full sequential scan of the entire table, which would obviously take a while. If your queries are indexable, then option 1 is almost certainly the best option.

Option 2 adds complexity in the Postgres server. You will need to partition your tables in a logical manner - that is, there needs to be some difference between rows in table a compared to rows in table b. This means that the partitioning will in effect be a little like indexing. You do not want to have too many partitions. The advantage is that if a query requires a full sequential scan, then there is the possibility of skipping some of the partitions, although there is some complexity involved in getting this to work correctly. In a lot of cases, partitioning will make queries slower by confusing the planner.

Option 3 is only useful when you have a real performance problem with long-running queries (partitioning the data across servers) or with very large numbers of queries (duplicating the data across servers). It also adds much complexity. It is fairly simple to run "filter these results from the table" queries across multiple servers, but if that was all you were doing, you may as well use an index instead. It becomes impossible to perform proper cross-referencing queries without some very clever software (because not all the data is available on the server), which will probably be hard to manage and slow down the execution anyway.

My recommendation would be to stick with a single table unless you have a real need to partition.

Matthew

--
Note: some countries impose serious penalties for a conspiracy to overthrow the political system. THIS DOES NOT FIX THE VULNERABILITY.
 -- http://seclists.org/vulnwatch/2003/q2/0002.html
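For anyone weighing option 2, here is a minimal sketch of what same-server partitioning looked like on the 8.x releases current at the time - plain inheritance plus CHECK constraints. The table, column and partition names below are invented purely for illustration:

  -- parent table; applications query this one
  CREATE TABLE datatable (
      id         bigint,
      created_on date,
      payload    text
  );

  -- child tables, one per range, each carrying a CHECK constraint
  CREATE TABLE datatable_2010_01
      (CHECK (created_on >= DATE '2010-01-01' AND created_on < DATE '2010-02-01'))
      INHERITS (datatable);
  CREATE TABLE datatable_2010_02
      (CHECK (created_on >= DATE '2010-02-01' AND created_on < DATE '2010-03-01'))
      INHERITS (datatable);

  -- with constraint_exclusion enabled, the planner can skip any child whose
  -- CHECK constraint contradicts the WHERE clause
  SET constraint_exclusion = on;
  EXPLAIN SELECT * FROM datatable WHERE created_on = DATE '2010-01-15';

The "skipping some of the partitions" Matthew mentions is exactly this constraint exclusion step, and it only helps when the WHERE clause can be compared against the CHECK constraints at plan time.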
Re: [PERFORM] splitting data into multiple tables
On Mon, 25 Jan 2010, nair rajiv wrote:
> I am working on a project that will take out structured content from wikipedia and put it in our database... there is a table which will approximately have 5 crore entries after data harvesting.

Have you asked the Wikimedia Foundation if they mind you consuming that much of their bandwidth, or even if there are copyright issues involved in grabbing that much of their data?

(The other problem with using the word crore is that although it may mean 10,000,000 in a few countries, it could also mean 500,000.)

Matthew

--
Of course it's your fault. Everything here's your fault - it says so in your contract. - Quark
Re: [PERFORM] splitting data into multiple tables
On Tue, Jan 26, 2010 at 5:15 PM, Matthew Wakeling matt...@flymine.org wrote:
> On Mon, 25 Jan 2010, nair rajiv wrote:
>> I am working on a project that will take out structured content from wikipedia and put it in our database... there is a table which will approximately have 5 crore entries after data harvesting.
>
> Have you asked the Wikimedia Foundation if they mind you consuming that much of their bandwidth, or even if there are copyright issues involved in grabbing that much of their data?

We are downloading the nt and owl files kept for download at http://wiki.dbpedia.org/Downloads34

> (The other problem with using the word crore is that although it may mean 10,000,000 in a few countries, it could also mean 500,000.)
>
> Matthew
Re: [PERFORM] splitting data into multiple tables
Viji V Nair wrote:
> A 15k rpm SAS drive will give you a throughput of 12MB and 120 IOPS. Now you can calculate the number of disks, specifically spindles, for getting your desired throughput and IOPs

I think you mean 120MB/s for that first part. Regardless, presuming you can provision a database just based on IOPS rarely works. It's nearly impossible to estimate what you really need anyway for a database app, given that much of real-world behavior depends on the cached in memory vs. uncached footprint of the data you're working with. By the time you put a number of disks into an array, throw a controller card cache on top of it, then add the OS and PostgreSQL caches on top of those, you are so far disconnected from the underlying drive IOPS that speaking in those terms doesn't get you very far.

I struggle with this every time I talk with a SAN vendor. Their fixation on IOPS without considering things like how sequential scans mixed into random I/O will get handled is really disconnected from how databases work in practice. For example, I constantly end up needing to detune IOPS in favor of readahead to make SELECT x,y,z FROM t run at an acceptable speed on big tables.

--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.com
Re: [PERFORM] splitting data into multiple tables
On Tue, Jan 26, 2010 at 11:11 PM, Greg Smith g...@2ndquadrant.com wrote:
> Viji V Nair wrote:
>> A 15k rpm SAS drive will give you a throughput of 12MB and 120 IOPS. Now you can calculate the number of disks, specifically spindles, for getting your desired throughput and IOPs
>
> I think you mean 120MB/s for that first part. Regardless, presuming you can provision a database just based on IOPS rarely works. It's nearly impossible to estimate what you really need anyway for a database app, given that much of real-world behavior depends on the cached in memory vs. uncached footprint of the data you're working with. By the time you put a number of disks into an array, throw a controller card cache on top of it, then add the OS and PostgreSQL caches on top of those, you are so far disconnected from the underlying drive IOPS that speaking in those terms doesn't get you very far.
>
> I struggle with this every time I talk with a SAN vendor. Their fixation on IOPS without considering things like how sequential scans mixed into random I/O will get handled is really disconnected from how databases work in practice. For example, I constantly end up needing to detune IOPS in favor of readahead to make SELECT x,y,z FROM t run at an acceptable speed on big tables.

Yes, you are right. There are catches in the SAN controllers also. SAN vendors won't give that much information regarding their internal controller design. They will say they have 4 external 4G ports; you should also check how many internal ports they have and how the controllers are operating, in Active-Active or Active-Standby mode.
Re: [PERFORM] splitting data into multiple tables
Viji V Nair wrote:
> There are catches in the SAN controllers also. SAN vendors won't give that much information regarding their internal controller design. They will say they have 4 external 4G ports; you should also check how many internal ports they have and how the controllers are operating, in Active-Active or Active-Standby mode.

Right, the SAN cache serves the same purpose as the controller cache on direct-attached storage. I've never seen a Fiber Channel card that had its own local cache too; doubt that's even possible. So I think of them as basically being the same type of cache, with the primary difference being that transfers between the host and the cache have some latency on them with FC compared to direct storage. You're right that people should question the internal design too, of course.

Some days I wonder if I'm in the wrong business - the people who do SAN tuning seem to have no idea what they're doing and yet are still expensive to hire. But this is off-topic for the question being asked here.

--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.com
[PERFORM] splitting data into multiple tables
Hello,

I am working on a project that will take out structured content from wikipedia and put it in our database. Before putting the data into the database I wrote a script to find out the number of rows every table would be having after the data is in, and I found there is a table which will approximately have 5 crore entries after data harvesting. Is it advisable to keep so much data in one table?

I have read about 'partitioning' a table. Another idea I have is to break the table into different tables after the number of rows in a table has reached a certain limit, say 10 lacs. For example, dividing a table 'datatable' into 'datatable_a', 'datatable_b', each having 10 lac entries. I needed advice on whether I should go for partitioning or the approach I have thought of.

We have an HP server with 32GB RAM and 16 processors. The storage has 24TB of disk space (1TB/HD). We have put them on RAID-5. It will be great if we could know the parameters that can be changed in the postgres configuration file so that the database makes maximum utilization of the server we have, for example parameters that would increase the speed of inserts and selects.

Thank you in advance
Rajiv Nair
Re: [PERFORM] splitting data into multiple tables
nair rajiv nair...@gmail.com wrote:
> I found there is a table which will approximately have 5 crore entries after data harvesting. Is it advisable to keep so much data in one table?

That's 50,000,000 rows, right? At this site, you're looking at a non-partitioned table with more than seven times that if you go to a case and click the Court Record Events button: http://wcca.wicourts.gov/

> I have read about 'partitioning' a table. Another idea I have is to break the table into different tables after the number of rows in a table has reached a certain limit, say 10 lacs. For example, dividing a table 'datatable' into 'datatable_a', 'datatable_b', each having 10 lac entries. I needed advice on whether I should go for partitioning or the approach I have thought of.

It can help, and it can hurt. It depends on the nature of the data and how it is used. To get a meaningful answer, I think we'd need to know a bit more about it.

> We have an HP server with 32GB RAM and 16 processors. The storage has 24TB of disk space (1TB/HD). We have put them on RAID-5. It will be great if we could know the parameters that can be changed in the postgres configuration file so that the database makes maximum utilization of the server we have.

Again, it depends a bit on the nature of the queries. For ideas on where to start, you might want to look here: http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

If you get any particular queries which aren't performing as well as you think they should, you can post here with details. See this for information to include: http://wiki.postgresql.org/wiki/SlowQueryQuestions

-Kevin
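As a hedged illustration of the kind of settings that tuning page walks through (the values below are placeholders, not recommendations - the right numbers depend on the workload and on the 32GB of RAM mentioned above):

  -- session-level settings often raised while bulk loading; values are illustrative only
  SET maintenance_work_mem = '1GB';  -- speeds up index builds after the load
  SET synchronous_commit = off;      -- trades a small durability window for faster inserts
  -- shared_buffers, checkpoint_segments and wal_buffers are set in
  -- postgresql.conf and only take effect after a reload or restart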
Re: [PERFORM] splitting data into multiple tables
Kevin Grittner wrote:
> nair rajiv nair...@gmail.com wrote:
>> I found there is a table which will approximately have 5 crore entries after data harvesting. Is it advisable to keep so much data in one table?
>
> That's 50,000,000 rows, right?

You should remember that words like lac and crore are not English words, and most English speakers around the world don't know what they mean. Thousand, million, billion and so forth are the English words that everyone knows.

Craig
Re: [PERFORM] splitting data into multiple tables
On Tue, Jan 26, 2010 at 1:01 AM, Craig James craig_ja...@emolecules.com wrote:
> Kevin Grittner wrote:
>> nair rajiv nair...@gmail.com wrote:
>>> I found there is a table which will approximately have 5 crore entries after data harvesting. Is it advisable to keep so much data in one table?
>>
>> That's 50,000,000 rows, right?
>
> You should remember that words like lac and crore are not English words, and most English speakers around the world don't know what they mean. Thousand, million, billion and so forth are the English words that everyone knows.

Oh, I am sorry. I wasn't aware of that. I repost my query with the suggested changes.

Hello,

I am working on a project that will take out structured content from wikipedia and put it in our database. Before putting the data into the database I wrote a script to find out the number of rows every table would be having after the data is in, and I found there is a table which will approximately have 50,000,000 rows after data harvesting. Is it advisable to keep so much data in one table?

I have read about 'partitioning' a table. Another idea I have is to break the table into different tables after the number of rows in a table has reached a certain limit, say 1,000,000. For example, dividing a table 'datatable' into 'datatable_a', 'datatable_b', each having 1,000,000 rows. I needed advice on whether I should go for partitioning or the approach I have thought of.

We have an HP server with 32GB RAM and 16 processors. The storage has 24TB of disk space (1TB/HD). We have put them on RAID-5. It will be great if we could know the parameters that can be changed in the postgres configuration file so that the database makes maximum utilization of the server we have, for example parameters that would increase the speed of inserts and selects.

Thank you in advance
Rajiv Nair
Re: [PERFORM] splitting data into multiple tables
On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
> On Tue, Jan 26, 2010 at 1:01 AM, Craig James craig_ja...@emolecules.com wrote:
> I am working on a project that will take out structured content from wikipedia and put it in our database. Before putting the data into the database I wrote a script to find out the number of rows every table would be having after the data is in, and I found there is a table which will approximately have 50,000,000 rows after data harvesting. Is it advisable to keep so much data in one table?

Depends on your access patterns. I.e. how many rows are you accessing at the same time - do those have some common locality and such.

> I have read about 'partitioning' a table. Another idea I have is to break the table into different tables after the number of rows in a table has reached a certain limit, say 1,000,000. For example, dividing a table 'datatable' into 'datatable_a', 'datatable_b', each having 1,000,000 rows. I needed advice on whether I should go for partitioning or the approach I have thought of.

Your approach is pretty close to partitioning - except that partitioning makes that mostly invisible to the outside, so it is imho preferable.

> We have an HP server with 32GB RAM and 16 processors. The storage has 24TB of disk space (1TB/HD). We have put them on RAID-5. It will be great if we could know the parameters that can be changed in the postgres configuration file so that the database makes maximum utilization of the server we have, for example parameters that would increase the speed of inserts and selects.

Not using RAID-5 possibly would be a good start - many people (me included) experienced bad write performance on it. It depends a great deal on the controller/implementation though. RAID-10 is normally considered more advantageous despite its lower usable space ratio.

Did you create one big RAID-5 out of all the disks? That's not a good idea, because it's pretty likely that another disk fails while you restore a previously failed disk. Unfortunately in that configuration that means you have lost your complete data (in the most common implementations at least).

Andres

PS: Your lines are strangely wrapped...
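To illustrate the "mostly invisible to the outside" point Andres makes: with inheritance-style partitioning the usual trick at the time was a trigger on the parent table that routes each insert into the right child, so applications keep reading and writing a single table name. This is a sketch only - the function, trigger and partition names are invented, and the ranges would have to match the CHECK constraints on the children:

  CREATE OR REPLACE FUNCTION datatable_insert_router() RETURNS trigger AS $$
  BEGIN
      IF NEW.created_on < DATE '2010-02-01' THEN
          INSERT INTO datatable_2010_01 VALUES (NEW.*);
      ELSE
          INSERT INTO datatable_2010_02 VALUES (NEW.*);
      END IF;
      RETURN NULL;  -- the row has already been stored in a child table
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER datatable_insert
      BEFORE INSERT ON datatable
      FOR EACH ROW EXECUTE PROCEDURE datatable_insert_router();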
Re: [PERFORM] splitting data into multiple tables
On Tue, Jan 26, 2010 at 6:19 AM, Andres Freund and...@anarazel.de wrote:
> On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
>> On Tue, Jan 26, 2010 at 1:01 AM, Craig James craig_ja...@emolecules.com wrote:
>> I am working on a project that will take out structured content from wikipedia and put it in our database. Before putting the data into the database I wrote a script to find out the number of rows every table would be having after the data is in, and I found there is a table which will approximately have 50,000,000 rows after data harvesting. Is it advisable to keep so much data in one table?
>
> Depends on your access patterns. I.e. how many rows are you accessing at the same time - do those have some common locality and such.

I'll give a brief idea of how this table is. The important columns are subject, predicate and object. So given a predicate and object one should be able to get all the subjects, and given a subject and a predicate one should be able to retrieve all the objects. I have created indexes on these three columns.

>> I have read about 'partitioning' a table. Another idea I have is to break the table into different tables after the number of rows in a table has reached a certain limit, say 1,000,000. For example, dividing a table 'datatable' into 'datatable_a', 'datatable_b', each having 1,000,000 rows. I needed advice on whether I should go for partitioning or the approach I have thought of.
>
> Your approach is pretty close to partitioning - except that partitioning makes that mostly invisible to the outside, so it is imho preferable.
>
>> We have an HP server with 32GB RAM and 16 processors. The storage has 24TB of disk space (1TB/HD). We have put them on RAID-5. It will be great if we could know the parameters that can be changed in the postgres configuration file so that the database makes maximum utilization of the server we have, for example parameters that would increase the speed of inserts and selects.
>
> Not using RAID-5 possibly would be a good start - many people (me included) experienced bad write performance on it. It depends a great deal on the controller/implementation though. RAID-10 is normally considered more advantageous despite its lower usable space ratio.
>
> Did you create one big RAID-5 out of all the disks? That's not a good idea, because it's pretty likely that another disk fails while you restore a previously failed disk. Unfortunately in that configuration that means you have lost your complete data (in the most common implementations at least).

No, I am using only 12TB, i.e. 12 HDs, of the 24TB I have.

> Andres
>
> PS: Your lines are strangely wrapped...
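Given that access pattern (all subjects for a predicate+object pair, and all objects for a subject+predicate pair), two composite indexes matching the lookup order usually serve those queries better than three independent single-column indexes. A hedged sketch only, with invented table, column and index names, and text columns assumed:

  CREATE TABLE datatable (
      subject   text NOT NULL,
      predicate text NOT NULL,
      object    text NOT NULL
  );

  -- "given a predicate and object, get all the subjects"
  CREATE INDEX datatable_pred_obj_idx ON datatable (predicate, object);

  -- "given a subject and a predicate, retrieve all the objects"
  CREATE INDEX datatable_subj_pred_idx ON datatable (subject, predicate);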
Re: [PERFORM] splitting data into multiple tables
On Tue, Jan 26, 2010 at 9:18 AM, nair rajiv nair...@gmail.com wrote:
> On Tue, Jan 26, 2010 at 6:19 AM, Andres Freund and...@anarazel.de wrote:
>> On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
>>> On Tue, Jan 26, 2010 at 1:01 AM, Craig James craig_ja...@emolecules.com wrote:
>>> I am working on a project that will take out structured content from wikipedia and put it in our database. Before putting the data into the database I wrote a script to find out the number of rows every table would be having after the data is in, and I found there is a table which will approximately have 50,000,000 rows after data harvesting. Is it advisable to keep so much data in one table?
>>
>> Depends on your access patterns. I.e. how many rows are you accessing at the same time - do those have some common locality and such.
>
> I'll give a brief idea of how this table is. The important columns are subject, predicate and object. So given a predicate and object one should be able to get all the subjects, and given a subject and a predicate one should be able to retrieve all the objects. I have created indexes on these three columns.
>
>>> I have read about 'partitioning' a table. Another idea I have is to break the table into different tables after the number of rows in a table has reached a certain limit, say 1,000,000. For example, dividing a table 'datatable' into 'datatable_a', 'datatable_b', each having 1,000,000 rows. I needed advice on whether I should go for partitioning or the approach I have thought of.
>>
>> Your approach is pretty close to partitioning - except that partitioning makes that mostly invisible to the outside, so it is imho preferable.
>
>>> We have an HP server with 32GB RAM and 16 processors. The storage has 24TB of disk space (1TB/HD). We have put them on RAID-5. It will be great if we could know the parameters that can be changed in the postgres configuration file so that the database makes maximum utilization of the server we have, for example parameters that would increase the speed of inserts and selects.
>>
>> Not using RAID-5 possibly would be a good start - many people (me included) experienced bad write performance on it. It depends a great deal on the controller/implementation though. RAID-10 is normally considered more advantageous despite its lower usable space ratio.
>>
>> Did you create one big RAID-5 out of all the disks? That's not a good idea, because it's pretty likely that another disk fails while you restore a previously failed disk. Unfortunately in that configuration that means you have lost your complete data (in the most common implementations at least).
>
> No, I am using only 12TB, i.e. 12 HDs, of the 24TB I have.

A 15k rpm SAS drive will give you a throughput of 12MB and 120 IOPS. Now you can calculate the number of disks, specifically spindles, for getting your desired throughput and IOPs.

> Andres
>
> PS: Your lines are strangely wrapped...
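As a rough worked example of that sizing approach (taking the quoted per-drive numbers at face value, reading the throughput figure as 120MB/s as suggested upthread, and ignoring RAID overhead and all the caching layers): if the workload were assumed to need 2,400 random IOPS and 600MB/s of sequential scan bandwidth, the spindle counts would come out as

  random I/O:  2,400 IOPS / 120 IOPS per spindle = 20 spindles
  sequential:    600 MB/s / 120 MB/s per spindle =  5 spindles

so the random-I/O requirement, not raw throughput, would dictate the array size - which is also why the replies above caution against provisioning on IOPS alone once caches and readahead enter the picture.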