Re: [PERFORM] splitting data into multiple tables

2010-01-26 Thread Matthew Wakeling

On Mon, 25 Jan 2010, Viji V Nair wrote:

I think this won't help that much if you have a single machine. Partition the
table and keep the data in different nodes. Have a look at tools like
pgpool-II.


So partitioning. You have three choices:

1. Use a single table
2. Partition the table on the same server
3. Partition the data across multiple servers.

This is in increasing order of complexity.

There will probably be no problem at all with option 1. The only problem 
arises if you run a query that performs a full sequential scan of the 
entire table, which would obviously take a while. If your queries are 
indexable, then option 1 is almost certainly the best option.
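
As a minimal sketch of option 1 (table and column names here are hypothetical),
an ordinary index is usually all that is needed to stop queries from touching
more than a small slice of a big table:

  CREATE TABLE bigtable (
      id         bigserial PRIMARY KEY,
      category   integer   NOT NULL,
      created_at timestamp NOT NULL,
      payload    text
  );

  CREATE INDEX bigtable_category_idx ON bigtable (category);

  -- satisfied from the index plus the matching rows only,
  -- never a full sequential scan of the whole table
  SELECT * FROM bigtable WHERE category = 42;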


Option 2 adds complexity in the Postgres server. You will need to 
partition your tables in a logical manner - that is, there needs to be 
some difference between rows in table a compared to rows in table b. This 
means that the partitioning will in effect be a little like indexing. You 
do not want to have too many partitions. The advantage is that if a query 
requires a full sequential scan, then there is the possibility of skipping 
some of the partitions, although there is some complexity involved in 
getting this to work correctly. In a lot of cases, partitioning will make 
queries slower by confusing the planner.
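
For what it's worth, a minimal sketch of how option 2 is typically set up on
current (8.x) releases - inheritance plus CHECK constraints plus constraint
exclusion. The table and the monthly split are hypothetical, chosen only to
show the mechanism:

  CREATE TABLE events (
      id        bigint      NOT NULL,
      logged_at timestamptz NOT NULL,
      payload   text
  );

  -- one child table per month, each carrying a CHECK constraint
  CREATE TABLE events_2010_01 (
      CHECK (logged_at >= '2010-01-01' AND logged_at < '2010-02-01')
  ) INHERITS (events);

  CREATE TABLE events_2010_02 (
      CHECK (logged_at >= '2010-02-01' AND logged_at < '2010-03-01')
  ) INHERITS (events);

  -- with constraint_exclusion enabled (8.4's default of 'partition'
  -- is fine), a scan on the parent can skip children whose CHECK
  -- constraint cannot match the WHERE clause
  EXPLAIN SELECT count(*)
    FROM events
   WHERE logged_at >= '2010-01-10' AND logged_at < '2010-01-20';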


Option 3 is only useful when you have a real performance problem with 
long-running queries (partitioning the data across servers) or with very 
large numbers of queries (duplicating the data across servers). It also 
adds much complexity. It is fairly simple to run "filter these results from 
the table" queries across multiple servers, but if that was all you were 
doing, you may as well use an index instead. It becomes impossible to 
perform proper cross-referencing queries without some very clever software 
(because not all the data is available on the server), which will probably 
be hard to manage and slow down the execution anyway.


My recommendation would be to stick with a single table unless you have a 
real need to partition.


Matthew

--
Note: some countries impose serious penalties for a conspiracy to overthrow
 the political system. THIS DOES NOT FIX THE VULNERABILITY.
  -- http://seclists.org/vulnwatch/2003/q2/0002.html



Re: [PERFORM] splitting data into multiple tables

2010-01-26 Thread Matthew Wakeling

On Mon, 25 Jan 2010, nair rajiv wrote:
I am working on a project that will take out structured content from 
wikipedia and put it in our database...

there is a table which will approximately have 5 crore entries after data
harvesting.


Have you asked the Wikimedia Foundation if they mind you consuming that 
much of their bandwidth, or even if there are copyright issues involved in 
grabbing that much of their data?


(The other problem with using the word "crore" is that although it may 
mean 10,000,000 in a few countries, it could also mean 500,000.)


Matthew

--
Of course it's your fault. Everything here's your fault - it says so in your
contract.   -- Quark



Re: [PERFORM] splitting data into multiple tables

2010-01-26 Thread nair rajiv
On Tue, Jan 26, 2010 at 5:15 PM, Matthew Wakeling matt...@flymine.org wrote:

 On Mon, 25 Jan 2010, nair rajiv wrote:

 I am working on a project that will take out structured content from
 wikipedia and put it in our database...

 there is a table which will approximately have 5 crore entries after data
 harvesting.


 Have you asked the Wikimedia Foundation if they mind you consuming that
 much of their bandwidth, or even if there are copyright issues involved in
 grabbing that much of their data?



We are downloading the .nt and .owl files available for download at
http://wiki.dbpedia.org/Downloads34


 (The other problem with using the word "crore" is that although it may mean
 10,000,000 in a few countries, it could also mean 500,000.)

 Matthew

 --
 Of course it's your fault. Everything here's your fault - it says so in
 your
 contract.- Quark



Re: [PERFORM] splitting data into multiple tables

2010-01-26 Thread Greg Smith

Viji V Nair wrote:
A 15k rpm SAS drive will give you a throughput of 12MB  and 120 IOPS. 
Now you can calculate the number of disks, specifically spindles, for 
getting your desired throughput and IOPs


I think you mean 120MB/s for that first part.  Regardless, presuming you 
can provision a database just based on IOPS rarely works.  It's nearly 
impossible to estimate what you really need anyway for a database app, 
given that much of real-world behavior depends on the cached in memory 
vs. uncached footprint of the data you're working with.  By the time you 
put a number of disks into an array, throw a controller card cache on 
top of it, then add the OS and PostgreSQL caches on top of those, you 
are so far disconnected from the underlying drive IOPS that speaking in 
those terms doesn't get you very far.  I struggle with this every time I 
talk with a SAN vendor.  Their fixation on IOPS without considering 
things like how sequential scans mixed into random I/O will get handled 
is really disconnected from how databases work in practice.  For 
example, I constantly end up needing to detune IOPS in favor of 
readahead to make "SELECT x,y,z FROM t" run at an acceptable speed on 
big tables.
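
As a rough, hypothetical illustration of that last point (the table name is
made up): comparing the on-disk size of a big table against how long a forced
full scan takes gives a quick feel for whether readahead is doing its job.

  -- size of the table on disk
  SELECT pg_size_pretty(pg_relation_size('t'));

  -- time a full sequential scan; size divided by the runtime reported
  -- below approximates the effective sequential read rate
  EXPLAIN ANALYZE SELECT count(*) FROM t;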


--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] splitting data into multiple tables

2010-01-26 Thread Viji V Nair
On Tue, Jan 26, 2010 at 11:11 PM, Greg Smith g...@2ndquadrant.com wrote:

 Viji V Nair wrote:

 A 15k rpm SAS drive will give you a throughput of 12MB  and 120 IOPS. Now
 you can calculate the number of disks, specifically spindles, for getting
 your desired throughput and IOPs


 I think you mean 120MB/s for that first part.  Regardless, presuming you
 can provision a database just based on IOPS rarely works.  It's nearly
 impossible to estimate what you really need anyway for a database app, given
 that much of real-world behavior depends on the cached in memory vs.
 uncached footprint of the data you're working with.  By the time you put a
 number of disks into an array, throw a controller card cache on top of it,
 then add the OS and PostgreSQL caches on top of those, you are so far
 disconnected from the underlying drive IOPS that speaking in those terms
 doesn't get you very far.  I struggle with this every time I talk with a SAN
 vendor.  Their fixation on IOPS without considering things like how
 sequential scans mixed into random I/O will get handled is really
 disconnected from how databases work in practice.  For example, I constantly
 end up needing to detune IOPS in favor of readahead to make SELECT x,y,z
 FROM t run at an acceptable speed on big tables.


Yes, you are right.

There are catches in the SAN controllers also. SAN vendors won't give that
much information regarding their internal controller design. They will say
they have 4 external 4G ports; you should also check how many internal ports
they have and how the controllers are operating, in Active-Active or
Active-Standby mode.




 --
 Greg Smith2ndQuadrant   Baltimore, MD
 PostgreSQL Training, Services and Support
 g...@2ndquadrant.com  www.2ndQuadrant.com




Re: [PERFORM] splitting data into multiple tables

2010-01-26 Thread Greg Smith

Viji V Nair wrote:
There are catches in the SAN controllers also. SAN vendors won't give 
that much information regarding their internal controller design. They 
will say they have 4 external 4G ports; you should also check how many 
internal ports they have and how the controllers are operating, 
in Active-Active or Active-Standby mode.


Right, the SAN cache serves the same purpose as the controller cache on 
direct-attached storage.  I've never seen a Fibre Channel card that had 
its own local cache too; I doubt that's even possible.  So I think of them 
as basically being the same type of cache, with the primary difference 
being that transfers between the host and the cache have some latency 
with FC compared to direct storage.


You're right that people should question the internal design too of 
course.  Some days I wonder if I'm in the wrong business--the people who 
do SAN tuning seem to have no idea what they're doing and yet are still 
expensive to hire.  But this is off-topic for the question being asked here.


--
Greg Smith   2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com  www.2ndQuadrant.com




[PERFORM] splitting data into multiple tables

2010-01-25 Thread nair rajiv
Hello,

  I am working on a project that will take out structured content
from wikipedia
and put it in our database. Before putting the data into the database I
wrote a script to
find out the number of rows every table would be having after the data is in
and I found
there is a table which will approximately have 5 crore entries after data
harvesting.
Is it advisable to keep so much data in one table ?
  I have read about 'partitioning' a table. Another idea I have is
to break the table into
different tables after the number of rows in a table has reached a certain
limit, say 10 lacs.
For example, dividing a table 'datatable' into 'datatable_a', 'datatable_b',
each having 10 lac entries.
I needed advice on whether I should go for partitioning or the approach I
have thought of.
  We have an HP server with 32GB RAM, 16 processors. The storage has
24TB diskspace (1TB/HD).
We have put them on RAID-5. It will be great if we could know the parameters
that can be changed in the
postgres configuration file so that the database makes maximum utilization
of the server we have.
For example, parameters that would increase the speed of inserts and selects.


Thank you in advance
Rajiv Nair


Re: [PERFORM] splitting data into multiple tables

2010-01-25 Thread Kevin Grittner
nair rajiv nair...@gmail.com wrote:
 
 I found there is a table which will approximately have 5 crore
 entries after data harvesting.
 Is it advisable to keep so much data in one table ?
 
That's 50,000,000 rows, right?  At this site, you're looking at a
non-partitioned table with more than seven times that if you go to a
case and click the Court Record Events button:
 
http://wcca.wicourts.gov/
 
 I have read about 'partitioning' a table. An other idea I have is
 to break the table into different tables after the no of rows  in
 a table has reached a certain limit say 10 lacs.
 For example, dividing a table 'datatable' to 'datatable_a',
 'datatable_b' each having 10 lac entries.
 I needed advice on whether I should go for partitioning or the
 approach I have thought of.
 
It can help, and it can hurt.  It depends on the nature of the data
and how it is used.  To get a meaningful answer, I think we'd need
to know a bit more about it.
 
 We have a HP server with 32GB ram,16 processors. The storage has
 24TB diskspace (1TB/HD).
 We have put them on RAID-5. It will be great if we could know the
 parameters that can be changed in the postgres configuration file
 so that the database makes maximum utilization of the server we
 have.
 
Again, it depends a bit on the nature of the queries.  For ideas on
where to start, you might want to look here:
 
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
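
As a very rough starting point only - the values below are common rules of
thumb of the sort discussed on that page for a machine with 32GB of RAM, not
a recommendation tuned to your particular workload, so adjust and test:

  # postgresql.conf sketch for a 32GB server
  shared_buffers = 8GB            # ~25% of RAM is a common starting point
  effective_cache_size = 24GB     # roughly what the OS will cache for you
  checkpoint_segments = 32        # helps bulk insert/load performance
  work_mem = 64MB                 # per sort/hash, so keep it moderate
  maintenance_work_mem = 1GB      # speeds up index builds and VACUUM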
 
If you get any particular queries which aren't performing as well as
you think they should, you can post here with details.  See this for
information to include:
 
http://wiki.postgresql.org/wiki/SlowQueryQuestions
 
-Kevin



Re: [PERFORM] splitting data into multiple tables

2010-01-25 Thread Craig James

Kevin Grittner wrote:

nair rajiv nair...@gmail.com wrote:
 

I found there is a table which will approximately have 5 crore
entries after data harvesting.
Is it advisable to keep so much data in one table ?
 
That's 50,000,000 rows, right?


You should remember that words like lac and crore are not English words, and 
most English speakers around the world don't know what they mean.  Thousand, 
million, billion and so forth are the English words that everyone knows.

Craig



Re: [PERFORM] splitting data into multiple tables

2010-01-25 Thread nair rajiv
On Tue, Jan 26, 2010 at 1:01 AM, Craig James craig_ja...@emolecules.com wrote:

 Kevin Grittner wrote:

 nair rajiv nair...@gmail.com wrote:


 I found there is a table which will approximately have 5 crore
 entries after data harvesting.
 Is it advisable to keep so much data in one table ?

  That's 50,000,000 rows, right?


 You should remember that words like lac and crore are not English words,
 and most English speakers around the world don't know what they mean.
  Thousand, million, billion and so forth are the English words that everyone
 knows.




Oh, I am sorry, I wasn't aware of that.
I repost my query with the suggested changes.



Hello,

  I am working on a project that will take out structured content
from wikipedia
and put it in our database. Before putting the data into the database I
wrote a script to
find out the number of rows every table would be having after the data is in
and I found
there is a table which will approximately have 50,000,000 rows after data
harvesting.
Is it advisable to keep so much data in one table ?
  I have read about 'partitioning' a table. Another idea I have is
to break the table into
different tables after the number of rows in a table has reached a certain
limit, say 1,000,000.
For example, dividing a table 'datatable' into 'datatable_a', 'datatable_b',
each having 1,000,000 rows.
I needed advice on whether I should go for partitioning or the approach I
have thought of.
  We have an HP server with 32GB RAM, 16 processors. The storage has
24TB diskspace (1TB/HD).
We have put them on RAID-5. It will be great if we could know the parameters
that can be changed in the
postgres configuration file so that the database makes maximum utilization
of the server we have.
For example, parameters that would increase the speed of inserts and selects.


Thank you in advance
Rajiv Nair


 Craig



Re: [PERFORM] splitting data into multiple tables

2010-01-25 Thread Andres Freund
On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
 On Tue, Jan 26, 2010 at 1:01 AM, Craig James 
craig_ja...@emolecules.com wrote:
   I am working on a project that will take out structured content
 from wikipedia
 and put it in our database. Before putting the data into the database I
 wrote a script to
 find out the number of rows every table would be having after the data is
 in and I found
 there is a table which will approximately have 50,000,000 rows after data
 harvesting.
 Is it advisable to keep so much data in one table ?
Depends on your access patterns, i.e. how many rows you are accessing at the 
same time, and whether those rows have some common locality.


   I have read about 'partitioning' a table. An other idea I have is
 to break the table into
 different tables after the no of rows  in a table has reached a certain
 limit say 10,00,000.
 For example, dividing a table 'datatable' to 'datatable_a', 'datatable_b'
 each having 10,00,000 rows.
 I needed advice on whether I should go for partitioning or the approach I
 have thought of.
Your approach is pretty close to partitioning - except that partitioning makes 
that mostly invisible to the outside, so it is imho preferable.
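
To make the "invisible to the outside" part concrete, a minimal, hypothetical
sketch (reusing the datatable_a/datatable_b names from your mail and assuming
an id column to route on - adapt the condition to whatever key you actually
split by). The application only ever talks to datatable:

  -- datatable_a and datatable_b are assumed to be children created
  -- with INHERITS (datatable) plus matching CHECK constraints
  CREATE OR REPLACE FUNCTION datatable_insert() RETURNS trigger AS $$
  BEGIN
      IF NEW.id < 1000000 THEN
          INSERT INTO datatable_a VALUES (NEW.*);
      ELSE
          INSERT INTO datatable_b VALUES (NEW.*);
      END IF;
      RETURN NULL;  -- the row has already been stored in a child table
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER datatable_route
      BEFORE INSERT ON datatable
      FOR EACH ROW EXECUTE PROCEDURE datatable_insert();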

   We have a HP server with 32GB ram,16 processors. The storage has
 24TB diskspace (1TB/HD).
 We have put them on RAID-5. It will be great if we could know the
 parameters that can be changed in the
 postgres configuration file so that the database makes maximum utilization
 of the server we have.
 For eg parameters that would increase the speed of inserts and selects.
Not using RAID-5 would possibly be a good start - many people (me included) 
have experienced bad write performance on it. It depends a great deal on the 
controller/implementation though.
RAID-10 is normally considered more advantageous despite its lower 
usable space ratio.
Did you create one big RAID-5 out of all disks? That's not a good idea, because 
it's pretty likely that another disk will fail while you restore a previously 
failed disk. Unfortunately, in that configuration that means you have lost all 
of your data (in the most common implementations at least).

Andres

PS: Your lines are strangely wrapped...



Re: [PERFORM] splitting data into multiple tables

2010-01-25 Thread nair rajiv
On Tue, Jan 26, 2010 at 6:19 AM, Andres Freund and...@anarazel.de wrote:

 On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
  On Tue, Jan 26, 2010 at 1:01 AM, Craig James
 craig_ja...@emolecules.com wrote:
I am working on a project that will take out structured content
  from wikipedia
  and put it in our database. Before putting the data into the database I
  wrote a script to
  find out the number of rows every table would be having after the data is
  in and I found
  there is a table which will approximately have 50,000,000 rows after data
  harvesting.
  Is it advisable to keep so much data in one table ?
 Depends on your access patterns. I.e. how many rows are you accessing at
 the
 same time - do those have some common locality and such.


I'll give a brief idea of how this table looks. The important columns are
subject, predicate and object. So given a predicate and an object one should
be able to get all the subjects; given a subject and a predicate one should
be able to retrieve all the objects. I have created indexes on these three
columns.
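
To make those access patterns concrete, a hypothetical sketch (column types
and the example values are guesses); the composite indexes are shown purely
as an illustration of how the two lookups map onto indexes:

  CREATE TABLE datatable (
      subject   text NOT NULL,
      predicate text NOT NULL,
      object    text NOT NULL
  );

  -- (predicate, object) -> all matching subjects
  CREATE INDEX datatable_pred_obj_idx  ON datatable (predicate, object);

  -- (subject, predicate) -> all matching objects
  CREATE INDEX datatable_subj_pred_idx ON datatable (subject, predicate);

  SELECT subject FROM datatable
   WHERE predicate = 'capitalOf' AND object = 'India';

  SELECT object FROM datatable
   WHERE subject = 'New_Delhi' AND predicate = 'capitalOf';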



I have read about 'partitioning' a table. An other idea I have
 is
  to break the table into
  different tables after the no of rows  in a table has reached a certain
  limit say 10,00,000.
  For example, dividing a table 'datatable' to 'datatable_a', 'datatable_b'
  each having 10,00,000 rows.
  I needed advice on whether I should go for partitioning or the approach I
  have thought of.
 Your approach is pretty close to partitioning - except that partitioning
 makes
 that mostly invisible to the outside so it is imho preferrable.

We have a HP server with 32GB ram,16 processors. The storage
 has
  24TB diskspace (1TB/HD).
  We have put them on RAID-5. It will be great if we could know the
  parameters that can be changed in the
  postgres configuration file so that the database makes maximum
 utilization
  of the server we have.
  For eg parameters that would increase the speed of inserts and selects.
 Not using RAID-5 possibly would be a good start - many people (me included)
 experienced bad write performance on it. It depends a great deal on the
 controller/implementation though.
 RAID-10 is normally to be considered more advantageous despite its lower
 usable space ratio.
 Did you create one big RAID-5 out of all disks? Thats not a good idea,
 because
 its pretty likely that another disk fails while you restore a previously
 failed disk. Unfortunately in that configuration that means you have lost
 your
 complete data (in the most common implementations at least).


No, I am using only 12TB, i.e. 12 HDs of the 24TB I have.


 Andres

 PS: Your lines are strangely wrapped...



Re: [PERFORM] splitting data into multiple tables

2010-01-25 Thread Viji V Nair
On Tue, Jan 26, 2010 at 9:18 AM, nair rajiv nair...@gmail.com wrote:



 On Tue, Jan 26, 2010 at 6:19 AM, Andres Freund and...@anarazel.de wrote:

 On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
  On Tue, Jan 26, 2010 at 1:01 AM, Craig James
 craig_ja...@emolecules.com wrote:
I am working on a project that will take out structured
 content
  from wikipedia
  and put it in our database. Before putting the data into the database I
  wrote a script to
  find out the number of rows every table would be having after the data
 is
  in and I found
  there is a table which will approximately have 50,000,000 rows after
 data
  harvesting.
  Is it advisable to keep so much data in one table ?
 Depends on your access patterns. I.e. how many rows are you accessing at
 the
 same time - do those have some common locality and such.


  I'll give a brief idea of how this table is. The important columns
 are
 subject, predicate and object. So given a predicate and object one should
 be able to get all the subjects, given subject and a predicate one should
 be able to retrieve all the objects. I have created an indexes on these
 three
 columns.



I have read about 'partitioning' a table. An other idea I have
 is
  to break the table into
  different tables after the no of rows  in a table has reached a certain
  limit say 10,00,000.
  For example, dividing a table 'datatable' to 'datatable_a',
 'datatable_b'
  each having 10,00,000 rows.
  I needed advice on whether I should go for partitioning or the approach
 I
  have thought of.
 Your approach is pretty close to partitioning - except that partitioning
 makes
 that mostly invisible to the outside so it is imho preferrable.

We have a HP server with 32GB ram,16 processors. The storage
 has
  24TB diskspace (1TB/HD).
  We have put them on RAID-5. It will be great if we could know the
  parameters that can be changed in the
  postgres configuration file so that the database makes maximum
 utilization
  of the server we have.
  For eg parameters that would increase the speed of inserts and selects.
 Not using RAID-5 possibly would be a good start - many people (me
 included)
 experienced bad write performance on it. It depends a great deal on the
 controller/implementation though.
 RAID-10 is normally to be considered more advantageous despite its lower
 usable space ratio.
 Did you create one big RAID-5 out of all disks? Thats not a good idea,
 because
 its pretty likely that another disk fails while you restore a previously
 failed disk. Unfortunately in that configuration that means you have lost
 your
 complete data (in the most common implementations at least).


 No, I am using only 12TB i.e 12 HDs of the 24TB I have


A 15k rpm SAS drive will give you a throughput of 12MB  and 120 IOPS. Now
you can calculate the number of disks, specifically spindles, for getting
your desired throughput and IOPs



 Andres

 PS: Your lines are strangely wrapped...