Re: [PERFORM] splitting data into multiple tables

Viji V Nair Mon, 25 Jan 2010 23:58:34 -0800

On Tue, Jan 26, 2010 at 9:18 AM, nair rajiv <nair...@gmail.com> wrote:


>
>
> On Tue, Jan 26, 2010 at 6:19 AM, Andres Freund <and...@anarazel.de> wrote:
>
>> On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
>> > On Tue, Jan 26, 2010 at 1:01 AM, Craig James
>> <craig_ja...@emolecules.com>wrote:
>> >           I am working on a project that will take out structured
>> content
>> > from wikipedia
>> > and put it in our database. Before putting the data into the database I
>> > wrote a script to
>> > find out the number of rows every table would be having after the data
>> is
>> > in and I found
>> > there is a table which will approximately have 50,000,000 rows after
>> data
>> > harvesting.
>> > Is it advisable to keep so much data in one table ?
>> Depends on your access patterns. I.e. how many rows are you accessing at
>> the
>> same time - do those have some common locality and such.
>>
>
>          I'll give a brief idea of how this table is. The important columns
> are
> subject, predicate and object. So given a predicate and object one should
> be able to get all the subjects, given subject and a predicate one should
> be able to retrieve all the objects. I have created an indexes on these
> three
> columns.
>
>>
>>
>> >           I have read about 'partitioning' a table. An other idea I have
>> is
>> > to break the table into
>> > different tables after the no of rows  in a table has reached a certain
>> > limit say 10,00,000.
>> > For example, dividing a table 'datatable' to 'datatable_a',
>> 'datatable_b'
>> > each having 10,00,000 rows.
>> > I needed advice on whether I should go for partitioning or the approach
>> I
>> > have thought of.
>> Your approach is pretty close to partitioning - except that partitioning
>> makes
>> that mostly invisible to the outside so it is imho preferrable.
>>
>> >           We have a HP server with 32GB ram,16 processors. The storage
>> has
>> > 24TB diskspace (1TB/HD).
>> > We have put them on RAID-5. It will be great if we could know the
>> > parameters that can be changed in the
>> > postgres configuration file so that the database makes maximum
>> utilization
>> > of the server we have.
>> > For eg parameters that would increase the speed of inserts and selects.
>> Not using RAID-5 possibly would be a good start - many people (me
>> included)
>> experienced bad write performance on it. It depends a great deal on the
>> controller/implementation though.
>> RAID-10 is normally to be considered more advantageous despite its lower
>> usable space ratio.
>> Did you create one big RAID-5 out of all disks? Thats not a good idea,
>> because
>> its pretty likely that another disk fails while you restore a previously
>> failed disk. Unfortunately in that configuration that means you have lost
>> your
>> complete data (in the most common implementations at least).
>>
>
> No, I am using only 12TB i.e 12 HDs of the 24TB I have
>

A 15k rpm SAS drive will give you a throughput of 12MB  and 120 IOPS. Now
you can calculate the number of disks, specifically spindles, for getting
your desired throughput and IOPs


>
>> Andres
>>
>> PS: Your lines are strangely wrapped...
>>
>
>

Re: [PERFORM] splitting data into multiple tables

Reply via email to