Re: [PERFORM] splitting data into multiple tables

2010-01-26 Thread nair rajiv
On Tue, Jan 26, 2010 at 5:15 PM, Matthew Wakeling wrote:

> On Mon, 25 Jan 2010, nair rajiv wrote:
>
>> I am working on a project that will extract structured content from
>> Wikipedia and put it in our database...
>>
>> there is a table which will have approximately 5 crore entries after data
>> harvesting.
>>
>
> Have you asked the Wikimedia Foundation if they mind you consuming that
> much of their bandwidth, or even if there are copyright issues involved in
> grabbing that much of their data?
>


We are downloading the NT and OWL files made available for download at
http://wiki.dbpedia.org/Downloads34


> (The other problem with using the word "crore" is that although it may mean
> 10,000,000 in a few countries, it could also mean 500,000.)
>
> Matthew
>
> --
> Of course it's your fault. Everything here's your fault - it says so in your
> contract. - Quark
>


Re: [PERFORM] splitting data into multiple tables

2010-01-25 Thread nair rajiv
On Tue, Jan 26, 2010 at 6:19 AM, Andres Freund  wrote:

> On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
> > On Tue, Jan 26, 2010 at 1:01 AM, Craig James wrote:
> >   I am working on a project that will extract structured content
> > from Wikipedia and put it in our database. Before loading the data into
> > the database I wrote a script to estimate the number of rows each table
> > would hold once the data is in, and I found that one table will have
> > approximately 50,000,000 rows after data harvesting.
> > Is it advisable to keep so much data in one table?
> Depends on your access patterns, i.e. how many rows are you accessing at the
> same time - do those have some common locality, and such.
>

 I'll give a brief idea of this table. The important columns are
subject, predicate and object. Given a predicate and an object, one should
be able to get all the subjects; given a subject and a predicate, one should
be able to retrieve all the objects. I have created an index on each of
these three columns.
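
For reference, a rough sketch of the kind of table and indexes I am
describing (the 'triples' table and index names below are only placeholders,
not our actual schema):

CREATE TABLE triples
(
  subject   character varying NOT NULL,
  predicate character varying NOT NULL,
  object    character varying NOT NULL
);

-- One btree index per column; the planner can combine two of them
-- (bitmap AND) for (predicate, object) -> subjects or
-- (subject, predicate) -> objects lookups.
CREATE INDEX triples_subject_idx   ON triples USING btree (subject);
CREATE INDEX triples_predicate_idx ON triples USING btree (predicate);
CREATE INDEX triples_object_idx    ON triples USING btree (object);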

>
>
> >   I have read about 'partitioning' a table. Another idea I have is
> > to break the table into separate tables once the number of rows in a table
> > reaches a certain limit, say 1,000,000. For example, dividing a table
> > 'datatable' into 'datatable_a', 'datatable_b' and so on, each having
> > 1,000,000 rows.
> > I need advice on whether I should go for partitioning or the approach I
> > have thought of.
> Your approach is pretty close to partitioning - except that partitioning makes
> that mostly invisible to the outside, so it is IMHO preferable.
>
> >   We have an HP server with 32 GB RAM and 16 processors. The storage
> > has 24 TB of disk space (1 TB per disk), which we have put on RAID-5.
> > It would be great to know which parameters in the postgres configuration
> > file we should change so that the database makes maximum use of the
> > server, for example parameters that would increase the speed of inserts
> > and selects.
> Not using RAID-5 possibly would be a good start - many people (me included)
> experienced bad write performance on it. It depends a great deal on the
> controller/implementation though.
> RAID-10 is normally considered more advantageous despite its lower usable
> space ratio.
> Did you create one big RAID-5 out of all disks? That's not a good idea,
> because it's pretty likely that another disk fails while you restore a
> previously failed disk. Unfortunately, in that configuration, that means you
> have lost your complete data (in the most common implementations at least).
>

No, I am using only 12 TB, i.e. 12 of the 24 disks I have.
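
On the partitioning suggestion: as I understand it, a minimal sketch using
PostgreSQL 8.x table inheritance would look roughly like the following (the
'datatable' names come from the example in my question; the partition key,
range boundaries and trigger are only placeholders, nothing we have
implemented yet):

-- Parent table (placeholder columns).
CREATE TABLE datatable
(
  ssid      bigint NOT NULL,
  subject   character varying,
  predicate character varying,
  object    character varying
);

-- Child tables, each holding one range of ssid values.
CREATE TABLE datatable_a
  (CHECK (ssid >= 1 AND ssid <= 1000000)) INHERITS (datatable);
CREATE TABLE datatable_b
  (CHECK (ssid > 1000000 AND ssid <= 2000000)) INHERITS (datatable);

-- Trigger that routes inserts on the parent into the right child.
CREATE OR REPLACE FUNCTION datatable_insert() RETURNS trigger AS $$
BEGIN
  IF NEW.ssid <= 1000000 THEN
    INSERT INTO datatable_a VALUES (NEW.*);
  ELSE
    INSERT INTO datatable_b VALUES (NEW.*);
  END IF;
  RETURN NULL;  -- the row is stored only in the child, not in the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER datatable_insert_trigger
  BEFORE INSERT ON datatable
  FOR EACH ROW EXECUTE PROCEDURE datatable_insert();

-- With constraint_exclusion enabled, queries on the parent skip child
-- tables whose CHECK constraints cannot match.

Queries would still be run against 'datatable' itself, so the split stays
invisible to the application, which I understand is the advantage over
splitting tables by hand.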

>
> Andres
>
> PS: Your lines are strangely wrapped...
>


Re: [PERFORM] splitting data into multiple tables

2010-01-25 Thread nair rajiv
On Tue, Jan 26, 2010 at 1:01 AM, Craig James wrote:

> Kevin Grittner wrote:
>
>> nair rajiv  wrote:
>>
>>
>>> I found there is a table which will have approximately 5 crore
>>> entries after data harvesting.
>>> Is it advisable to keep so much data in one table?
>>>
>>  That's 50,000,000 rows, right?
>>
>
> You should remember that words like lac and crore are not English words,
> and most English speakers around the world don't know what they mean.
>  Thousand, million, billion and so forth are the English words that everyone
> knows.
>



Oh, I am sorry, I wasn't aware of that.
I am reposting my query with the suggested changes.



Hello,

  I am working on a project that will extract structured content from
Wikipedia and put it in our database. Before loading the data into the
database I wrote a script to estimate the number of rows each table would
hold once the data is in, and I found that one table will have approximately
50,000,000 rows after data harvesting.
Is it advisable to keep so much data in one table?
  I have read about 'partitioning' a table. Another idea I have is
to break the table into separate tables once the number of rows in a table
reaches a certain limit, say 1,000,000. For example, dividing a table
'datatable' into 'datatable_a', 'datatable_b' and so on, each having
1,000,000 rows.
I need advice on whether I should go for partitioning or the approach I have
thought of.
  We have an HP server with 32 GB RAM and 16 processors. The storage
has 24 TB of disk space (1 TB per disk), which we have put on RAID-5.
It would be great to know which parameters in the postgres configuration
file we should change so that the database makes maximum use of the server,
for example parameters that would increase the speed of inserts and selects.


Thank you in advance
Rajiv Nair

>
> Craig
>


[PERFORM] splitting data into multiple tables

2010-01-25 Thread nair rajiv
Hello,

  I am working on a project that will extract structured content from
Wikipedia and put it in our database. Before loading the data into the
database I wrote a script to estimate the number of rows each table would
hold once the data is in, and I found that one table will have approximately
5 crore entries after data harvesting.
Is it advisable to keep so much data in one table?
  I have read about 'partitioning' a table. Another idea I have is
to break the table into separate tables once the number of rows in a table
reaches a certain limit, say 10 lacs. For example, dividing a table
'datatable' into 'datatable_a', 'datatable_b' and so on, each having 10 lac
entries.
I need advice on whether I should go for partitioning or the approach I have
thought of.
  We have an HP server with 32 GB RAM and 16 processors. The storage
has 24 TB of disk space (1 TB per disk), which we have put on RAID-5.
It would be great to know which parameters in the postgres configuration
file we should change so that the database makes maximum use of the server,
for example parameters that would increase the speed of inserts and selects.


Thank you in advance
Rajiv Nair


[PERFORM] query cost too high, anyway to reduce it

2009-12-05 Thread nair rajiv
------------------------
TABLE STRUCTURE
------------------------

CREATE TABLE gbobjects
(
  ssid bigint NOT NULL,
  nid character varying NOT NULL,
  inid bigint NOT NULL,
  uid bigint NOT NULL,
  status character varying,
  noofchanges integer NOT NULL,
  fieldschanged character varying[] NOT NULL,
  changetype bigint[] NOT NULL,
  noofcommits integer NOT NULL,
  noofchangesaftercommit integer NOT NULL,
  history bigint[] NOT NULL,
  gbtimestamp timestamp with time zone DEFAULT now(),
  rendered_nbh text,
  nbh text,
  CONSTRAINT gbobjects_pkey PRIMARY KEY (ssid)
)
WITH (OIDS=FALSE);
ALTER TABLE gbobjects OWNER TO postgres;


-- Index: nid_object

CREATE INDEX nid_object
  ON gbobjects
  USING btree
  (nid);


------------------------
Using EXPLAIN
------------------------

We populated the table with data and ran EXPLAIN on the following query:


dbpedia=# EXPLAIN SELECT nid, max(ssid) FROM gbobjects WHERE ssid <= 10 GROUP BY nid;

                                     QUERY PLAN
------------------------------------------------------------------------------------------
 GroupAggregate  (cost=20966.03..22944.49 rows=98923 width=27)
   ->  Sort  (cost=20966.03..21213.34 rows=98923 width=27)
         Sort Key: nid
         ->  Index Scan using ssid_object on gbobjects  (cost=0.00..10388.88 rows=98923 width=27)
               Index Cond: (ssid <= 10)


Total rows: 875459


The cost is very high. Is there a way to reduce it? We have kept the
postgresql configuration file as it is, i.e. it still has the default
settings. Can the cost be reduced by changing some parameters in the
postgresql.conf file? If yes, which parameters?

Operating system used: Ubuntu 9.04
PostgreSQL version: 8.3
RAM: 2 GB
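
For reference, these are the kinds of postgresql.conf parameters that seem
relevant here; the values below are only rough, untested guesses for a 2 GB
machine, not settings we have measured:

# postgresql.conf (PostgreSQL 8.3) - illustrative values only, untested
shared_buffers = 512MB          # the 8.3 default is quite small; often sized
                                # at roughly 25% of RAM
effective_cache_size = 1GB      # rough estimate of memory available for caching
work_mem = 16MB                 # per-sort/hash memory; affects the Sort node above
maintenance_work_mem = 128MB    # speeds up index builds and VACUUM
checkpoint_segments = 16        # fewer, larger checkpoints during bulk loads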

Thank you in advance
Rajiv Nair