Re: [PERFORM] splitting data into multiple tables
On Tue, Jan 26, 2010 at 5:15 PM, Matthew Wakeling wrote:
> On Mon, 25 Jan 2010, nair rajiv wrote:
>> I am working on a project that will extract structured content from
>> Wikipedia and put it in our database...
>>
>> there is a table which will approximately have 5 crore entries after data
>> harvesting.
>
> Have you asked the Wikimedia Foundation if they mind you consuming that
> much of their bandwidth, or even if there are copyright issues involved in
> grabbing that much of their data?

We are downloading the nt and owl files kept for download at
http://wiki.dbpedia.org/Downloads34

> (The other problem with using the word "crore" is that although it may mean
> 10,000,000 in a few countries, it could also mean 500,000 in others.)
>
> Matthew
>
> --
> Of course it's your fault. Everything here's your fault - it says so in your
> contract. - Quark
Re: [PERFORM] splitting data into multiple tables
On Tue, Jan 26, 2010 at 6:19 AM, Andres Freund wrote:
> On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
>> On Tue, Jan 26, 2010 at 1:01 AM, Craig James wrote:
>> I am working on a project that will extract structured content from
>> Wikipedia and put it in our database. Before putting the data into the
>> database I wrote a script to find out the number of rows every table
>> would have after the data is in, and I found there is a table which will
>> have approximately 50,000,000 rows after data harvesting.
>> Is it advisable to keep so much data in one table?
>
> Depends on your access patterns. I.e. how many rows are you accessing at the
> same time - do those have some common locality and such.

I'll give a brief idea of how this table is structured. The important columns
are subject, predicate and object. Given a predicate and an object one should
be able to get all the subjects, and given a subject and a predicate one should
be able to retrieve all the objects. I have created indexes on these three
columns.

>> I have read about 'partitioning' a table. Another idea I have is to break
>> the table into different tables after the number of rows in a table has
>> reached a certain limit, say 1,000,000. For example, dividing a table
>> 'datatable' into 'datatable_a', 'datatable_b', each having 1,000,000 rows.
>> I need advice on whether I should go for partitioning or the approach I
>> have thought of.
>
> Your approach is pretty close to partitioning - except that partitioning
> makes that mostly invisible to the outside, so it is imho preferable.
>
>> We have an HP server with 32GB RAM and 16 processors. The storage has 24TB
>> of disk space (1TB/HD). We have put them on RAID-5. It will be great if we
>> could know the parameters that can be changed in the postgres configuration
>> file so that the database makes maximum utilization of the server we have.
>> For example, parameters that would increase the speed of inserts and selects.
>
> Not using RAID-5 possibly would be a good start - many people (me included)
> experienced bad write performance on it. It depends a great deal on the
> controller/implementation though.
> RAID-10 is normally to be considered more advantageous despite its lower
> usable space ratio.
> Did you create one big RAID-5 out of all disks? That's not a good idea,
> because it's pretty likely that another disk fails while you restore a
> previously failed disk. Unfortunately, in that configuration that means you
> have lost your complete data (in the most common implementations at least).

No, I am using only 12TB, i.e. 12 HDs of the 24TB I have.

> Andres
>
> PS: Your lines are strangely wrapped...
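As a rough sketch of the two access patterns described above (the table and
column names below are assumed for illustration and are not the poster's actual
schema), two composite btree indexes cover both lookups directly, which usually
serves these query shapes better than three separate single-column indexes:

-- Hypothetical triple table; names are illustrative only.
CREATE TABLE triples
(
  subject   text NOT NULL,
  predicate text NOT NULL,
  object    text NOT NULL
);

-- (predicate, object) -> all matching subjects
CREATE INDEX triples_pred_obj_idx ON triples USING btree (predicate, object);

-- (subject, predicate) -> all matching objects
CREATE INDEX triples_subj_pred_idx ON triples USING btree (subject, predicate);

-- Both lookup patterns from the post then become simple index scans:
SELECT subject FROM triples WHERE predicate = 'dbpedia:population' AND object = '3431675';
SELECT object  FROM triples WHERE subject = 'dbpedia:Berlin' AND predicate = 'dbpedia:population';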
Re: [PERFORM] splitting data into multiple tables
On Tue, Jan 26, 2010 at 1:01 AM, Craig James wrote:
> Kevin Grittner wrote:
>> nair rajiv wrote:
>>> I found there is a table which will approximately have 5 crore
>>> entries after data harvesting.
>>> Is it advisable to keep so much data in one table ?
>>
>> That's 50,000,000 rows, right?
>
> You should remember that words like lac and crore are not English words,
> and most English speakers around the world don't know what they mean.
> Thousand, million, billion and so forth are the English words that everyone
> knows.

Oh, I am sorry, I wasn't aware of that. I repost my query with the suggested
changes.

Hello,

I am working on a project that will extract structured content from Wikipedia
and put it in our database. Before putting the data into the database I wrote a
script to find out the number of rows every table would have after the data is
in, and I found there is a table which will have approximately 50,000,000 rows
after data harvesting.
Is it advisable to keep so much data in one table?

I have read about 'partitioning' a table. Another idea I have is to break the
table into different tables after the number of rows in a table has reached a
certain limit, say 1,000,000. For example, dividing a table 'datatable' into
'datatable_a', 'datatable_b', each having 1,000,000 rows.
I need advice on whether I should go for partitioning or the approach I have
thought of.

We have an HP server with 32GB RAM and 16 processors. The storage has 24TB of
disk space (1TB/HD). We have put them on RAID-5. It will be great if we could
know the parameters that can be changed in the postgres configuration file so
that the database makes maximum utilization of the server we have.
For example, parameters that would increase the speed of inserts and selects.

Thank you in advance
Rajiv Nair

> Craig
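A minimal sketch of what inheritance-based partitioning (the mechanism
available in the 8.3/8.4 series) could look like for the 'datatable' example
above. The id column, the partition boundaries and the routing trigger are
assumptions for illustration, not details from the thread:

-- Parent table; the application keeps referring to "datatable" only.
CREATE TABLE datatable
(
  id   bigint NOT NULL,
  data text
);

-- Child tables carry CHECK constraints so constraint exclusion can prune them.
CREATE TABLE datatable_a (CHECK (id >=       1 AND id < 1000001)) INHERITS (datatable);
CREATE TABLE datatable_b (CHECK (id >= 1000001 AND id < 2000001)) INHERITS (datatable);

-- Route inserts on the parent to the right child.
CREATE OR REPLACE FUNCTION datatable_insert() RETURNS trigger AS $$
BEGIN
  IF NEW.id < 1000001 THEN
    INSERT INTO datatable_a VALUES (NEW.*);
  ELSE
    INSERT INTO datatable_b VALUES (NEW.*);
  END IF;
  RETURN NULL;  -- the row is already stored in a child table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER datatable_insert_trigger
  BEFORE INSERT ON datatable
  FOR EACH ROW EXECUTE PROCEDURE datatable_insert();

-- With constraint exclusion enabled the planner scans only the matching child:
SET constraint_exclusion = on;
SELECT * FROM datatable WHERE id = 1500000;  -- touches only datatable_b

Queries and inserts keep going against datatable itself, which is the "mostly
invisible to the outside" property of partitioning mentioned in the replies.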
[PERFORM] splitting data into multiple tables
Hello,

I am working on a project that will extract structured content from Wikipedia
and put it in our database. Before putting the data into the database I wrote a
script to find out the number of rows every table would have after the data is
in, and I found there is a table which will approximately have 5 crore entries
after data harvesting.
Is it advisable to keep so much data in one table?

I have read about 'partitioning' a table. Another idea I have is to break the
table into different tables after the number of rows in a table has reached a
certain limit, say 10 lacs. For example, dividing a table 'datatable' into
'datatable_a', 'datatable_b', each having 10 lac entries.
I need advice on whether I should go for partitioning or the approach I have
thought of.

We have an HP server with 32GB RAM and 16 processors. The storage has 24TB of
disk space (1TB/HD). We have put them on RAID-5. It will be great if we could
know the parameters that can be changed in the postgres configuration file so
that the database makes maximum utilization of the server we have.
For example, parameters that would increase the speed of inserts and selects.

Thank you in advance
Rajiv Nair
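No concrete settings are posted in this excerpt; as a hedged illustration only,
these are the postgresql.conf parameters most commonly tuned first on a
dedicated 8.3-era server with 32GB of RAM. The values are rough starting
points, not recommendations from the thread, and need benchmarking against the
real workload:

# postgresql.conf - illustrative starting points only (assumes a dedicated
# 32GB machine running the 8.3/8.4 series)
shared_buffers = 8GB                  # often set to roughly 25% of RAM
effective_cache_size = 24GB           # what the OS file cache is expected to hold
work_mem = 64MB                       # per sort/hash, per backend; raise carefully
maintenance_work_mem = 1GB            # speeds up CREATE INDEX and VACUUM
checkpoint_segments = 32              # fewer, larger checkpoints help bulk inserts
checkpoint_completion_target = 0.9
wal_buffers = 16MB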
[PERFORM] query cost too high, any way to reduce it
-- TABLE STRUCTURE --

CREATE TABLE gbobjects
(
  ssid bigint NOT NULL,
  nid character varying NOT NULL,
  inid bigint NOT NULL,
  uid bigint NOT NULL,
  status character varying,
  noofchanges integer NOT NULL,
  fieldschanged character varying[] NOT NULL,
  changetype bigint[] NOT NULL,
  noofcommits integer NOT NULL,
  noofchangesaftercommit integer NOT NULL,
  history bigint[] NOT NULL,
  gbtimestamp timestamp with time zone DEFAULT now(),
  rendered_nbh text,
  nbh text,
  CONSTRAINT gbobjects_pkey PRIMARY KEY (ssid)
)
WITH (OIDS=FALSE);
ALTER TABLE gbobjects OWNER TO postgres;

-- Index: nid_object

CREATE INDEX nid_object
  ON gbobjects
  USING btree
  (nid);

--- using EXPLAIN ---

We populated the table with data and used EXPLAIN:

dbpedia=# EXPLAIN SELECT nid, max(ssid) FROM gbobjects WHERE ssid <= 10 GROUP BY nid;

                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 GroupAggregate  (cost=20966.03..22944.49 rows=98923 width=27)
   ->  Sort  (cost=20966.03..21213.34 rows=98923 width=27)
         Sort Key: nid
         ->  Index Scan using ssid_object on gbobjects  (cost=0.00..10388.88 rows=98923 width=27)
               Index Cond: (ssid <= 10)

Total rows: 875459

The cost is very high. Is there a way to reduce the cost? We have kept the
postgresql configuration files as they are, i.e. the default configuration
files. Can the cost be reduced by changing some parameters in the
postgresql.conf file? If yes, which are those parameters?

Operating system used: ubuntu-9.04
postgresql version: 8.3
RAM: 2 GB

Thank you in advance
Rajiv Nair
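A hedged suggestion, not taken from the thread itself: EXPLAIN costs are in
arbitrary planner units rather than milliseconds, so the usual first steps are
to refresh the planner statistics, try an index that matches the GROUP BY,
give the sort more memory, and then compare plans with EXPLAIN ANALYZE. The
index name and the work_mem value below are illustrative assumptions:

-- Refresh planner statistics for the table.
ANALYZE gbobjects;

-- A composite index on the grouping column plus the aggregated column can
-- give the planner a cheaper path for this GROUP BY / max() query.
CREATE INDEX gbobjects_nid_ssid_idx ON gbobjects USING btree (nid, ssid);

-- Let the sort feeding the GroupAggregate run in memory (session setting,
-- example value only; the 8.3 default is 1MB).
SET work_mem = '32MB';

EXPLAIN ANALYZE
SELECT nid, max(ssid)
FROM gbobjects
WHERE ssid <= 10
GROUP BY nid;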