[HACKERS] Performance while loading data and indexing

2002-09-26 Thread Shridhar Daithankar

Hello all,

Some time back I posted a query about building a site with a 150GB database. In 
the last couple of weeks, lots of things were tested at my place, and there are 
some results and again some concerns. 

This is a long post. Please be patient and read through. If we win this, I guess 
we have a good marketing/advocacy case here..;-)

First the problems (For those who do not read beyond first page)

1) Database load time from a flat file using COPY is very high.
2) Creating an index takes a huge amount of time.
3) Any suggestions for runtime tuning, as data loading and queries will be 
going on in parallel?

Now the details. Note that this is a test run only..

Platform:- 4x Xeon2.4GHz/4GB RAM/4x48 SCSI RAID5/72 GB SCSI
RedHat7.2/PostgreSQL7.1.3

Database in flat file: 
125,000,000 records of around 100 bytes each. 
Flat file size 12GB

Load time: 14581 sec / ~8600 rows per sec / ~1 MB of data per sec.
Create unique composite index on 2 char fields and a timestamp field: 25226 sec.
Database size on disk: 26GB
Select query: 1.5 sec. for approx. 150 rows.

Important postgresql.conf settings

sort_mem = 12000
shared_buffers = 24000
fsync=true (Sad but true. Left untouched.. Will that make a difference on 
SCSI?)
wal_buffers = 65536 
wal_files = 64 

Now the requirements

Initial flat data load: 250GB of data. This has gone up since my last post; it 
was 150GB earlier..
Ongoing inserts: 5000/sec. 
Number of queries: 4800 queries/hour
Query response time: 10 sec.


Now questions.

1) Instead of copying from a single 12GB data file, will a parallel copy from, 
say, 5 files speed things up? 

A couple of MB of data per sec. to disk is just not saturating it. It's a RAID 5 
setup..
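
As a sketch of what a parallel load driver could look like (assuming the flat file can be split on line boundaries): `split_file` below is real code, but the database name ("mydb"), table name ("tablename"), and the exact psql invocation are placeholders. Whether this helps depends on where the bottleneck actually is; with fsync on, WAL flushes may still serialize the writers.

```python
# Sketch of a parallel COPY driver. split_file is real; the psql command
# line, database name ("mydb"), and table name ("tablename") are placeholders.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def split_file(path, n_parts, out_prefix="part_"):
    """Distribute lines of `path` round-robin into n_parts files,
    streaming so a 12GB source is never held in memory."""
    outs = [open(f"{out_prefix}{i}.dat", "w") for i in range(n_parts)]
    with open(path) as src:
        for i, line in enumerate(src):
            outs[i % n_parts].write(line)
    for o in outs:
        o.close()
    return [o.name for o in outs]

def copy_part(part):
    # One backend per chunk; each connection runs its own COPY.
    subprocess.run(
        ["psql", "-d", "mydb", "-c",
         f"COPY tablename FROM '{os.path.abspath(part)}'"],
        check=True)

def parallel_load(path, n_parts=5):
    parts = split_file(path, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        list(pool.map(copy_part, parts))
```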

2) sort_mem = 12K, i.e. 94MB, sounds good enough to me. Does this need a further 
increase to improve CREATE INDEX performance?

3) 5K concurrent inserts with an index on: will this need additional CPU power, 
like deploying it on dual RISC CPUs etc.? 

4) Query performance is not a problem. Though at 4.8K queries per hour with an 
expected response time of 10 sec. each, my guess is some serious CPU power will 
be chewed there too..

5) Will upgrading to 7.2.2/7.3 beta help?

All in all, in the test we didn't see the hardware saturated to its limits, so 
effectively we are not able to get PostgreSQL to make full use of it. Just 
pushing up WAL and shared buffers does not seem to be the solution.

If you guys have any suggestions, let me know. I need them all..

MySQL is almost out because it has been creating the index for the last 17 
hours. I don't think it will keep up with 5K inserts per sec. with an index. SAP 
DB is under evaluation too. But PostgreSQL is the favourite as of now because it 
works. So I need to come up with solutions to the problems that will occur in 
the near future..;-)

TIA..

Bye
 Shridhar

--
Law of Procrastination: Procrastination avoids boredom; one never has the 
feeling that there is nothing important to do.


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])



Re: [HACKERS] Performance while loading data and indexing

2002-09-26 Thread Shridhar Daithankar

On 26 Sep 2002 at 14:05, Shridhar Daithankar wrote:
> Some time back I posted a query to build a site with 150GB of database. In last 
> couple of weeks, lots of things were tested at my place and there are some 
> results and again some concerns. 

> 2) Creating index takes huge amount of time.
> Load time: 14581 sec/~8600 rows persec/~ an MB of data per sec.
> Create unique composite index on 2 char and a timestamp field:  25226 sec.
> Database size on disk: 26GB
> Select query: 1.5 sec. for approx. 150 rows.

> 2) Sort mem.=12K i.e. 94MB, sounds good enough to me. Does this need further 
> addition to improve create index performance?

Just a thought: if I sort the table before creating an index, would that be 
faster than creating the index on the raw table? And if so, how do I sort the 
table without duplicating it?

Just a wild thought..

Bye
 Shridhar

--
linux: the choice of a GNU generation ([EMAIL PROTECTED] put this on T-shirts in 
'93)





Re: [HACKERS] Performance while loading data and indexing

2002-09-26 Thread Shridhar Daithankar

On 26 Sep 2002 at 10:51, [EMAIL PROTECTED] wrote:

> Hi,
> it seems you have to cluster it, I don't think you have another choice.

Hmm.. that didn't occur to me... I guess some real-time clustering like usogres 
would do, unless it turns out to be a performance hog..

But this is just insert and select; no updates, no deletes (unless the customer 
makes a 180-degree turn), so I doubt clustering will help. At most I can 
replicate data across machines and spread queries over them. Replication 
overhead as a downside, and low query load on each machine as an upside..

> I'm retrieving the configuration of our postgres servers (I'm out of office
> now), so I can send it to you. I was quite desperate about performance, and
> I was thinking of migrating the data to an Oracle database. Then I found this
> configuration on the net, and I had a successful increase of performance.

In this case, we went with PostgreSQL because we/our customer wants to keep the 
costs down..:-) They are even asking now if it's possible to keep hardware 
costs down as well. That's getting some funny responses here, but I digress..

> Maybe this can help you.
> 
> Why do you use copy to insert records? I usually use Perl scripts, and they
> work well.

Performance reasons. As I said in one of my earlier posts, putting up to 100K 
records in one transaction, in steps of 10K, did not reach the performance of 
COPY. As Tom rightly said, it was a 4:1 ratio despite using transactions..
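
The COPY-vs-INSERT gap comes largely from per-statement and per-commit overhead. As a rough illustration (not PostgreSQL itself), the same batching effect shows up with Python's stdlib SQLite driver; the table and row data here are made up:

```python
# Rough illustration of batching overhead, using Python's stdlib SQLite
# driver as a stand-in (this is NOT PostgreSQL; table and rows are made up).
# Compares one commit per row against one transaction for everything.
import sqlite3
import time

rows = [(i, f"{i:010d}") for i in range(10_000)]

def load_per_row(con):
    for r in rows:
        con.execute("INSERT INTO t VALUES (?, ?)", r)
        con.commit()                  # per-row commit: worst case

def load_batched(con):
    with con:                         # one transaction for all rows
        con.executemany("INSERT INTO t VALUES (?, ?)", rows)

for loader in (load_per_row, load_batched):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER, esn TEXT)")
    t0 = time.perf_counter()
    loader(con)
    print(f"{loader.__name__}: {time.perf_counter() - t0:.3f} sec")
```

On a disk-backed database the gap is much larger, since every commit forces a flush, which mirrors the fsync cost discussed earlier in the thread.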

Thanks once again..
Bye
 Shridhar

--
Secretary's Revenge: Filing almost everything under "the".





Re: [HACKERS] Performance while loading data and indexing

2002-09-26 Thread Shridhar Daithankar

On 26 Sep 2002 at 11:50, Mario Weilguni wrote:

> >Well the test runs were for 10GB of data. Schema is attached. Read in fixed 
> >fonts.. Last nullable fields are dummies but may be used in future, and 
> >varchars are not acceptable (not my requirement). Tuple size is around 100 bytes..
> >The index creation query was
> >
> >CREATE INDEX index1 ON tablename (esn,min,datetime);
> >
> >What if I put datetime ahead? It's likely that the datetime field will have a 
> >high degree of locality, being log data..
> 
> Just an idea, I noticed you use char(10) for esn and min, and use this as an
> index. Are these really fixed-length fields, all having 10 bytes? Otherwise
> varchar(10) would be better, because your tables, and especially the indices,
> will probably be much smaller.
> 
> What average length do you have for min and esn?

10 bytes. Those are ID numbers.. like phone numbers, they always have all the 
digits filled in..



Bye
 Shridhar

--
Bradley's Bromide: If computers get too powerful, we can organize them into a 
committee -- that will do them in.





Re: [HACKERS] Performance while loading data and indexing

2002-09-26 Thread Mario Weilguni

On 26 Sep 2002 at 14:05, Shridhar Daithankar wrote:
> Some time back I posted a query to build a site with 150GB of database. In 
> last couple of weeks, lots of things were tested at my place and there are 
> some results and again some concerns. 

> 2) Creating index takes huge amount of time.
> Load time: 14581 sec/~8600 rows persec/~ an MB of data per sec.
> Create unique composite index on 2 char and a timestamp field:  25226 sec.
> Database size on disk: 26GB
> Select query: 1.5 sec. for approx. 150 rows.

I never tried 150GB of data, but 10GB of data, and this worked fine for me. 
Maybe it will help if you post your table schema, including which indexes you
use, and the average size of one tuple.





Re: [HACKERS] Performance while loading data and indexing

2002-09-26 Thread Shridhar Daithankar

On 26 Sep 2002 at 11:17, Mario Weilguni wrote:

> On 26 Sep 2002 at 14:05, Shridhar Daithankar wrote:
> > Some time back I posted a query to build a site with 150GB of database. In 
> > last couple of weeks, lots of things were tested at my place and there are 
> > some results and again some concerns. 
> 
> > 2) Creating index takes huge amount of time.
> > Load time: 14581 sec/~8600 rows persec/~ an MB of data per sec.
> > Create unique composite index on 2 char and a timestamp field:  25226 sec.
> > Database size on disk: 26GB
> > Select query: 1.5 sec. for approx. 150 rows.
> 
> I never tried 150GB of data, but 10GB of data, and this worked fine for me. 
> Maybe it will help if you post your table schema, including which indexes you
> use, and the average size of one tuple.

Well, the test runs were for 10GB of data. Schema is attached; read in fixed 
fonts.. The last nullable fields are dummies but may be used in future, and 
varchars are not acceptable (not my requirement). Tuple size is around 100 bytes..

The index creation query was

CREATE INDEX index1 ON tablename (esn,min,datetime);

What if I put datetime ahead? It's likely that the datetime field will have a 
high degree of locality, being log data..

Bye
 Shridhar

--
brain, v: [as in "to brain"] To rebuke bluntly, but not pointedly; to dispel a 
source of error in an opponent. -- Ambrose Bierce, "The Devil's Dictionary"




Field Name  Field Type  NullableIndexed
typeint no  no
esn char (10)   no  yes
min char (10)   no  yes
datetimetimestamp   no  yes
opc0char (3)no  no
opc1char (3)no  no
opc2char (3)no  no
dpc0char (3)no  no
dpc1char (3)no  no
dpc2char (3)no  no
npa char (3)no  no
nxx char (3)no  no
restchar (4)no  no
field0  int yes no
field1  char (4)yes no
field2  int yes no
field3  char (4)yes no
field4  int yes no
field5  char (4)yes no
field6  int yes no
field7  char (4)yes no
field8  int yes no
field9  char (4)yes no







Re: [HACKERS] Performance while loading data and indexing

2002-09-26 Thread Mario Weilguni

>Well the test runs were for 10GB of data. Schema is attached. Read in fixed 
>fonts.. Last nullable fields are dummies but may be used in future, and 
>varchars are not acceptable (not my requirement). Tuple size is around 100 bytes..
>The index creation query was
>
>CREATE INDEX index1 ON tablename (esn,min,datetime);
>
>What if I put datetime ahead? It's likely that the datetime field will have a 
>high degree of locality, being log data..

Just an idea: I noticed you use char(10) for esn and min, and use this as an
index. Are these really fixed-length fields, all having 10 bytes? Otherwise
varchar(10) would be better, because your tables, and especially the indices,
will probably be much smaller.

What average length do you have for min and esn?




Re: [HACKERS] Performance while loading data and indexing

2002-09-26 Thread Tom Lane

"Shridhar Daithankar" <[EMAIL PROTECTED]> writes:
> On 26 Sep 2002 at 11:50, Mario Weilguni wrote:
>> Just an idea, I noticed you use char(10) for esn and min, and use this as
>> index. Are these really fixed len fields all having 10 bytes?

> 10 bytes. Those are id numbers.. like phone numbers always have all the digits 
> filled in..

If they are numbers, can you store them as bigints instead of char(N)?

regards, tom lane
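
A quick sketch of this suggestion, with hypothetical values: a fully populated 10-digit ID round-trips through a 64-bit integer (PostgreSQL bigint is 8 bytes, versus 10-plus bytes for char(10)), provided leading zeros are re-padded on output.

```python
# Hypothetical values showing that a fully populated 10-digit ID
# round-trips through a 64-bit integer (PostgreSQL bigint is 8 bytes;
# char(10) needs at least 10, plus overhead).
esn = "0987654321"
as_int = int(esn)              # the value stored in a bigint column
assert as_int < 2**63          # well within bigint range
restored = f"{as_int:010d}"    # zero-pad on the way back out
assert restored == esn
```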




Re: [GENERAL] [HACKERS] Performance while loading data and indexing

2002-09-26 Thread Martijn van Oosterhout

On Thu, Sep 26, 2002 at 03:01:35PM +0530, Shridhar Daithankar wrote:
Content-Description: Mail message body
> The index creation query was
> 
> CREATE INDEX index1 ON tablename (esn,min,datetime);
> 
> What if I put datetime ahead? It's likely that the datetime field will have a 
> high degree of locality, being log data..

The order of fields depends on what you're using it for. For example, you
can use the above index for a query using the conditions:

esn = 'aaa' 
esn = 'bbb' and min = 'xxx'

but not for queries with only

datetime = '2002-09-26'
min = 'ddd' and datetime = '2002-10-02'

The fields can only be used left to right. This is where a single
multicolumn index differs from multiple indexes of different columns.

Have you used EXPLAIN ANALYSE to determine whether your indexes are being
used optimally?
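
The leftmost-prefix rule described above can be demonstrated with SQLite from the Python stdlib, used here only as a stand-in (PostgreSQL btree indexes follow the same principle); the table mirrors the schema in this thread:

```python
# Demonstrating the leftmost-prefix rule for a composite index, using
# stdlib SQLite as a stand-in for a PostgreSQL btree index.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE calls (esn TEXT, min TEXT, datetime TEXT, npa TEXT)")
con.execute("CREATE INDEX index1 ON calls (esn, min, datetime)")

def plan(sql):
    """Return the query plan as one string."""
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Leading columns present: the composite index is usable (SEARCH ... USING INDEX).
print(plan("SELECT * FROM calls WHERE esn = 'aaa' AND min = 'xxx'"))
# Only a non-leading column: the index cannot be used; full table scan.
print(plan("SELECT * FROM calls WHERE datetime = '2002-09-26'"))
```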

-- 
Martijn van Oosterhout   <[EMAIL PROTECTED]>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.




Re: [GENERAL] [HACKERS] Performance while loading data and indexing

2002-09-26 Thread scott.marlowe

If you are seeing very slow performance on a drive set, check dmesg to see 
if you're getting SCSI bus errors or something similar.  If your drives 
aren't properly terminated then the performance will suffer a great deal.

