Re: [GENERAL] Performance issues when the number of records are around 10 Million

2010-05-11 Thread Brian Modra
On 11/05/2010, Sergey Konoplev  wrote:
> On 11 May 2010 10:18, venu madhav  wrote:
>> select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name,
>> e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
>> e.wifi_addr_2, e.view_status, bssid  FROM event e, signature s WHERE
>> s.sig_id = e.signature   AND e.timestamp >= '1270449180' AND
>> e.timestamp < '1273473180'  ORDER BY e.cid DESC,  e.cid DESC limit 21
>> offset 10539780;
>>
>> Can anyone suggest a better solution to improve the performance?
>> Please let me know if you have any further queries.
>
> 1. Avoid using large OFFSETs. Instead do "... ORDER BY e.cid, e.cid LIMIT 21;"
> 2. What does "EXPLAIN SELECT ..." show?
> 3. What does "\d event" print?
>
> --
> Sergey Konoplev
>
> Blog: http://gray-hemp.blogspot.com /
> Linkedin: http://ru.linkedin.com/in/grayhemp /
> JID/GTalk: gray...@gmail.com / Skype: gray-hemp / ICQ: 29353802
>

Have you checked whether the indexes contain lots of dead entries?
Try creating new indexes and then dropping the old ones (or just
use REINDEX if this is not an online production database).
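
A rough sketch of both options, assuming the bloated index is on
event.cid (the index name is an assumption; CREATE INDEX CONCURRENTLY
avoids blocking writers on 8.2 and later):

-- build a replacement index without blocking writes, then swap it in
CREATE INDEX CONCURRENTLY event_cid_new_idx ON event (cid);
DROP INDEX event_cid_idx;
ALTER INDEX event_cid_new_idx RENAME TO event_cid_idx;

-- or, if exclusive locks are acceptable, rebuild everything in place
REINDEX TABLE event;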

-- 
Brian Modra   Land line: +27 23 5411 462
Mobile: +27 79 69 77 082
5 Jan Louw Str, Prince Albert, 6930
Postal: P.O. Box 2, Prince Albert 6930
South Africa
http://www.zwartberg.com/



Re: [GENERAL] Performance issues when the number of records are around 10 Million

2010-05-11 Thread A. Kretschmer
In response to venu madhav :
> Hi all,
>    In my database application, I have a table whose row count can
> reach 10 million, and insertions can arrive at a fast rate, around 100
> per second at peak times. I configured postgres to do
> auto vacuum on an hourly basis. I have a frontend GUI application in CGI
> which displays the data from the database.
> When I try to get the last twenty records from the database,
> it takes around 10-15 minutes to complete the operation. This is the query
> which is used:
> 
> select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name,
> e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
> e.wifi_addr_2, e.view_status, bssid  FROM event e, signature s WHERE
> s.sig_id = e.signature   AND e.timestamp >= '1270449180' AND
> e.timestamp < '1273473180'  ORDER BY e.cid DESC,  e.cid DESC limit 21
> offset 10539780;

First, show us the table definitions for both tables.
Secondly, the output generated by EXPLAIN ANALYSE.

I'm surprised by the "e.timestamp >= '1270449180'": is this really a
TIMESTAMP column?

And, to retrieve the last twenty records you should write:

ORDER BY ts DESC LIMIT 20


With a proper index on this column, this should force an index scan.
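
A minimal sketch, assuming the ordering column really is event.timestamp
and has no index yet (the index name is made up):

CREATE INDEX event_timestamp_idx ON event (timestamp);

SELECT *
  FROM event
 ORDER BY timestamp DESC
 LIMIT 20;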


Andreas
-- 
Andreas Kretschmer
Contact:  Heynitz: 035242/47150,   D1: 0160/7141639 (more: -> header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99



Re: [GENERAL] Performance issues when the number of records are around 10 Million

2010-05-10 Thread Sergey Konoplev
On 11 May 2010 10:18, venu madhav  wrote:
> select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name,
> e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
> e.wifi_addr_2, e.view_status, bssid  FROM event e, signature s WHERE
> s.sig_id = e.signature   AND e.timestamp >= '1270449180' AND
> e.timestamp < '1273473180'  ORDER BY e.cid DESC,  e.cid DESC limit 21
> offset 10539780;
>
> Can anyone suggest a better solution to improve the performance?
> Please let me know if you have any further queries.

1. Avoid using large OFFSETs. Instead do "... ORDER BY e.cid, e.cid LIMIT 21;" and keep track of where the previous page ended rather than using OFFSET (a keyset-pagination sketch follows below).
2. What does "EXPLAIN SELECT ..." show?
3. What does "\d event" print?
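
A hedged sketch of that approach (table and column names are taken from
the original query; the cid value below is just a stand-in for the last
cid shown on the previous page):

SELECT e.cid, e.timestamp, s.sig_class, s.sig_priority, s.sig_name
  FROM event e
  JOIN signature s ON s.sig_id = e.signature
 WHERE e.timestamp >= '1270449180'
   AND e.timestamp < '1273473180'
   AND e.cid < 10539780          -- last cid seen on the previous page
 ORDER BY e.cid DESC
 LIMIT 21;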

-- 
Sergey Konoplev

Blog: http://gray-hemp.blogspot.com /
Linkedin: http://ru.linkedin.com/in/grayhemp /
JID/GTalk: gray...@gmail.com / Skype: gray-hemp / ICQ: 29353802



[GENERAL] Performance issues when the number of records are around 10 Million

2010-05-10 Thread venu madhav
Hi all,
   In my database application, I have a table whose row count can
reach 10 million, and insertions can arrive at a fast rate, around 100
per second at peak times. I configured postgres to do
auto vacuum on an hourly basis. I have a frontend GUI application in CGI
which displays the data from the database.
When I try to get the last twenty records from the database,
it takes around 10-15 minutes to complete the operation. This is the query
which is used:

select e.cid, timestamp, s.sig_class, s.sig_priority, s.sig_name,
e.sniff_ip, e.sniff_channel, s.sig_config, e.wifi_addr_1,
e.wifi_addr_2, e.view_status, bssid  FROM event e, signature s WHERE
s.sig_id = e.signature   AND e.timestamp >= '1270449180' AND
e.timestamp < '1273473180'  ORDER BY e.cid DESC,  e.cid DESC limit 21
offset 10539780;

Can anyone suggest a better solution to improve the performance?
Please let me know if you have any further queries.


Thank you,
Venu



Re: [GENERAL] performance issues on windows with 8.3.0?

2008-02-15 Thread Dan Armbrust
Thanks for all the help.  Performance is back where I thought it
should be, after I fixed our pooling bug.

I didn't think that postgres would be released with performance issues
like that - it's just too good :)

Thanks,

Dan



Re: [GENERAL] performance issues on windows with 8.3.0?

2008-02-15 Thread Dave Page
On Thu, Feb 14, 2008 at 7:56 PM, Dan Armbrust
<[EMAIL PROTECTED]> wrote:
> On Thu, Feb 14, 2008 at 1:31 PM, Dave Page <[EMAIL PROTECTED]> wrote:
>
> > You must have enabled the debugger when you installed (or didn't
>  >  disable it). You can turn it back off in postgresql.conf if you like -
>  >  there may be a little overhead.
>
>  I see this in the postgresql.conf file:
>  shared_preload_libraries = '$libdir/plugins/plugin_debugger.dll'   # (change requires restart)

Yup, just set it blank and restart.

>  I didn't turn that on, perhaps I missed it in the installer, or it
>  defaults to on in the installer.  I commented it out, and now my
>  performance on 8.3 on windows is in line with what I am seeing with
>  8.2.6 on windows.  That is a lot of overhead, when a connection is
>  made (I realize I shouldn't be making connections this often - but I
>  wonder if the overhead is only at connection time, or if there is
>  other overhead as well)

I'm surprised there's so much overhead, but not that there is some.
Any runtime overhead will be in pl/pgsql functions so if you're not
using any, you won't see any difference once connected.

>  >  However, the fact that it keeps appearing implies you're using lots of
>  >  new (short-lived?) connections. That's particularly expensive on
>  >  Windows - consider a connection pooler, of if you're using something
>  >  like php, persistent connections.
>  >
>  You're right, my connections are dropping off left and right.  My
>  performance on windows is about 75% slower than the performance on
>  linux.  It appears that I'm currently dropping connections on both
>  windows and linux.  Does that fall in line with your expectation that
>  creating connections on windows is particularly expensive?

I couldn't quote a figure, but I'm not surprised it's noticeably slower.

>  I'll go figure out why on earth my connections are getting killed and
>  recreated by the pooling layers.

Sounds like a good plan :-)

-- 
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com
The Oracle-compatible database company



Re: [GENERAL] performance issues on windows with 8.3.0?

2008-02-14 Thread Magnus Hagander

Dan Armbrust wrote:

Are there any known regression issues WRT performance on the 8.3.0.1
binary build for windows?

And I mean serious -multiple orders of magnitude- performance issues
running simple queries on a small database...

A little more background.  I built 8.3.0 on CentOS 5 today.  Started
using it with a server application that I have, to do some
benchmarking.  Java application connecting via jdbc to Postgres on
localhost.  Everything looked great.  Perhaps even faster than 8.2 -
but I don't have the benchmarks to say for sure yet.

Then, I wanted to test what is usually our fastest performing setup -
with the database on a second system.  So, I installed the new 8.3 on
a Windows 2003 system, started it up, created my database, pointed my
java app to it and everything seemed good.

Then I put a load on it - and my server throughput went from about 1500
messages per second on the linux-localhost combination to 30 on the
remote windows postgres installation.

I will be investigating this much more completely first thing tomorrow
- but I figured I'd ask to see if there was something obvious first...


How's the performance compared to 8.2 on the same machine? From what 
you're saying, I take it you already have that there?


If not, you need to look at a whole lot more things. It *will* be 
slower on windows than on linux, and significantly so, but it shouldn't 
be several orders of magnitude. But you need to narrow down the testing 
a bit - for example, one of your tests runs over the network and one 
runs locally, so that could be where your issue is. It could also be 
that your hard drives under linux are lying about fsync, whereas the 
windows version will by default write through such write caches even if 
they exist. There are many different things that could be the reason here.


//Magnus



Re: [GENERAL] performance issues on windows with 8.3.0?

2008-02-14 Thread Dan Armbrust
So, my ill Postgres 8.3 database is filling up log files in the pg_log
directory with the following:

2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"
2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"
2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"
2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"
2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"
2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"
2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"
2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"
2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"
2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"
2008-02-13 18:29:31 CST LOG:  loaded library
"$libdir/plugins/plugin_debugger.dll"


Does this mean anything to anyone?

I can hardly believe that I'm the first one to notice that the
windows build of 8.3 has serious issues.

Thanks,

Dan



[GENERAL] performance issues on windows with 8.3.0?

2008-02-13 Thread Dan Armbrust
Are there any known regression issues WRT performance on the 8.3.0.1
binary build for windows?

And I mean serious -multiple orders of magnitude- performance issues
running simple queries on a small database...

A little more background.  I built 8.3.0 on CentOS 5 today.  Started
using it with a server application that I have, to do some
benchmarking.  Java application connecting via jdbc to Postgres on
localhost.  Everything looked great.  Perhaps even faster than 8.2 -
but I don't have the benchmarks to say for sure yet.

Then, I wanted to test what is usually our fastest performing setup -
with the database on a second system.  So, I installed the new 8.3 on
a Windows 2003 system, started it up, created my database, pointed my
java app to it and everything seemed good.

Then I put a load on it - and my server throughput went from about 1500
messages per second on the linux-localhost combination to 30 on the
remote windows postgres installation.

I will be investigating this much more completely first thing tomorrow
- but I figured I'd ask to see if there was something obvious first...

Thanks,

Dan



Re: [GENERAL] Performance Issues

2007-09-27 Thread Peter Childs
On 23/09/2007, Alvaro Herrera <[EMAIL PROTECTED]> wrote:
>
> Christian Schröder wrote:
> > Alvaro Herrera wrote:
> >> Christian Schröder wrote:
> >>
> >>
> >>> I think it is my job as db admin to make the database work the way my
> >>> users need it, and not the user's job to find a solution that fits the
> >>> database's needs ...
> >>>
> >>> Is there really nothing that I can do?
> >>>
> >>
> >> You can improve the selectivity estimator function.  One idea is that if
> >> you are storing something that's not really a general character string,
> >> develop a specific datatype, with a more precise selectivity estimator.
> >> If you are up to coding in C, that is.
> >>
> >
> > Hm, that sounds interesting! I will definitely give it a try.
> > Will that also solve the problem of combining more than one of these
> > conditions? As far as I can see, the main issue at the moment is that we
> > often have "... where test like '11%' and test not like '113%'" in our
> > queries. Even if the selectivity estimation of the single condition will be
> > improved, it will still be wrong to multiply the selectivities.
>
> Unless you can come up with an operator that expresses better the
> "starts with 11 but not with 113" type of condition.  For example if
> these were telephone number prefixes or something like that, probably
> there's some way to do that in a single operation instead of two, and
> the selectivity function could produce a much more accurate estimate
> saving the need to multiply.



select a from b where a ~ '^11[^3]'

Is that what you want?

I usually find using ~ far better than like.
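
Note that '^11[^3]' also requires at least one character after the "11",
so a closer equivalent of the two LIKE clauses might be (a sketch,
assuming the goal really is "starts with 11 but not with 113"):

select a from b where a ~ '^11' and a !~ '^113';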

Peter Childs


--
> Alvaro Herrera
> http://www.advogato.org/person/alvherre
> "I think my standards have lowered enough that now I think 'good design'
> is when the page doesn't irritate the living f*ck out of me." (JWZ)
>


Re: [GENERAL] Performance Issues

2007-09-23 Thread Alvaro Herrera
Christian Schröder wrote:
> Alvaro Herrera wrote:
>> Christian Schröder wrote:
>>
>>   
>>> I think it is my job as db admin to make the database work the way my 
>>> users need it, and not the user's job to find a solution that fits the 
>>> database's needs ...
>>>
>>> Is there really nothing that I can do?
>>> 
>>
>> You can improve the selectivity estimator function.  One idea is that if
>> you are storing something that's not really a general character string,
>> develop a specific datatype, with a more precise selectivity estimator.
>> If you are up to coding in C, that is.
>>   
>
> Hm, that sounds interesting! I will definitely give it a try.
> Will that also solve the problem of combining more than one of these 
> conditions? As far as I can see, the main issue at the moment is that we 
> often have "... where test like '11%' and test not like '113%'" in our 
> queries. Even if the selectivity estimation of the single condition will be 
> improved, it will still be wrong to multiply the selectivities.

Unless you can come up with an operator that expresses better the
"starts with 11 but not with 113" type of condition.  For example if
these were telephone number prefixes or something like that, probably
there's some way to do that in a single operation instead of two, and
the selectivity function could produce a much more accurate estimate
saving the need to multiply.

-- 
Alvaro Herrera                          http://www.advogato.org/person/alvherre
"I think my standards have lowered enough that now I think 'good design'
is when the page doesn't irritate the living f*ck out of me." (JWZ)



Re: [GENERAL] Performance Issues

2007-09-21 Thread Christian Schröder

Alvaro Herrera wrote:

Christian Schröder wrote:

  
I think it is my job as db admin to make the database work the way my users 
need it, and not the user's job to find a solution that fits the database's 
needs ...


Is there really nothing that I can do?



You can improve the selectivity estimator function.  One idea is that if
you are storing something that's not really a general character string,
develop a specific datatype, with a more precise selectivity estimator.
If you are up to coding in C, that is.
  


Hm, that sounds interesting! I will definitely give it a try.
Will that also solve the problem of combining more than one of these 
conditions? As far as I can see, the main issue at the moment is that we 
often have "... where test like '11%' and test not like '113%'" in our 
queries. Even if the selectivity estimation of the single condition will 
be improved, it will still be wrong to multiply the selectivities.


I think I will have a look at the src/backend/optimizer/util/plancat.c, 
src/backend/optimizer/path/clausesel.c and 
src/backend/utils/adt/selfuncs.c files after my holiday.


Kind regards,
   Christian

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax:  +49 551 489500-91
Hans-Böckler-Straße 2  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer





Re: [GENERAL] Performance Issues

2007-09-21 Thread Alvaro Herrera
Christian Schröder wrote:

> I think it is my job as db admin to make the database work the way my users 
> need it, and not the user's job to find a solution that fits the database's 
> needs ...
>
> Is there really nothing that I can do?

You can improve the selectivity estimator function.  One idea is that if
you are storing something that's not really a general character string,
develop a specific datatype, with a more precise selectivity estimator.
If you are up to coding in C, that is.

-- 
Alvaro Herrera                          http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [GENERAL] Performance Issues

2007-09-20 Thread Christian Schröder

John D. Burger wrote:

Christian Schröder wrote:

Or would it be possible to tweak how the planner determines the 
selectivity? I have read in the docs (chapter 54.1) that in case of 
more than one condition in the where clause, independency is assumed. 
In my case ("... where test like '11%' and test not like '113%'") 
this is clearly not the case, so it might be an interesting point to 
address.


I think the planner does think about the interactions of inequalities, 
so if you can express your query with less-than and friends, or even 
with BETWEEN, you might get a better plan.  I don't know the details 
of your setup, but you can do things like this with any ordered type:


   where test between '11' and '113'
or test >= '114'

I know this does not match the exact semantics of your query, but 
hopefully you get the idea.


There are two drawbacks of this solution:

  1. It is not always possible to rewrite the "like" or "substring"
 queries with standard relational operators.
  2. It is annoying for my users that they have to tweak the query
 until they find a solution that takes 5 seconds to finish instead
 of 4 hours.

I think it is my job as db admin to make the database work the way my 
users need it, and not the user's job to find a solution that fits the 
database's needs ...


Is there really nothing that I can do?

Regards,
   Christian

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax:  +49 551 489500-91
Hans-Böckler-Straße 2  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer





Re: [GENERAL] Performance Issues (was: "like" vs "substring" again)

2007-09-18 Thread John D. Burger

Christian Schröder wrote:

Or would it be possible to tweak how the planner determines the  
selectivity? I have read in the docs (chapter 54.1) that in case of  
more than one condition in the where clause, independency is  
assumed. In my case ("... where test like '11%' and test not like  
'113%'") this is clearly not the case, so it might be an  
interesting point to address.


I think the planner does think about the interactions of  
inequalities, so if you can express your query with less-than and  
friends, or even with BETWEEN, you might get a better plan.  I don't  
know the details of your setup, but you can do things like this with  
any ordered type:


   where test between '11' and '113'
or test >= '114'

I know this does not match the exact semantics of your query, but  
hopefully you get the idea.


- John D. Burger
  MITRE


[GENERAL] Performance Issues (was: "like" vs "substring" again)

2007-09-18 Thread Christian Schröder

Hi list,
I am still fighting with the really slow database queries (see 
http://www.nabble.com/%22like%22-vs-%22substring%22-again-t4447906.html), 
and I still believe that the cause of the problem is that the query 
planner makes incorrect estimations about the selectivity of the "where" 
clauses.
I wondered if it is possible to make the query planner perform a 
sequential scan over a table *before* it starts planning. If I know that 
a table has only about 3000 rows, the overhead due to this sequential 
scan can be ignored. On the other hand, this would give the planner an 
exact data basis for its planning.
Or would it be possible to tweak how the planner determines the 
selectivity? I have read in the docs (chapter 54.1) that in case of more 
than one condition in the where clause, independency is assumed. In my 
case ("... where test like '11%' and test not like '113%'") this is 
clearly not the case, so it might be an interesting point to address.

Do you have any other tips for me?

Kind regards,
   Christian

--
Deriva GmbH Tel.: +49 551 489500-42
Financial IT and Consulting Fax:  +49 551 489500-91
Hans-Böckler-Straße 2  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer





Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-16 Thread Brian Wipf

On 16-May-07, at 4:05 PM, PFC wrote:

This makes queries hard to optimize. Consider the table (user_id,  
item_id) meaning user selected this item as favourite.
	If you want to know which users did select both items 1 and 2, you  
have to do a self-join, something like :


SELECT... FROM favourites a, favourites b WHERE a.user_id =  
b.user_id AND a.item_id=1 AND b.item_id = 2


You could get users who have selected both items 1 and 2 without  
doing a self-join with a query like the following:


select user_id from favourites where item_id = 1 or item_id = 2 group
by user_id having count(*) = 2;

Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-16 Thread PFC
We use bitfields on our large user table.  It is becoming unworkable to  
scan for matches, since overall most people have very few selections  
made.


We are moving it to a model like your favorite_colors table which just  
links the option and the user.


We find that doing joins on large tables which can be indexed to avoid  
full table scans are very fast in postgres, since the index can do much  
of your culling of potential matching rows.  With bitfields, you are  
more or less forced into doing a sequence scan to find everyone who  
likes the color red.


Of course, if you're playing with only a few thousand users, either  
approach works well.


Things you could try :

* Use an integer array instead of a bitfield
	(for instance, in the users table, the column favourites would contain {1,2} if
the user selected items 1 and 2)
	Then you can make a GiST index on it and use the indexed intersection
operator (a sketch follows at the end of this message)


	This is likely the optimal solution if the maximum number of items is
small (say, 100 is good, 10000 is not)


* keep your bitfields and create conditional indexes :

CREATE INDEX ... WHERE (bitfield_column & 1) <> 0;
CREATE INDEX ... WHERE (bitfield_column & 2) <> 0;
CREATE INDEX ... WHERE (bitfield_column & 4) <> 0;
CREATE INDEX ... WHERE (bitfield_column & 8) <> 0;
CREATE INDEX ... WHERE (bitfield_column & 16) <> 0;
etc...

	Obviously this will only work if you have, say, 10 favourites. 100
indexes on a table would really suck.


Then, when looking for users who chose bits 1 and 2, do:
SELECT ... WHERE (bitfield_column & 1) <> 0 AND (bitfield_column & 2) <> 0
and postgres will do a bitmap AND using the two indexes

(note : when we get bitmap indexes, this will be even better)

* use tsearch2 :

favourites = 'red blue'
and fulltext-search it

* use a favourites table :

	This makes queries hard to optimize. Consider the table (user_id,  
item_id) meaning user selected this item as favourite.
	If you want to know which users did select both items 1 and 2, you have  
to do a self-join, something like :


SELECT... FROM favourites a, favourites b WHERE a.user_id = b.user_id AND  
a.item_id=1 AND b.item_id = 2


	This is likely to be "not very fast" if 1 million users check each option  
but only 100 check both.
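
A hedged sketch of the integer-array option above (table and column
names are assumptions; the gist__int_ops operator class comes from the
intarray contrib module, installed by running its SQL script on 8.x or
via CREATE EXTENSION intarray on later releases):

CREATE TABLE users (
    id         integer PRIMARY KEY,
    favourites integer[] NOT NULL DEFAULT '{}'
);

CREATE INDEX users_favourites_gist ON users USING gist (favourites gist__int_ops);

-- users who selected both items 1 and 2
SELECT id FROM users WHERE favourites @> '{1,2}';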




Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-16 Thread Vivek Khera


On May 15, 2007, at 10:35 AM, Bill Moseley wrote:


For some value of "large", is there a time when one might consider
using a single column in the user or user_prefs table to represent
their color choices instead of a link table?


We use bitfields on our large user table.  It is becoming unworkable  
to scan for matches, since overall most people have very few  
selections made.


We are moving it to a model like your favorite_colors table which  
just links the option and the user.


We find that doing joins on large tables which can be indexed to  
avoid full table scans are very fast in postgres, since the index can  
do much of your culling of potential matching rows.  With bitfields,  
you are more or less forced into doing a sequence scan to find  
everyone who likes the color red.


Of course, if you're playing with only a few thousand users, either  
approach works well.





Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-16 Thread Vivek Khera


On May 14, 2007, at 4:37 PM, Bill Moseley wrote:


Say that there's also about 10 columns of settings or preferences for
each user.  Are there any cases or reasons to have a separate
"user_preferences" table vs. just placing all the columns together in
one table?


when you have, say 65 million users, it makes sense to push the  
ancillary info to another table to keep from having to copy too much  
data when you update the main info (like last access time).




Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread Bill Moseley
On Tue, May 15, 2007 at 09:20:53PM +0200, PFC wrote:
> >From an outside perspective it just
> >seems odd that potentially a large amount of data would be pulled off
> >disk into memory that is never used.  Perhaps there's an overriding
> >reason for this.
> 
>   Yeah, where would you put this data if you didn't put it where it is 
>   now ?

Swish-e isn't a database by any means, but it does have a way to store
column-like meta data for each "row".  When it does a search it only
explicitly pulls from disk the meta data that it's asked to return.
Granted, the OS is reading from disk more than the application is
asking for, but the application is only allocating memory for the data
it's going to return.  And the column (meta data) is not always stored
together on disk.

Without knowing Pg internals I wasn't aware of how the actual table
data was organized and fetched into memory.

>   "Premature optimization is the root of all evil"

Exactly what prompted this thread. ;)

-- 
Bill Moseley
[EMAIL PROTECTED]




Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread PFC



Thus, if there are a whole bunch of columns on each table, the data in
those extra columns (e.g. - all columns aside from "id", the one that
was asked for in the result set) will indeed be drawn into memory.


Yeah, that's what I meant ;)
	All the columns are loaded (except the TOASTed ones which are not  
mentioned in the query) into memory, but only the requested ones are  
processed and returned to the client...



Is that specific to Postgresql?


Nope. All databases do more or less the same.


From an outside perspective it just
seems odd that potentially a large amount of data would be pulled off
disk into memory that is never used.  Perhaps there's an overriding
reason for this.


Yeah, where would you put this data if you didn't put it where it is 
now ?


If you alter tables "customer" and "order", taking some columns off,
and stowing them in separate tables, then you'll find that more tuples
of "customer" and "order" will fit into a buffer page, and that the
join will be assembled with somewhat less memory usage.

Whether or not that is a worthwhile change to make will vary
considerably.


Makes designing the schema a bit tough. ;)


"Premature optimization is the root of all evil"

Build a test database, fill it with data, and experiment.





Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread Steve Atkins


On May 15, 2007, at 12:02 PM, Bill Moseley wrote:


On Tue, May 15, 2007 at 01:42:18PM -0400, Chris Browne wrote:

[EMAIL PROTECTED] (PFC) writes:

SELECT o.id
FROM order o
JOIN customer c on o.customer = c.id

Does that bring into memory all columns from both order and  
customer?

Maybe that's not a good example due to indexes.


No, it just pulls the columns you ask from the table, nothing
less,  nothing more.


That's not quite 100% accurate.

In order to construct the join, the entire pages of the relevant
tuples in tables "order" and "customer" will need to be drawn into
memory.

Thus, if there are a whole bunch of columns on each table, the data in
those extra columns (e.g. - all columns aside from "id", the one that
was asked for in the result set) will indeed be drawn into memory.


Is that specific to Postgresql?  From an outside perspective it just
seems odd that potentially a large amount of data would be pulled off
disk into memory that is never used.  Perhaps there's an overriding
reason for this.


The columns are next to each other on the disk. You need to read
the entire block off disk into system cache, so you'll be reading all
the columns of all the rows in that block into memory.

That's just the way that most (all?) modern filesystems work, and so the
way that most filesystem based databases are going to work. I've seen
some databases that don't store all the main columns of a table together
on disk, but they're fairly rare.

Pushing data into lookaside tables either manually or automatically
via toast changes the tradeoffs.

Cheers,
  Steve






Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread John D. Burger
Thus, if there are a whole bunch of columns on each table, the data in
those extra columns (e.g. - all columns aside from "id", the one that
was asked for in the result set) will indeed be drawn into memory.


Is that specific to Postgresql?  From an outside perspective it just
seems odd that potentially a large amount of data would be pulled off
disk into memory that is never used.  Perhaps there's an overriding
reason for this.


Anything else would seem odd to me.  Pulling a page into memory  
typically has OS support, and is thus very fast.  Picking and  
choosing bits and pieces to read would be prohibitively slow.   
Moreover, caching only those bits and pieces would require  
complicated code to decide whether the cached data is relevant to the  
next query.  Validating cached data at the page level is much  
simpler, and thus faster.


Or so I assume ...

- John D. Burger
  MITRE





Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread Bill Moseley
On Tue, May 15, 2007 at 01:42:18PM -0400, Chris Browne wrote:
> [EMAIL PROTECTED] (PFC) writes:
> >> SELECT o.id
> >> FROM order o
> >> JOIN customer c on o.customer = c.id
> >>
> >> Does that bring into memory all columns from both order and customer?
> >> Maybe that's not a good example due to indexes.
> >
> > No, it just pulls the columns you ask from the table, nothing
> > less,  nothing more.
> 
> That's not quite 100% accurate.
> 
> In order to construct the join, the entire pages of the relevant
> tuples in tables "order" and "customer" will need to be drawn into
> memory.
> 
> Thus, if there are a whole bunch of columns on each table, the data in
> those extra columns (e.g. - all columns aside from "id", the one that
> was asked for in the result set) will indeed be drawn into memory.

Is that specific to Postgresql?  From an outside perspective it just
seems odd that potentially a large amount of data would be pulled off
disk into memory that is never used.  Perhaps there's an overriding
reason for this.

> If you alter tables "customer" and "order", taking some columns off,
> and stowing them in separate tables, then you'll find that more tuples
> of "customer" and "order" will fit into a buffer page, and that the
> join will be assembled with somewhat less memory usage.
> 
> Whether or not that is a worthwhile change to make will vary
> considerably.

Makes designing the schema a bit tough. ;)

-- 
Bill Moseley
[EMAIL PROTECTED]




Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread SCassidy
One other possible reason for splitting the table up in two chunks is to 
grant different rights on the 2 sets of columns.

Susan Cassidy




On 05/15/2007 09:44 AM, Bill Moseley <[EMAIL PROTECTED]> wrote to the Postgres General list:

Sorry, I don't mean to drag this thread out much longer.  But, I have
one more question regarding joins.

Say I have a customer table and an order table.  I want a list of all
order id's for a given customer.

SELECT o.id
FROM order o
JOIN customer c on o.customer = c.id

Does that bring into memory all columns from both order and customer?
Maybe that's not a good example due to indexes.

See, I've seen this splitting of one-to-one tables a number of time
(such as the user and user_preferences example) and I'm not sure if
that's just poor schema design, premature optimization, or someone
making smart use of their knowledge of the internal workings of
Postgresql



-- 
Bill Moseley
[EMAIL PROTECTED]








Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread Chris Browne
[EMAIL PROTECTED] (PFC) writes:
>> SELECT o.id
>> FROM order o
>> JOIN customer c on o.customer = c.id
>>
>> Does that bring into memory all columns from both order and customer?
>> Maybe that's not a good example due to indexes.
>
>   No, it just pulls the columns you ask from the table, nothing
> less,  nothing more.

That's not quite 100% accurate.

In order to construct the join, the entire pages of the relevant
tuples in tables "order" and "customer" will need to be drawn into
memory.

Thus, if there are a whole bunch of columns on each table, the data in
those extra columns (e.g. - all columns aside from "id", the one that
was asked for in the result set) will indeed be drawn into memory.

They may not be drawn into the return set, but they will still be
drawn into memory.

If you alter tables "customer" and "order", taking some columns off,
and stowing them in separate tables, then you'll find that more tuples
of "customer" and "order" will fit into a buffer page, and that the
join will be assembled with somewhat less memory usage.

Whether or not that is a worthwhile change to make will vary
considerably.
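
A hedged sketch of that kind of vertical split, with made-up column
names, keeping the frequently used columns in the narrow table:

CREATE TABLE customer (
    id   integer PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE customer_detail (
    customer_id integer PRIMARY KEY REFERENCES customer(id),
    address     text,
    notes       text
);

-- queries that only need id/name now touch far fewer pages; the wide,
-- rarely used columns are joined in only when actually needed
SELECT c.id, c.name, d.address
  FROM customer c
  JOIN customer_detail d ON d.customer_id = c.id;
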
-- 
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://linuxdatabases.info/info/languages.html
There are two kinds of pedestrians -- the quick and the dead.



Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread PFC



SELECT o.id
FROM order o
JOIN customer c on o.customer = c.id

Does that bring into memory all columns from both order and customer?
Maybe that's not a good example due to indexes.


	No, it just pulls the columns you ask from the table, nothing less,  
nothing more.


	Splitting tables (vertical partitioning) is used to reduce the size of  
the working set that has to fit in RAM... this is a different reason than  
what you're thinking about.




Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread Bill Moseley
On Tue, May 15, 2007 at 07:51:44AM +0200, Dawid Kuroczko wrote:
> On 5/15/07, Bill Moseley <[EMAIL PROTECTED]> wrote:
> >On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
> >> Well, views are not going to help with memory consumption here.
> >> It is the table contents that gets cached in buffer cache, not the
> >> views contents.  So if you have a view which returns only one
> >> column from 15-column table, you will be caching that 15-column
> >> data nonetheless.  View, as the name states, is converted into
> >> a select on a real table.
> >
> >Are you saying that in Postgresql:
> >
> >select first_name, last_name from user_table;
> >
> >uses the same memory as this?
> >
> >select first_name, last_name,
> >password, email,
> >[10 other columns]
> >from user_table;
> 
> Yes.  You read whole page (8KB) into buffer_cache,
> then extract these columns from these buffer.  From the
> buffer cache point of view, whole tuple is contained in the
> cache.

Sorry, I don't mean to drag this thread out much longer.  But, I have
one more question regarding joins.

Say I have a customer table and an order table.  I want a list of all
order id's for a given customer.

SELECT o.id
FROM order o
JOIN customer c on o.customer = c.id

Does that bring into memory all columns from both order and customer?
Maybe that's not a good example due to indexes.

See, I've seen this splitting of one-to-one tables a number of time
(such as the user and user_preferences example) and I'm not sure if
that's just poor schema design, premature optimization, or someone
making smart use of their knowledge of the internal workings of
Postgresql



-- 
Bill Moseley
[EMAIL PROTECTED]




Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread Ben
I'm sure there's a point where you'd be saving a "substantial" amount  
of disk space using a non-normalized scheme, but, like you say, you'd  
be missing out on other things. In general, disks are cheap while the
man hours used to try to fix data corruption are not.


On May 15, 2007, at 7:35 AM, Bill Moseley wrote:


Can anyone provide input on this question?  I'm curious how to look at
this from a disk and memory usage perspective.  Would using a bit
column type help much?

I'm not thrilled by the loss of referential integrity.

On Mon, May 14, 2007 at 01:37:18PM -0700, Bill Moseley wrote:


Say I have a table "color" that has about 20 different rows ("red",
"blue", "green", etc.).  I want the user to be able to select zero or
more favorite colors.  I would typically use a link table:

create table favorite_colors (
color   int references color(id),
user    int references user(id)
);

Now, that table can have a large number of rows if I have a large
number of users and if everyone likes all the colors.

For some value of "large", is there a time when one might consider
using a single column in the user or user_prefs table to represent
their color choices instead of a link table?

table user_prefs (
...
favorite_colors bit varying,
...
);

Where each bit represents the primary key of the colors table.

Seems like poor design, but I'm wondering if there might be overriding
concerns at times.

For example, if I have 1 million users and they each like all colors
and thus have a 20 million row link table how much space would be
saved by using a bit column as above?


--
Bill Moseley
[EMAIL PROTECTED]







Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-15 Thread Bill Moseley
Can anyone provide input on this question?  I'm curious how to look at
this from a disk and memory usage perspective.  Would using a bit
column type help much?

I'm not thrilled by the loss of referential integrity.

On Mon, May 14, 2007 at 01:37:18PM -0700, Bill Moseley wrote:
> 
> Say I have a table "color" that has about 20 different rows ("red",
> "blue", "green", etc.).  I want the user to be able to select zero or
> more favorite colors.  I would typically use a link table:
> 
> create table favorite_colors (
> color   int references color(id),
> user    int references user(id)
> );
> 
> Now, that table can have a large number of rows if I have a large
> number of users and if everyone likes all the colors.
> 
> For some value of "large", is there a time when one might consider
> using a single column in the user or user_prefs table to represent
> their color choices instead of a link table?
> 
> table user_prefs (
> ...
> favorite_colors bit varying,
> ...
> );
> 
> Where each bit represents the primary key of the colors table.
> 
> Seems like poor design, but I'm wondering if there might be overriding
> concerns at times.
> 
> For example, if I have 1 million users and they each like all colors
> and thus have a 20 million row link table how much space would be
> saved by using a bit column as above?

-- 
Bill Moseley
[EMAIL PROTECTED]




Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-14 Thread PFC




Say, if you first SELECT fname, lname FROM user_table;
and then you issue SELECT * FROM user_table; -- the
second select will be returned from buffer cache -- since
all rows are already in the cache.


	...Unless your table contains some large TEXT columns that have been  
stored out of line (TOASTed) by postgres.




Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-14 Thread Dawid Kuroczko

On 5/15/07, Bill Moseley <[EMAIL PROTECTED]> wrote:

On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
> Well, views are not going to help with memory consumption here.
> It is the table contents that gets cached in the buffer cache, not the
> view's contents.  So if you have a view which returns only one
> column from a 15-column table, you will be caching that 15-column
> data nonetheless.  A view, as the name states, is converted into
> a select on a real table.

Are you saying that in Postgresql:

select first_name, last_name from user_table;

uses the same memory as this?

select first_name, last_name,
password, email,
[10 other columns]
from user_table;


Yes.  You read the whole page (8KB) into the buffer cache,
then extract these columns from that buffer.  From the
buffer cache's point of view, the whole tuple is contained in the
cache.

Say, if you first SELECT fname, lname FROM user_table;
and then you issue SELECT * FROM user_table; -- the
second select will be returned from buffer cache -- since
all rows are already in the cache.

Having separate caches for every possible SELECT [column list]
would be, well, not quite efficient.

Now, select fname,lname will take less private memory,
but this memory will be freed as soon as the query finishes,
so this won't help our cache much.
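
One way to see this for yourself (EXPLAIN's BUFFERS option appeared in
releases newer than the ones discussed here, and the table name is an
assumption):

EXPLAIN (ANALYZE, BUFFERS) SELECT first_name, last_name FROM user_table;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM user_table;
-- the "shared hit/read" page counts on the seq scan are the same in
-- both plans, even though the second query returns far more data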

  Regards,
   Dawid



Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-14 Thread Bill Moseley
On Tue, May 15, 2007 at 06:33:26AM +0200, Dawid Kuroczko wrote:
> Well, views are not going to help with memory consumption here.
> It is the table contents that gets cached in the buffer cache, not the
> view's contents.  So if you have a view which returns only one
> column from a 15-column table, you will be caching that 15-column
> data nonetheless.  A view, as the name states, is converted into
> a select on a real table.

Are you saying that in Postgresql:

select first_name, last_name from user_table;

uses the same memory as this?

select first_name, last_name,
password, email,
[10 other columns]
from user_table;



-- 
Bill Moseley
[EMAIL PROTECTED]




[GENERAL] Performance issues of one vs. two split tables.

2007-05-14 Thread Dawid Kuroczko

On 5/14/07, Rich Shepard <[EMAIL PROTECTED]> wrote:

On Mon, 14 May 2007, PFC wrote:

>   I did something like that on MySQL some time ago.
>   In the Users table there was stuff that other users need to see (like
> his login name, etc), and stuff that only this user needs to see (like his
> preferences).
>   So, when displaying posts in the forum, for instance, only a small
> part of the fields in the Users table was needed, the rest was just dead
> weight, that made the table unable to fit in RAM.
>   So I split the table, and it was faster.
>
>   However, you can also buy more RAM...

   Or, use Views without paying more.


Well, views are not going to help with memory consumption here.
It is the table contents that gets cached in the buffer cache, not the
view's contents.  So if you have a view which returns only one
column from a 15-column table, you will be caching that 15-column
data nonetheless.  A view, as the name states, is converted into
a select on a real table.

As for the actual colors table -- you might consider keeping colors
as an int[] (integer array) column, though I'll never admit I suggested
that to you. :D

  Regards,
   Dawid



Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-14 Thread Rich Shepard

On Mon, 14 May 2007, PFC wrote:


I did something like that on MySQL some time ago.
	In the Users table there was stuff that other users need to see (like 
his login name, etc), and stuff that only this user needs to see (like his 
preferences).
	So, when displaying posts in the forum, for instance, only a small 
part of the fields in the Users table was needed, the rest was just dead 
weight, that made the table unable to fit in RAM.

So I split the table, and it was faster.

However, you can also buy more RAM...


  Or, use Views without paying more.

Rich

--
Richard B. Shepard, Ph.D.   |The Environmental Permitting
Applied Ecosystem Services, Inc.|  Accelerator(TM)
 Voice: 503-667-4517  Fax: 503-667-8863



Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-14 Thread Bill Moseley
On Mon, May 14, 2007 at 10:52:13PM +0200, PFC wrote:
> 
> >Say I have a "user" table that has first, last, email, password, and
> >last_accessed columns.  This user table will be accessed often.  (It's
> >not really "user", but that's not important in this discussion)
> >
> >Say that there's also about 10 columns of settings or preferences for
> >each user.  Are there any cases or reasons to have a separate
> >"user_preferences" table vs. just placing all the columns together in
> >one table?
> 
>   I did something like that on MySQL some time ago.
>   In the Users table there was stuff that other users need to see 
>   (like his  login name, etc), and stuff that only this user needs to see 
> (like his  preferences).

>   So, when displaying posts in the forum, for instance, only a small 
>   part  of the fields in the Users table was needed, the rest was just 
> dead  
> weight, that made the table unable to fit in RAM.

Well, that's part of my question.  If I'm not selecting those columns in
the common selects, how much "dead weight" is brought along due to the
extra columns defined in the table, if any?

-- 
Bill Moseley
[EMAIL PROTECTED]




Re: [GENERAL] Performance issues of one vs. two split tables.

2007-05-14 Thread PFC



Say I have a "user" table that has first, last, email, password, and
last_accessed columns.  This user table will be accessed often.  (It's
not really "user", but that's not important in this discussion)

Say that there's also about 10 columns of settings or preferences for
each user.  Are there any cases or reasons to have a separate
"user_preferences" table vs. just placing all the columns together in
one table?


I did something like that on MySQL some time ago.
	In the Users table there was stuff that other users need to see (like his  
login name, etc), and stuff that only this user needs to see (like his  
preferences).
	So, when displaying posts in the forum, for instance, only a small part  
of the fields in the Users table was needed, the rest was just dead  
weight, that made the table unable to fit in RAM.

So I split the table, and it was faster.

However, you can also buy more RAM...



[GENERAL] Performance issues of one vs. two split tables.

2007-05-14 Thread Bill Moseley

Is there any benefit of splitting up a table into two tables that will
always have a one-to-one relationship?

Say I have a "user" table that has first, last, email, password, and
last_accessed columns.  This user table will be accessed often.  (It's
not really "user", but that's not important in this discussion)

Say that there's also about 10 columns of settings or preferences for
each user.  Are there any cases or reasons to have a separate
"user_preferences" table vs. just placing all the columns together in
one table?


Another related question:

Say I have a table "color" that has about 20 different rows ("red",
"blue", "green", etc.).  I want the user to be able to select zero or
more favorite colors.  I would typically use a link table:

create table favorite_colors (
color   int references color(id),
user    int references user(id)
);

Now, that table can have a large number of rows if I have a large
number of users and if everyone likes all the colors.

For some value of "large", is there a time when one might consider
using a single column in the user or user_prefs table to represent
their color choices instead of a link table?

table user_prefs (
...
favorite_colors bit varying,
...
);

Where each bit represents the primary key of the colors table.

Seems like poor design, but I'm wondering if there might be overriding
concerns at times.

For example, if I have 1 million users and they each like all colors
and thus have a 20 million row link table how much space would be
saved by using a bit column as above?
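
As a very rough back-of-the-envelope illustration only (assuming on the
order of 40 bytes per link-table row including tuple overhead, and a few
bytes of bit data per user, both of which are approximations):

-- roughly 760 MB for the 20-million-row link table, before indexes
SELECT 20000000 * 40 / (1024 * 1024) AS link_table_mb;
-- versus only a few MB added to user_prefs for 1 million bit values
SELECT 1000000 * 8 / (1024 * 1024) AS bit_column_mb;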



-- 
Bill Moseley
[EMAIL PROTECTED]




Re: [GENERAL] Performance/Issues with CMP and JBoss

2006-08-02 Thread James Robinson
Diagnosing JBossCMP issues is not for the faint of heart, in that one
of its main raisons d'être is to hide SQL knowledge away from the
casual coder. Add into the mix the concurrency issues which naturally
occur since an EJB container is multithreaded, plus the overly
complicated JTA stuff, and you have a real mess to diagnose when things
don't work just so.


When we ran CMP, we'd also run postgres in debugging mode so as to  
have it emit queries onto stdout so that we could watch exactly what  
the CMP was doing. Try starting up postgres ala:


	/usr/local/pgsql/bin/postmaster -d 2 -i -D /usr/local/pgsql/data  
2>&1 | grep LOG


Your mileage may vary, as would your PGDATA dir etc. Read the docs on  
the postmaster. But the end result is the ability to watch each query  
fly by on your development machine -- letting you see the order of  
which updates, inserts, selects, and commits happen from the postgres  
backend's perspective.


I suspect the original poster has code issues being tickled by java  
threading issues interacting poorly with their default transaction  
isolation level of READ COMMITTED -- their inserts are being done in  
one thread / JTA transaction, while the read is being done in another  
and is losing the race -- the inserting thread has not committed
yet. Running the backend in debugging mode should let you see the  
select happening _before_ the first thread has committed its  
transaction. Postgres is doing exactly what it is being told -- if  
the inserting transaction has not yet committed, and the reading  
transaction's isolation level is set to READ COMMITTED, then postgres  
_will_not_ return anything to the reading connection / thread which  
has not yet been committed.
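
A minimal sketch of that race at the SQL level (the table name and value
are made up), run in two concurrent sessions:

-- session 1 (the inserting JTA transaction)
BEGIN;
INSERT INTO widget (id) VALUES (1);

-- session 2 (the reading transaction, READ COMMITTED)
SELECT * FROM widget WHERE id = 1;   -- returns no rows: not committed yet

-- session 1
COMMIT;

-- session 2
SELECT * FROM widget WHERE id = 1;   -- now returns the row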


Good luck with reading and following all of the EJB and CMP  
specifications, the JBossCMP documentation, the JTA spec, and then  
swallowing all of postgres [ or any other SQL backend ]. If you don't  
have all of 'em fully understood yet, you will have to one day if you  
continue with all that fat tech which was supposed to make things  
easy for you. CMP is a very leaky overcomplicated abstraction.



James Robinson
Socialserve.com




Re: [GENERAL] Performance/Issues with CMP and JBoss

2006-08-02 Thread Richard Huxton

Andy Dale wrote:

Hi,

I will explain in more detail what the test (extremely simple) program is
actually doing.  A session bean receives some data (roughly 3K) and then
persists this data to the database with EntityManager.persist() (using the
EJB 3.0 Persistence API that comes with JBoss).  Once the persist method
returns, a message is sent to a message driven bean where it tries to get
the previously persisted data from the database using the
EntityManager.find() method; this is where we run into problems with
Postgres, it only seems to find the persisted object 5% of the time.  I
don't know if you class it as a performance problem or an error, but seeing
as it works in other databases I am more inclined to classify this as an
error.


Well that's easy to decide.
1. Did the transaction that stored the object complete successfully?
2. Should the results of that transaction be visible from the reading 
transaction?


If Yes & Yes, but you can't find the object, there's an error.


When I say works really well, I mean it in the sense that the query the
EntityManager produces for the find operation is something as simple as
"select ... from table where ... = ?", and this is causing problems for
Postgres.  The query is in the form of a prepared statement, so could this
be causing any problems?


Prepared queries mean you can't do certain optimisations, but for a 
single-table fetch on columns with a unique index I'd expect an index to 
be used (assuming analyse has been run recently).
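

One quick way to check is to prepare the same shape of statement in psql and
look at the plan (a sketch with an invented table name):

    prepare find_thing (int) as
        select * from thing where id = $1;

    explain execute find_thing (42);

With a unique index on thing.id and reasonably fresh statistics (run
ANALYZE), that should come out as an Index Scan rather than a Seq Scan.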



As far as hardware goes my testing machine is:

P4 3.0 GHz
1GB RAM
20GB of HD (IDE)

But we intend to use an HP ProLiant server with the following spec:

Intel Xeon 3.3 GHz
2 GB RAM
146GB Ultra SCSI 320

The Operating System being run on both is Fedora Core 5

The server/machine also needs to run other programs and processes, so we
don't want the database to hog too much of the resources, about 10-20% of
RAM and CPU (and we'd like to know how to configure it for that).  The
current config as defined in the postgres.conf file is as follows:


Squeezing PostgreSQL and cache-space for its data into 256MB is going to 
depend on how large your DB is. Oh, and if you have a lot of updates 
then disk will probably be the limiting factor.



# -
# PostgreSQL configuration file
# -
#
# This file consists of lines of the form:
#


We don't really need to see all the comment lines

#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------



# - Memory -

shared_buffers = 1000# min 16 or max_connections*2, 8KB each
#temp_buffers = 1000# min 100, 8KB each
#max_prepared_transactions = 5# can be 0 or more
# note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).
#work_mem = 1024# min 64, size in KB
#maintenance_work_mem = 16384# min 1024, size in KB
#max_stack_depth = 2048# min 100, size in KB


OK, so you haven't actually done any configuration. PG *will* crawl with
the default settings; it's set up so you can install it on your 5-year-old
laptop without keeling over. I recommend you read the short article

  http://www.powerpostgresql.com/PerfList

Oh, and if you don't know what vacuum, analyse and the autovacuum tool
are, you'll want to read the relevant parts of the manual.
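
As a very rough starting point for a box of that size where Postgres is only
allowed a slice of the RAM, something along these lines (illustrative
8.1-style values, not a recommendation -- tune from the article above):

    shared_buffers = 16384          # 8KB pages, ~128MB
    work_mem = 8192                 # KB, per sort/hash, so watch concurrency
    maintenance_work_mem = 65536    # KB, helps VACUUM and index builds
    effective_cache_size = 32768    # 8KB pages, planner hint only, ~256MB

    # if this is 8.1, autovacuum also needs the row-level stats:
    autovacuum = on
    stats_start_collector = on
    stats_row_level = on

Failing autovacuum, a periodic VACUUM ANALYZE keeps dead rows and planner
statistics under control.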


--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Performance/Issues with CMP and JBoss

2006-08-02 Thread Richard Huxton

Andy Dale wrote:

Hi,

I have performed some tests earlier on today, and I think the problem lies
with Postgres and its bad performance when being used with container
managed persistence.


Is your problem performance or an error? It sounded like you were 
getting errors in your first post.


> I am convinced of it being an issue with postgres

because it works really well with MySQL and Hypersonic, but not with
Postgres. 


*What* works really well? Can you tell us what query/queries are giving 
you problems?


> I have been reading around on the internet and it seems that

Postgres does not work too well with container managed persistence, but
surely all i have to do is just change some configuration settings, or can
Postgres not be used with container managed persistence.


If you generate valid SQL then PostgreSQL can certainly process the 
queries. As to whether tuning will help, nobody can say because you 
haven't supplied details of:

1. Hardware
2. Operating System
3. Queries giving problems
4. Concurrency details
5. Current configuration settings
6. System activity (is CPU/RAM/IO maxed?)

Without at least *some* of these facts nobody can say anything useful.

--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [GENERAL] Performance/Issues with CMP and JBoss

2006-08-02 Thread Andy Dale
Hi,

I have performed some tests earlier on today, and I think the problem lies
with Postgres and its bad performance when being used with container
managed persistence.  I am convinced of it being an issue with Postgres
because it works really well with MySQL and Hypersonic, but not with
Postgres.  I have been reading around on the internet and it seems that
Postgres does not work too well with container managed persistence, but
surely all I have to do is just change some configuration settings, or can
Postgres not be used with container managed persistence?

Thanks,
Andy

On 8/1/06, Douglas McNaught <[EMAIL PROTECTED]> wrote:
> "Andy Dale" <[EMAIL PROTECTED]> writes:
>
> > The current problem we seem to have is that the data is persisted ok
> > (or at least it seems to be in there with pgadmin), but cannot be
> > read back out of the database all the time (in fact for about 90% of
> > the time), the current behaviour of the application suggests it is
> > trying to read it back out of the database (using
> > EntityManager.find()) before it has really been saved, and thus
> > fails to find the data.  Do i have to tweak some settings in the
> > postgres.conf file ? i have tried turning off fsync (i do not want
> > to do this, for reliability reasons) and it performed far better.
> > Can anyone advise me on the changes i need to make to speed up the
> > inserting of data, i know that turning autocommit off is supposed to
> > increase performance.
>
> This is almost certainly a problem with your persistence layer rather
> than with Postgres.  If you can see the data with PGAdmin then it's in
> the database.  It may be that the transaction that saves the object is
> not committing quickly, and so other connections don't see the object
> until the commit happens.  But that's not the fault of Postgres.
>
> -Doug


Re: [GENERAL] Performance/Issues with CMP and JBoss

2006-08-01 Thread Douglas McNaught
"Andy Dale" <[EMAIL PROTECTED]> writes:

> The current problem we seem to have is that the data is persisted ok
> (or at least it seems to be in there with pgadmin), but cannot be
> read back out of the database all the time (in fact for about 90% of
> the time), the current behaviour of the application suggests it is
> trying to read it back out of the database (using
> EntityManager.find()) before it has really been saved, and thus
> fails to find the data.  Do i have to tweak some settings in the
> postgres.conf file ? i have tried turning off fsync (i do not want
> to do this, for reliability reasons) and it performed far better.
> Can anyone advise me on the changes i need to make to speed up the
> inserting of data, i know that turning autocommit off is supposed to
> increase performance.

This is almost certainly a problem with your persistence layer rather
than with Postgres.  If you can see the data with PGAdmin then it's in
the database.  It may be that the transaction that saves the object is
not committing quickly, and so other connections don't see the object
until the commit happens.  But that's not the fault of Postgres.

-Doug

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[GENERAL] Performance/Issues with CMP and JBoss

2006-08-01 Thread Andy Dale
Hi,

We currently have a JBoss web application that persists a byte array it
receives (using the EJB persistence API), and then tries to read it from
the database again for further parsing.  The application works really well
with the default Hypersonic datasource, but it will not work correctly when
using Postgres as the datasource.

The current problem we seem to have is that the data is persisted ok (or at
least it seems to be in there with pgadmin), but cannot be read back out of
the database all the time (in fact for about 90% of the time); the current
behaviour of the application suggests it is trying to read it back out of
the database (using EntityManager.find()) before it has really been saved,
and thus fails to find the data.  Do I have to tweak some settings in the
postgres.conf file?  I have tried turning off fsync (I do not want to do
this, for reliability reasons) and it performed far better.  Can anyone
advise me on the changes I need to make to speed up the inserting of data;
I know that turning autocommit off is supposed to increase performance.

Thanks in advance,
Andy