Re: [GENERAL] huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search
On Tue, 28 Sep 2010 20:19:10 +0200 Alban Hertroys wrote: > On 28 Sep 2010, at 12:49, Ivan Sergio Borgonovo wrote: > > The hardware on the 2 machines is a bit different. > > MS SQL 2005 runs on a box that is 2 years old, 2 SATA drives on > > RAID 1 hw, 2 Xeon dual core (I can't check details right now) > > PG runs on a box that has more than 5 years, 3 SCSI drives on > > RAID 5 hw, 2 Xeon single core HT (Intel(R) Xeon(TM) CPU 3.20GHz, > > cpu fam 14, model 4) > > In both tables I've to update price. > > VACUUM FULL was performed just before updating the prices. > VACUUM FULL? Was that really necessary? You did REINDEX after > that, didn't you? If not, your indexes became bloated. If the > table wasn't empty before you probably meant to do a VACUUM > ANALYSE, but if it was, just ANALYSE would have been sufficient. ... > Did you ANALYSE between loading the data and updating? I thought VACUUM FULL was more "magical" and implied a REINDEX. Am I wrong? The index that should be reindexed is the one on the pk, a simple btree, that's not going to be as slow as rebuilding a gin... still I'd really thought that VACUUM FULL implied a lot of things (including ANALYZE as well). > Also, executing the trigger on each copied line is likely to be a > little slow, due to the overhead of calling a stored procedure > (especially if it's plpgsql). It's probably quite a bit faster to > disable the trigger and create the gin-index after loading the > data. I'll try to drop the trigger. I'm not expecting it the biggest factor still adding something here and something there may end up in the huge difference between the 2. Anyway MS SQL seems to overcome all this nuisances auto-magically. > An EXPLAIN ANALYSE of that statement would tell what it's doing > and what's taking so long. A simple EXPLAIN would probably be > sufficient to see what query plan it thinks it needs though. I'll post the EXPLAIN. Before I run EXPLAIN ANALYZE I've to take some precaution the DB doesn't explode. > Did you tune that database? Several options (work_mem for example) > could significantly improve your performance if you can set them > higher (or reduce it if you set them too high). You can do that > per session too. pg is generally faster than the other MS SQL box on what's normally done on a daily basis. Just large updates to the product page seems to be a pain. Other INSERT/UPDATE operations are seldom performed, they involve smaller tables with no gin index. > > If HW can justify such huge difference I'll devote my time to > > other problems. > Partially, yes, but not that much I think. That's my worry... but still in many circumstances pg performs better than the MS SQL box... yeah... on pretty different workload... but while on other workloads pg is a bit faster (20% to 100% faster) even if it is on an older box, on this one is very slow. > > I'd say that a potential culprit could be the gin index. No > > matter if the tsvector is updated or not, if the row is changed > > I think the index is going to be updated anyway. > gin indexes require relatively much RAM. If you didn't assign much > in your settings then it's quite possible that the database can't > keep the index in memory or that things have to spill to disk. > Leave enough room for the OS's disk cache though, Postgres > benefits from that as well. > Is there any particular reason you went with a gin index and not a > gist one? Gin can be faster, but consumes (much) more memory, but > gist is also quite good with text searches and doesn't require > quite as much memory. gin index is doing a very good work and well full text searches are the typical workload of that box and the one that is more important to be fast. I'd say if gin was occupying so much memory performances wouldn't be so good on a daily basis. I'd post excerpt of my postgres.conf (what's important here?) and see if anything can be improved for *this* workload and temporary tune the DB for this exceptional update still I'm going to continue to be a bit surprised of such a huge difference even if it will come out that it was actually a .conf not suited for this workload. BTW the box is running Apache and php. thanks -- Ivan Sergio Borgonovo http://www.webthatworks.it -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search
On Tue, Sep 28, 2010 at 02:35:09PM +0300, Allan Kamau wrote: > I have access > to a server running PG 8.4 on Ubuntu and I have noticed that after a > day of intense use the PG slows down significantly, "free -g" reports > almost no free memory available (something seems to leak memory on > this Ubuntu box). I'm not sure which values you're looking at in "free", but you generally want there to be very little free memory--you want the memory to be used for caching the disk. As long as the cached data isn't "dirty" (i.e. unwritten data) then it can be released very quickly and made available for whatever is needed, or better serve as a useful cache. Rebooting normally just hides other issues. -- Sam http://samason.me.uk/ -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search
On 28 Sep 2010, at 12:49, Ivan Sergio Borgonovo wrote: > The hardware on the 2 machines is a bit different. > MS SQL 2005 runs on a box that is 2 years old, 2 SATA drives on RAID > 1 hw, 2 Xeon dual core (I can't check details right now) > PG runs on a box that has more than 5 years, 3 SCSI drives on RAID 5 > hw, 2 Xeon single core HT (Intel(R) Xeon(TM) CPU 3.20GHz, cpu fam > 14, model 4) RAID-5 isn't ideal for databases, the RAID-1 in the other box is probably faster. Especially since it's on more modern hardware. > In both tables I've to update price. > VACUUM FULL was performed just before updating the prices. VACUUM FULL? Was that really necessary? You did REINDEX after that, didn't you? If not, your indexes became bloated. If the table wasn't empty before you probably meant to do a VACUUM ANALYSE, but if it was, just ANALYSE would have been sufficient. > MS SQL receives a large sql file that contain all the UPDATE > statements. > PG receive a csv file that is loaded into a table with COPY and then > does the update as > update products set price=p.price from temp_price where id=p.id and > price<>p.price; Did you ANALYSE between loading the data and updating? Also, executing the trigger on each copied line is likely to be a little slow, due to the overhead of calling a stored procedure (especially if it's plpgsql). It's probably quite a bit faster to disable the trigger and create the gin-index after loading the data. > MS SQL ingurgitate the whole sql file in around 10sec. > pg takes more than 5 min to just run the single update statement. An EXPLAIN ANALYSE of that statement would tell what it's doing and what's taking so long. A simple EXPLAIN would probably be sufficient to see what query plan it thinks it needs though. Since an update means PG has to insert and delete records, the fact that the database is on RAID-5 is probably a factor here. How much? No idea. > I'd like to know if such a large difference can be justified just by > HW difference or by a difference in the process on how data are > loaded [1] or by the difference in performance of the 2 servers on > this kind of workload or by some postgres config before I decide how > to manage my time to redesign the import procedure. Did you tune that database? Several options (work_mem for example) could significantly improve your performance if you can set them higher (or reduce it if you set them too high). You can do that per session too. > If HW can justify such huge difference I'll devote my time to other > problems. Partially, yes, but not that much I think. > I'd say that a potential culprit could be the gin index. No matter > if the tsvector is updated or not, if the row is changed I think the > index is going to be updated anyway. gin indexes require relatively much RAM. If you didn't assign much in your settings then it's quite possible that the database can't keep the index in memory or that things have to spill to disk. Leave enough room for the OS's disk cache though, Postgres benefits from that as well. Is there any particular reason you went with a gin index and not a gist one? Gin can be faster, but consumes (much) more memory, but gist is also quite good with text searches and doesn't require quite as much memory. > [1] I'd expect that excluding the time it takes to load the csv a > single update should run faster than a huge list of single statement > update Correct. Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll see there is no forest. !DSPAM:737,4ca231ae678301692839670! -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search
On Tue, Sep 28, 2010 at 1:49 PM, Ivan Sergio Borgonovo wrote: > I know I'm comparing apples and orange but still the difference in > performance was quite astonishing. > > I've 2 tables that look like: > > create table products( > id bigint > price double precision, /* legacy, don't ask */ > sometextfield1 varchar(128), > sometextfield2 varchar(128), > ... > ); > > one on a MS SQL 2005 and another one on pg 8.3. > > MS SQL has full text search on the text fields (I don't know the > details). > > pg product table has a tsvector field and a gin index defined on it + > trigger that update the tsvector field when the textfields change. > The trigger is made in a way that it actually update the tsvector > just if the text fields are changed. > > The hardware on the 2 machines is a bit different. > MS SQL 2005 runs on a box that is 2 years old, 2 SATA drives on RAID > 1 hw, 2 Xeon dual core (I can't check details right now) > PG runs on a box that has more than 5 years, 3 SCSI drives on RAID 5 > hw, 2 Xeon single core HT (Intel(R) Xeon(TM) CPU 3.20GHz, cpu fam > 14, model 4) > > Both have 4Gb of ram. > shared_buffers is 240Mb. > Both share a similar workload. > Both boxes were in the same "price class" when they were bought. > > In both tables I've to update price. > VACUUM FULL was performed just before updating the prices. > > MS SQL receives a large sql file that contain all the UPDATE > statements. > PG receive a csv file that is loaded into a table with COPY and then > does the update as > update products set price=p.price from temp_price where id=p.id and > price<>p.price; > > MS SQL ingurgitate the whole sql file in around 10sec. > pg takes more than 5 min to just run the single update statement. > > I'd like to know if such a large difference can be justified just by > HW difference or by a difference in the process on how data are > loaded [1] or by the difference in performance of the 2 servers on > this kind of workload or by some postgres config before I decide how > to manage my time to redesign the import procedure. > > If HW can justify such huge difference I'll devote my time to other > problems. > > I'd say that a potential culprit could be the gin index. No matter > if the tsvector is updated or not, if the row is changed I think the > index is going to be updated anyway. > Somehow MS SQL circumvent this problem, possibly by building the > equivalent of a tsvector column in a "hidden" table that > automatically join to the "text" table. > This add a join but reduce the cost of table modification since > simpler (btree) indexes are faster to update. > > Still huge updates are rare and that table is mostly read and very > rarely written. During unusually huge updates I may consider to drop > the gin index. > > [1] I'd expect that excluding the time it takes to load the csv a > single update should run faster than a huge list of single statement > update > > -- > Ivan Sergio Borgonovo > http://www.webthatworks.it > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general > Restarting the computer on which PG is running may help. I have access to a server running PG 8.4 on Ubuntu and I have noticed that after a day of intense use the PG slows down significantly, "free -g" reports almost no free memory available (something seems to leak memory on this Ubuntu box). But when I restart the OS (Ubuntu), the PG executes my queries in good time. I seem not to have similar problems on the other servers running Fedora 12 and 13. But it could be my configuration(s) on the Ubuntu box at fault, I am still investigating. Allan. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] huge difference in performance between MS SQL and pg 8.3 on UPDATE with full text search
I know I'm comparing apples and orange but still the difference in performance was quite astonishing. I've 2 tables that look like: create table products( id bigint price double precision, /* legacy, don't ask */ sometextfield1 varchar(128), sometextfield2 varchar(128), ... ); one on a MS SQL 2005 and another one on pg 8.3. MS SQL has full text search on the text fields (I don't know the details). pg product table has a tsvector field and a gin index defined on it + trigger that update the tsvector field when the textfields change. The trigger is made in a way that it actually update the tsvector just if the text fields are changed. The hardware on the 2 machines is a bit different. MS SQL 2005 runs on a box that is 2 years old, 2 SATA drives on RAID 1 hw, 2 Xeon dual core (I can't check details right now) PG runs on a box that has more than 5 years, 3 SCSI drives on RAID 5 hw, 2 Xeon single core HT (Intel(R) Xeon(TM) CPU 3.20GHz, cpu fam 14, model 4) Both have 4Gb of ram. shared_buffers is 240Mb. Both share a similar workload. Both boxes were in the same "price class" when they were bought. In both tables I've to update price. VACUUM FULL was performed just before updating the prices. MS SQL receives a large sql file that contain all the UPDATE statements. PG receive a csv file that is loaded into a table with COPY and then does the update as update products set price=p.price from temp_price where id=p.id and price<>p.price; MS SQL ingurgitate the whole sql file in around 10sec. pg takes more than 5 min to just run the single update statement. I'd like to know if such a large difference can be justified just by HW difference or by a difference in the process on how data are loaded [1] or by the difference in performance of the 2 servers on this kind of workload or by some postgres config before I decide how to manage my time to redesign the import procedure. If HW can justify such huge difference I'll devote my time to other problems. I'd say that a potential culprit could be the gin index. No matter if the tsvector is updated or not, if the row is changed I think the index is going to be updated anyway. Somehow MS SQL circumvent this problem, possibly by building the equivalent of a tsvector column in a "hidden" table that automatically join to the "text" table. This add a join but reduce the cost of table modification since simpler (btree) indexes are faster to update. Still huge updates are rare and that table is mostly read and very rarely written. During unusually huge updates I may consider to drop the gin index. [1] I'd expect that excluding the time it takes to load the csv a single update should run faster than a huge list of single statement update -- Ivan Sergio Borgonovo http://www.webthatworks.it -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general