Re: [PERFORM] Finding rows in table T1 that DO NOT MATCH any row in table T2

2009-10-21 Thread Shaul Dar
Tom,

1. Actually, I just tested your suggestion

SELECT COUNT(*) FROM T1 where NOT EXISTS
(SELECT 1 FROM T2 where T1.PK = T2.FK)

and it worked in PG 8.3.8. On a DB with 6M T1 records and 5M T2 records it
took 1m8s.

My suggestion, i.e.

SELECT COUNT(*) FROM T1 LEFT JOIN T2 ON T1.PK = T2.FK
WHERE T2.FK IS NULL

was about twice as fast, at 37s (both returned the same number of rows, about
2/3 of T1).

However, I can use DELETE with your version (instead of the "SELECT COUNT(*)"
above) but not with mine (you can't have a LEFT JOIN in DELETE), so YOU WIN.
Thanks!

2. BTW, I presented my question earlier in an overly simplified fashion.
Sorry. In actuality the two tables are joined on two columns,
say Ka and Kb (a composite key), i.e. T1.PKa = T2.FKa and T1.PKb =
T2.FKb. So the IN versions suggested will not work,
since AFAIK IN only works on a single value.

-- Shaul

On Tue, Oct 20, 2009 at 3:59 PM, Tom Lane wrote:

> Shaul Dar writes:
> > I assume this will work but will take a long time:
>
> > DELETE * FROM T1 where T1.PK NOT IN
> > (SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)
>
> Well, yeah, but it's unnecessarily inefficient --- why not just
>
> DELETE FROM T1 where T1.PK NOT IN
> (SELECT T2.FK FROM T2)
>
> However, that still won't be tremendously fast unless the subselect fits
> in work_mem.  As of 8.4 this variant should be reasonable:
>
> DELETE FROM T1 where NOT EXISTS
> (SELECT 1 FROM T2 where T1.PK = T2.FK)
>
> Pre-8.4 you should resort to the "left join where is null" trick,
> but there's no need to be so obscure as of 8.4.
>
>regards, tom lane
>
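
For completeness, a sketch of that pre-8.4 "left join where is null" trick in
DELETE form, wrapping the left join in a subselect since DELETE itself accepts
no LEFT JOIN (assumes T1.PK is unique, as a primary key is):

DELETE FROM T1
WHERE T1.PK IN (
    SELECT a.PK
    FROM T1 a
    LEFT JOIN T2 b ON a.PK = b.FK
    WHERE b.FK IS NULL   -- keep only T1 rows with no match in T2
);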


[PERFORM] Random penalties on GIN index updates?

2009-10-21 Thread jesper
Hi (running PG 8.4.1)

So far in my testing of PG full text search, I have got over 6M documents
indexed and the index has grown to 37GB. The system didn't run any
autovacuums in the process, but I manually vacuumed a few times, and that
stopped the growth for a short period of time.

 table_name |  index_name  | times_used | table_size | index_size | num_writes | definition
------------+--------------+------------+------------+------------+------------+-----------------------------------------------------------------------
 ftstest    | body_tfs_idx |        171 | 5071 MB    | 37 GB      |    6122086 | CREATE INDEX ftstest_tfs_idx ON ftstest USING gin (ftstest_body_fts)
(1 row)

This is sort of what I'd expect; it is no more scary than the Xapian
index it is being compared with. Search speed seems excellent. But I feel
I'm getting a significant drop-off in indexing speed as time goes by,
though I don't have numbers to confirm this.

If I understand the technicalities correctly, INSERTs/UPDATEs to the
index are accumulated in "maintenance_work_mem", and the "user" who is
unlucky enough to fill it up pays the penalty of merging all the changes
into the index?

I currently have "maintenance_work_mem" set to 128MB, and according to
"pg_stat_activity" I currently have an INSERT that has been sitting for
over 1 hour. If I strace the postgres process, it is reading and writing
a lot on the filesystem and imposing an I/O-wait load of 1 CPU.
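
For reference, the backend pid to strace can be found with a query along
these lines (procpid and current_query are the 8.4-era column names):

SELECT procpid, query_start, current_query
FROM pg_stat_activity
WHERE current_query LIKE 'INSERT%';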

Can I do something to prevent this from happening? Is it "by design"?

-- 
Jesper




Re: [PERFORM] Random penalties on GIN index updates?

2009-10-21 Thread Tom Lane
jes...@krogh.cc writes:
> If I understand the technicalities correctly, INSERTs/UPDATEs to the
> index are accumulated in "maintenance_work_mem", and the "user" who is
> unlucky enough to fill it up pays the penalty of merging all the changes
> into the index?

You can turn off the "fastupdate" index parameter to disable that,
but I think there may be a penalty in index bloat as well as insertion
speed.  It would be better to use a more conservative work_mem
(work_mem, not maintenance_work_mem, is what limits the amount of stuff
accumulated during normal inserts).
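
For reference, fastupdate is a per-index storage parameter; a minimal sketch
using the table and index names from the original post (8.4 syntax):

-- Stop buffering this index's updates in the pending list; inserts
-- then update the GIN index directly.
ALTER INDEX ftstest_tfs_idx SET (fastupdate = off);

-- A VACUUM on the table flushes whatever is already pending.
VACUUM ftstest;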

regards, tom lane



Re: [PERFORM] There is a statistic table?

2009-10-21 Thread Scott Carey



On 10/15/09 11:27 PM, "Albe Laurenz" wrote:

> waldomiro wrote:
>> I need to know how often Postgres is going to disk to get
>> blocks and how often it is getting them from the cache. Which is the
>> statistics table, and what are the fields that indicate blocks
>> read from disk versus from the memory cache?
> 
> The view pg_statio_all_tables will show you the number of
> disk reads and buffer hits per table.

My understanding is that it will not show that.  Since postgres can't
distinguish between a read that comes from OS cache and one that goes to
disk, you're out of luck on knowing anything exact.
The above shows what comes from shared_buffers versus the OS, however.  And
if reads are all buffered, they are not coming from disk.  Only those that
come from the OS _may_ have come from disk.
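
A minimal sketch of reading those counters, using the columns of
pg_statio_all_tables:

SELECT relname,
       heap_blks_read,  -- block requests answered by the OS (its cache or disk)
       heap_blks_hit,   -- block requests answered from shared_buffers
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_all_tables
ORDER BY heap_blks_read DESC;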

> 
> There are other statistics views, see
> http://www.postgresql.org/docs/8.4/static/monitoring-stats.html#MONITORING-STATS-VIEWS
> 
>> Another question is, what is the best memory configuration to
>> keep more data in cache?
> 
> That's easy - the greater shared_buffers is, the more cache you have.
> 
> Another option is to choose shared_buffers not too large and let
> the filesystem cache buffer the database for you.
> 
> Yours,
> Laurenz Albe
> 




Re: [PERFORM] maintain_cluster_order_v5.patch

2009-10-21 Thread ph...@apra.asso.fr
Hi Jeff,

>> Hi all,
>> 
>> The current discussion about "Indexes on low cardinality columns" led
>> me to discover this
>> "grouped index tuples" patch (http://community.enterprisedb.com/git/)
>> and its associated 
>> "maintain cluster order" patch
>> (http://community.enterprisedb.com/git/maintain_cluster_order_v5.patch)
>> 
>> This last patch seems to cover the TODO item named "Automatically
>> maintain clustering on a table".
>
>The TODO item isn't clear about whether the order should be strictly
>maintained, or whether it should just make an effort to keep the table
>mostly clustered. The patch mentioned above makes an effort, but does
>not guarantee cluster order.
>
You are right, there are two different visions: a strictly maintained order or
a mostly maintained order.
The latter is already a good enhancement, as it greatly lengthens the time
interval needed between two CLUSTER operations, in particular if the FILLFACTOR
is properly set. In terms of performance, having 99% of rows in the "right"
page is not really worse than having totally optimized storage.
The only benefit of a strictly maintained order is that there is no need for
CLUSTER at all, which could be very interesting for very large databases that
must stay accessible 24/7.
For our needs, the "mostly maintained order" is enough.
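
For reference, both knobs mentioned here are plain SQL; a minimal sketch with
a hypothetical table and index (per-table FILLFACTOR since 8.2, CLUSTER ...
USING since 8.3):

ALTER TABLE mytable SET (fillfactor = 90);  -- leave 10% of each heap page free
CLUSTER mytable USING mytable_key_idx;      -- one-time rewrite in index order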

>> As this patch is not so new (2007), I would like to know why it has
>> not yet been integrated into a standard version of PG (not well
>> finalized? not totally sure? not corresponding to the way the core
>> team would like to address this item?) and whether there is a good
>> chance of seeing it committed in the near future.
>
>Search the archives on -hackers for discussion. I don't think either of
>these features were rejected, but some of the work and benchmarking have
>not been completed.
OK, I will have a look.
>
>If you can help (either benchmark work or C coding), try reviving the
>features by testing them and merging them with the current tree.
OK, those are the rules of the game in such a community.
I am not a good C writer, but I will see what I can do.

> I recommend reading the discussion first, to see if there are any major
>problems.

>
>Personally, I'd like to see the GIT feature finished as well. When I
>have time, I was planning to take a look into it.
>
>Regards,
>   Jeff Davis





Re: [PERFORM] Random penalties on GIN index updates?

2009-10-21 Thread Jesper Krogh
Tom Lane wrote:
> jes...@krogh.cc writes:
>> If I understand the technicalities correctly, INSERTs/UPDATEs to the
>> index are accumulated in "maintenance_work_mem", and the "user" who is
>> unlucky enough to fill it up pays the penalty of merging all the changes
>> into the index?
> 
> You can turn off the "fastupdate" index parameter to disable that,
> but I think there may be a penalty in index bloat as well as insertion
> speed.  It would be better to use a more conservative work_mem
> (work_mem, not maintenance_work_mem, is what limits the amount of stuff
> accumulated during normal inserts). 

OK, I read the manual about that. Seems worth testing. What I'm seeing is
stuff like this:

2009-10-21T16:32:21
2009-10-21T16:32:25
2009-10-21T16:32:30
2009-10-21T16:32:35
2009-10-21T17:10:50
2009-10-21T17:10:59
2009-10-21T17:11:09
... then it went on steadily for another 180,000 documents.

Each row is a printout from the application doing the INSERTs; it prints
the time after each 1000 rows it gets through. It is the 38 minutes in the
middle I'm a bit worried about.

work_mem is set to 512MB; might that translate into 180,000 documents on
my system?

What I seem to be missing is a way to make sure some "background"
application is the one getting the penalty, so that a random user doing a
single insert won't get stuck. Is that doable?

It also seems to lock out other inserts while it is in this state.

-- 
Jesper



Re: [PERFORM] Finding rows in table T1 that DO NOT MATCH any row in table T2

2009-10-21 Thread Scott Carey

On 10/21/09 4:52 AM, "Shaul Dar" wrote:

> Tom,
> 
> 1. Actually, I just tested your suggestion
> 
> SELECT COUNT(*) FROM T1 where NOT EXISTS
> (SELECT 1 FROM T2 where T1.PK = T2.FK)
> 
> and it worked in PG 8.3.8. On a DB with 6M T1 records and 5M T2 records it
> took 1m8s.
> 
> My suggestion, i.e.
> 
> SELECT COUNT(*) FROM T1 LEFT JOIN T2 ON T1.PK = T2.FK
> WHERE T2.FK IS NULL
> 
> was about twice as fast, at 37s (both returned the same number of rows,
> about 2/3 of T1).
> 
> However, I can use DELETE with your version (instead of the "SELECT COUNT(*)"
> above) but not with mine (you can't have a LEFT JOIN in DELETE), so YOU WIN.
> Thanks!
> 
> 2. BTW, I presented my question earlier in an overly simplified fashion.
> Sorry. In actuality the two tables are joined on two columns,
> say Ka and Kb (a composite key), i.e. T1.PKa = T2.FKa and T1.PKb =
> T2.FKb. So the IN versions suggested will not work,
> since AFAIK IN only works on a single value.

The performance will stink in many cases, but IN and NOT IN can work on
multiple values, for example:

WHERE (a.key1, a.key2) NOT IN (SELECT b.key1, b.key2 FROM b).

The fastest (in 8.4) is definitely NOT EXISTS.

WHERE NOT EXISTS (SELECT 1 FROM b WHERE (b.key1, b.key2) = (a.key1, a.key2))

I've done this, deleting from tables with 15M+ rows where I needed a "not
in" on two or three columns against multiple other tables.
However, NOT EXISTS is only fast if every NOT EXISTS clause is a select on
one table; if it involves multiple tables and a join, things can get ugly
and the planner might not optimize it right.  In that case, use two NOT
EXISTS clauses.  Always look at the EXPLAIN plan.

With 8.4, for performance, generally prefer the following:
* prefer JOIN and implicit joins to IN and EXISTS.
* prefer 'NOT EXISTS' to 'NOT IN' or 'LEFT JOIN ... WHERE (right side IS NULL)'.
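
Putting those pieces together for the composite-key DELETE from point 2
above, a sketch using the thread's hypothetical column names:

DELETE FROM T1
WHERE NOT EXISTS (
    SELECT 1 FROM T2
    WHERE T2.FKa = T1.PKa
      AND T2.FKb = T1.PKb
);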


> 
> -- Shaul
> 
> On Tue, Oct 20, 2009 at 3:59 PM, Tom Lane wrote:
>> Shaul Dar writes:
>>> I assume this will work but will take a long time:
>> 
>>> DELETE * FROM T1 where T1.PK NOT IN
>>> (SELECT T1.PK FROM T1, T2 where T1.PK = T2.FK)
>> 
>> Well, yeah, but it's unnecessarily inefficient --- why not just
>> 
>> DELETE FROM T1 where T1.PK NOT IN
>> (SELECT T2.FK FROM T2)
>> 
>> However, that still won't be tremendously fast unless the subselect fits
>> in work_mem.  As of 8.4 this variant should be reasonable:
>> 
>> DELETE FROM T1 where NOT EXISTS
>> (SELECT 1 FROM T2 where T1.PK = T2.FK)
>> 
>> Pre-8.4 you should resort to the "left join where is null" trick,
>> but there's no need to be so obscure as of 8.4.
>> 
>>                         regards, tom lane
> 
> 




[PERFORM] Are unreferenced TOASTed values retrieved?

2009-10-21 Thread William Blunn

Hi guys,

Imagine if you will that I have a table thus

CREATE TABLE "lumps" (
    "id"   SERIAL PRIMARY KEY,
    "name" TEXT NOT NULL,
    "data" BYTEA NOT NULL
);

Imagine I have stored say 1000 rows.

In each row, we have stored on average
20 bytes in column "name",
10 megabytes in column "data".

So my table contains 10 gigabytes of "data" and 20 kilobytes of "name"s.

The values in column "data" will presumably be TOASTed.

Now, I go ahead and run the following query:

SELECT "name" FROM "lumps";

Clearly the query will need to retrieve something from all 1000 rows.

And now we get to the question:

Will the query engine retrieve the entire row (including 10 megabytes of
out-of-line TOASTed data) for every row, and then pick out column
"name", and take an age to do so, OR will the query engine retrieve just
the "direct" row, which includes "name" in-line, and return those to me
in the blink of an eye?


Clearly the former would be slow and undesirable, and the latter quick 
and desirable.


Regards,

Bill



Re: [PERFORM] Random penalties on GIN index updates?

2009-10-21 Thread Tom Lane
Jesper Krogh writes:
> What I seem to be missing is a way to make sure some "background"
> application is the one getting the penalty, so that a random user doing a
> single insert won't get stuck. Is that doable?

You could force a vacuum every so often, but I don't think that will
help the locking situation.  You really need to back off work_mem ---
512MB is probably not a sane global value for that anyway.

regards, tom lane



Re: [PERFORM] Are unreferenced TOASTed values retrieved?

2009-10-21 Thread Tom Lane
William Blunn writes:
> Will the query engine retrieve the entire row (including 10 megabytes of 
> out-of-line TOASTed data) for every row, and then pick out column 
> "name", and take an age to do so,

No.  That's pretty much the whole point of the TOAST mechanism;
out-of-line values are not fetched unless actually needed.  See
http://developer.postgresql.org/pgdocs/postgres/storage-toast.html
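
As a quick sanity check, the standard size functions show how much of a
table lives out of line; a sketch (note the second figure also includes
indexes):

SELECT pg_size_pretty(pg_relation_size('lumps'))      AS main_heap,
       pg_size_pretty(pg_total_relation_size('lumps')
                      - pg_relation_size('lumps'))    AS toast_plus_indexes;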

regards, tom lane



Re: [PERFORM] There is a statistic table?

2009-10-21 Thread Scott Marlowe
On Wed, Oct 21, 2009 at 11:17 AM, Scott Carey wrote:
>
>
>
> On 10/15/09 11:27 PM, "Albe Laurenz" wrote:
>
>> waldomiro wrote:
>>> I need to know how often Postgres is going to disk to get
>>> blocks and how often it is getting them from the cache. Which is the
>>> statistics table, and what are the fields that indicate blocks
>>> read from disk versus from the memory cache?
>>
>> The view pg_statio_all_tables will show you the number of
>> disk reads and buffer hits per table.
>
> My understanding is that it will not show that.  Since postgres can't
> distinguish between a read that comes from OS cache and one that goes to
> disk, you're out of luck on knowing anything exact.
> The above shows what comes from shared_buffers versus the OS, however.  And
> if reads are all buffered, they are not coming from disk.  Only those that
> come from the OS _may_ have come from disk.

I think he meant pg's shared_buffers, not the OS kernel cache.



[PERFORM] optimizing query with multiple aggregates

2009-10-21 Thread Doug Cole
I have a reporting query that is taking nearly all of its time in aggregate
functions, and I'm trying to figure out how to optimize it.  The query takes
approximately 170ms when run with "select *", but when run with all the
aggregate functions the query takes 18 seconds.  The slowness comes from our
attempt to find distribution data using selects of the form:

SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)

repeated across many different x,y values and fields to build out several
histograms of the data.  The main culprit appears to be the CASE statement,
but I'm not sure what to use instead.  I'm sure other people have had
similar queries and I was wondering what methods they used to build out data
like this?
Thanks for your help,
Doug


Re: [PERFORM] optimizing query with multiple aggregates

2009-10-21 Thread Merlin Moncure
On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole wrote:
> I have a reporting query that is taking nearly all of its time in aggregate
> functions and I'm trying to figure out how to optimize it.  The query takes
> approximately 170ms when run with "select *", but when run with all the
> aggregate functions the query takes 18 seconds.  The slowness comes from our
> attempt to find distribution data using selects of the form:
>
> SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
>
> repeated across many different x,y values and fields to build out several
> histograms of the data.  The main culprit appears to be the CASE statement,
> but I'm not sure what to use instead.  I'm sure other people have had
> similar queries and I was wondering what methods they used to build out data
> like this?

have you tried:

count(*) where field >= x AND field < y;

??

merlin



Re: [PERFORM] optimizing query with multiple aggregates

2009-10-21 Thread Doug Cole
On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure wrote:
>
> On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole wrote:
> > I have a reporting query that is taking nearly all of its time in aggregate
> > functions and I'm trying to figure out how to optimize it.  The query takes
> > approximately 170ms when run with "select *", but when run with all the
> > aggregate functions the query takes 18 seconds.  The slowness comes from our
> > attempt to find distribution data using selects of the form:
> >
> > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
> >
> > repeated across many different x,y values and fields to build out several
> > histograms of the data.  The main culprit appears to be the CASE statement,
> > but I'm not sure what to use instead.  I'm sure other people have had
> > similar queries and I was wondering what methods they used to build out data
> > like this?
>
> have you tried:
>
> count(*) where field >= x AND field < y;
>
> ??
>
> merlin

Unless I'm misunderstanding you, that would require breaking each bin
into a separate SQL statement, and since I'm trying to calculate more
than 100 bins across the different fields, any improvement in the
aggregate functions would be overwhelmed by the cost of repeating the
actual query, which is about 170ms.
Thanks,
Doug



Re: [PERFORM] optimizing query with multiple aggregates

2009-10-21 Thread David Wilson
On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole wrote:

>
> repeated across many different x,y values and fields to build out several
> histograms of the data.  The main culprit appears to be the CASE statement,
> but I'm not sure what to use instead.  I'm sure other people have had
> similar queries and I was wondering what methods they used to build out data
> like this?
>

Use GROUP BY with an appropriate division/rounding to create the
buckets, if they're all the same size.

select round(field/100) as bucket, count(*) as cnt from foo group by
round(field/100);
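
If the bins are uniform, width_bucket() expresses the same idea a bit more
directly; a sketch with hypothetical bounds:

-- 10 equal-width buckets over [0, 1000); out-of-range values land in buckets 0 and 11
SELECT width_bucket(field, 0, 1000, 10) AS bucket, count(*) AS cnt
FROM foo
GROUP BY 1
ORDER BY 1;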

-- 
- David T. Wilson
david.t.wil...@gmail.com


Re: [PERFORM] optimizing query with multiple aggregates

2009-10-21 Thread Nikolas Everett
So you've got a query like:
SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN 1 ELSE 0 END) as zeroToTen,
       SUM(CASE WHEN field >= 10 AND field < 20 THEN 1 ELSE 0 END) as tenToTwenty,
       SUM(CASE WHEN field >= 20 AND field < 30 THEN 1 ELSE 0 END) as twentyToThirty,
...
FROM bigtable


My guess is this is forcing a whole bunch of if-checks and you're getting
CPU bound.  Could you try something like:

SELECT SUM(CASE WHEN field >= 0 AND field < 10 THEN count ELSE 0 END) as zeroToTen,
       SUM(CASE WHEN field >= 10 AND field < 20 THEN count ELSE 0 END) as tenToTwenty,
       SUM(CASE WHEN field >= 20 AND field < 30 THEN count ELSE 0 END) as twentyToThirty,
...
FROM (SELECT field, count(*) FROM bigtable GROUP BY field) AS counts

which will allow a hash aggregate.  You'd do a hash aggregate on the whole
table, which should be quick, and then summarize your bins.

This all supposes that you don't want to just query postgres's column
statistics.
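
For that last option, the planner's histogram for a column is visible
directly; a sketch (table and column names follow the example above):

SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'bigtable' AND attname = 'field';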

On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole wrote:

> On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure wrote:
> >
> > On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole wrote:
> > > I have a reporting query that is taking nearly all of its time in
> > > aggregate functions, and I'm trying to figure out how to optimize it.
> > > The query takes approximately 170ms when run with "select *", but when
> > > run with all the aggregate functions the query takes 18 seconds.  The
> > > slowness comes from our attempt to find distribution data using selects
> > > of the form:
> > >
> > > SUM(CASE WHEN field >= x AND field < y THEN 1 ELSE 0 END)
> > >
> > > repeated across many different x,y values and fields to build out
> > > several histograms of the data.  The main culprit appears to be the
> > > CASE statement, but I'm not sure what to use instead.  I'm sure other
> > > people have had similar queries and I was wondering what methods they
> > > used to build out data like this?
> >
> > have you tried:
> >
> > count(*) where field >= x AND field < y;
> >
> > ??
> >
> > merlin
>
> Unless I'm misunderstanding you, that would require breaking each bin
> into a separate SQL statement, and since I'm trying to calculate more
> than 100 bins across the different fields, any improvement in the
> aggregate functions would be overwhelmed by the cost of repeating the
> actual query, which is about 170ms.
> Thanks,
> Doug
>


Re: [PERFORM] Random penalties on GIN index updates?

2009-10-21 Thread Robert Haas
On Wed, Oct 21, 2009 at 2:35 PM, Tom Lane wrote:
> Jesper Krogh writes:
>> What I seem to be missing is a way to make sure some "background"
>> application is the one getting the penalty, so that a random user doing a
>> single insert won't get stuck. Is that doable?
>
> You could force a vacuum every so often, but I don't think that will
> help the locking situation.  You really need to back off work_mem ---
> 512MB is probably not a sane global value for that anyway.

Yeah, it's hard to imagine a system where that doesn't threaten all
kinds of other bad results.  I bet setting this to 4MB will make this
problem largely go away.

Arguably we shouldn't be using work_mem to control this particular
behavior, but...

...Robert



Re: [PERFORM] Random penalties on GIN index updates?

2009-10-21 Thread Jesper Krogh
Robert Haas wrote:
> On Wed, Oct 21, 2009 at 2:35 PM, Tom Lane wrote:
>> Jesper Krogh writes:
>>> What I seem to be missing is a way to make sure some "background"
>>> application is the one getting the penalty, so that a random user doing a
>>> single insert won't get stuck. Is that doable?
>> You could force a vacuum every so often, but I don't think that will
>> help the locking situation.  You really need to back off work_mem ---
>> 512MB is probably not a sane global value for that anyway.
> 
> Yeah, it's hard to imagine a system where that doesn't threaten all
> kinds of other bad results.  I bet setting this to 4MB will make this
> problem largely go away.
> 
> Arguably we shouldn't be using work_mem to control this particular
> behavior, but...

I came from Xapian, where you can only have one writer process, but
batching up several GBs improved indexing performance dramatically.
Lowering work_mem to 16MB gives "batches" of 11,000 documents and stalls
of between 45 and 90s, i.e. ~33 docs/s.

-- 
Jesper



Re: [PERFORM] maintain_cluster_order_v5.patch

2009-10-21 Thread Heikki Linnakangas
ph...@apra.asso.fr wrote:
> Hi Jeff,
>> If you can help (either benchmark work or C coding), try reviving the
>> features by testing them and merging them with the current tree.
> OK, that's the rule of the game in such a community.
> I am not a good C writer, but I will see what I could do.

The FSM rewrite in 8.4 opened up more options for implementing this. The
patch used to check the index for the block the nearest key is stored
in, read that page in, and insert there if there's enough free space on
it. with the new FSM, you can check how much space there is on that
particular page before fetching it. And if it's full, the new FSM data
structure can be searched for a page with enough free space as close as
possible to the old page, although there's no interface to do that yet.

A completely different line of attack would be to write a daemon that
concurrently moves tuples in order to keep the table clustered. It would
interfere with UPDATEs and DELETEs, and ctids of the tuples would
change, but for many use cases it would be just fine. We discussed a
utility like that as a replacement for VACUUM FULL on hackers a while
ago, see thread "Feedback on getting rid of VACUUM FULL". A similar
approach would work here, the logic for deciding which tuples to move
and where would just be different.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
