Re: [PATCHES] Synchronized scans

2007-06-04 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> The problem is, I think people would be more frustrated by 1 in 1000
> queries starting the scan in the wrong place because a hint was deleted,

Yeah --- various people have been complaining recently about how we have
good average performance and bad worst case, and this behavior would
definitely be more of the same.  So I'm kind of backing away from the
idea of deleting the hint.  But if we could change the hint behavior to
say "start reading here", successive short LIMITed reads would all start
reading from the same point, which fixes both my reproducibility concern
and Heikki's original point about being able to re-use cached data.
You'd only get into the irreproducible behavior if the LIMIT was larger
than the amount of data scanned between hint updates.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 18:25 -0400, Tom Lane wrote:
> But note that barring backend crash, once all the scans are done it is
> guaranteed that the hint will be removed --- somebody will be last to
> update the hint, and therefore will remove it when they do heap_endscan,
> even if others are not quite done.  This is good in the sense that
> later-starting backends won't be fooled into starting at what is
> guaranteed to be the most pessimal spot, but it's got a downside too,
> which is that there will be windows where seqscans are in process but
> a newly started scan won't see them.  Maybe that's a killer objection.

I don't think it would be a major objection. If there aren't other
sequential scans in progress, the point is moot, and if there are:
(a) the hint has a lower probability of being removed, since it may
contain the PID of one of those other scans.
(b) the hint is likely to be replaced quite quickly

The problem is, I think people would be more frustrated by 1 in 1000
queries starting the scan in the wrong place because a hint was deleted,
because that could cause a major difference in performance. I expect the
current patch would have more consistent performance for that reason.

To me, it seems to be a small benefit and a small cost. It's hard for me
to feel very strongly either way.

> When exactly is the hint updated?  I gathered from something Heikki said
> that it's set after processing X amount of data, but I think it might be
> better to set it *before* processing X amount of data.  That is, the
> hint means "I'm going to be scanning at least  blocks
> starting here", not "I have scanned  blocks ending here",
> which seems like the interpretation that's being used at the moment.
> What that would mean is that successive "LIMIT 1000" calls would in fact
> all start at the same place, barring interference from other backends.
> 

If I understand correctly, this is a one-page difference in the report
location, right? We can either report that we've just finished scanning
block 1023 (ending an X block chunk of reading) and another backend can
start scanning at 1023 (current behavior); or we could report that we're
about to scan an X block chunk of data starting with block 1024, and the
new scan can start at 1024. We don't want the new scan to jump in ahead
of the existing scan, because then we're introducing uncached blocks
between the two scans -- risking divergence. 
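To pin down the two conventions, here is a minimal sketch; the function
name ss_report_location and the 16-page chunk size are assumptions for
illustration, not necessarily what the patch uses:

typedef unsigned int BlockNumber;

#define CHUNK_BLOCKS 16			/* e.g. 128 KB of 8 KB pages */

void ss_report_location(BlockNumber blk);	/* publish hint for other scans */

/* Current behavior: after finishing a chunk, report the last block we
 * read, so a scan joining now starts at block 1023 in the example. */
static void
report_after_chunk(BlockNumber just_read)
{
	if ((just_read + 1) % CHUNK_BLOCKS == 0)
		ss_report_location(just_read);
}

/* Alternative: before starting a chunk, report the block we are about
 * to read, so a joining scan starts at block 1024 and never jumps ahead
 * of the reporting scan into uncached pages. */
static void
report_before_chunk(BlockNumber about_to_read)
{
	if (about_to_read % CHUNK_BLOCKS == 0)
		ss_report_location(about_to_read);
}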

If the data occupies less than X data pages, the LIMIT queries will be
deterministic for single-scans anyway, because no reports will happen
(other than the starting location, which won't matter in this case).

If the data is more than that, then at least one report would have
happened. At this point, you're talking about rewinding the scan (how
far?), which I originally coded for with sync_seqscan_offset. That
feature didn't prove very useful (yet), so I removed it. 

Regards,
Jeff Davis







Re: [PATCHES] Synchronized scans

2007-06-04 Thread Gregory Stark

"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:

> LIMIT without ORDER BY is worse because it not only returns tuples in
> different order, but it can return different tuples altogether when you
> run it multiple times.

I don't think printing a notice is feasible. For regular DML a notice or info
message is basically equivalent to an error. It makes it entirely impractical
to use such a query in a production system where it will spam your logs with
thousands of messages.

Now it might be worth considering making this an error. We already make it an
error to have DISTINCT ON without a matching ORDER BY or a full outer join
without a merge-joinable operator. Both of those are really implementation
limitations but they're also arguably conceptual limitations in that it's hard
(though not impossible) to imagine it working any differently.

However... there are circumstances where having non-deterministic queries is
perfectly reasonable. Perhaps you're implementing a work queue and want to
process a batch of records but don't care which. Or perhaps you want to cap
the number of records a search result will display to a reasonable value that
won't cause problems but you expect under normal conditions not to reach that
limit.

There are also cases where people use OFFSET and LIMIT with degenerate values
(either OFFSET 0, or a LIMIT larger than the possible result size) to induce
the planner to plan queries differently, knowing that it won't actually change
the results.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PATCHES] Synchronized scans

2007-06-04 Thread Gregory Stark

"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:

> Were you thinking of storing the PID of the backend that originally
> created the hint, or updating the PID every time the hint is updated? In
> any case, we still wouldn't know if there's other scanners still running.

My reaction was that if you always put the PID in when updating the hint, it
might work out pretty well. When you finish, you remove the hint iff the PID
is your own.

It has exactly the same properties you describe of leaving a small window with
no hint when you finish a scan, but only if you were the last to update the
scan position, which I would expect to be pretty rare except for the last
scanner.

If a backend died it would leave a scan position behind but the next scanner
on that table would overwrite the pid and then remove it when it's finished.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PATCHES] Synchronized scans

2007-06-04 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> My thought was that every time the location was reported by a backend,
> it would store 3 pieces of information, not 2:
>  * relfilenode
>  * the PID of the backend that created or updated this particular hint last
>  * the location

> Then, on heap_endscan() (if that's the right place), we find the hint,
> and if the PID matches, we remove it. If not, it does nothing.

> This would only matter when there weren't other scans. When concurrent
> scans were happening, chances are the PID wouldn't match anyway, and
> thus not be removed.

But note that barring backend crash, once all the scans are done it is
guaranteed that the hint will be removed --- somebody will be last to
update the hint, and therefore will remove it when they do heap_endscan,
even if others are not quite done.  This is good in the sense that
later-starting backends won't be fooled into starting at what is
guaranteed to be the most pessimal spot, but it's got a downside too,
which is that there will be windows where seqscans are in process but
a newly started scan won't see them.  Maybe that's a killer objection.

When exactly is the hint updated?  I gathered from something Heikki said
that it's set after processing X amount of data, but I think it might be
better to set it *before* processing X amount of data.  That is, the
hint means "I'm going to be scanning at least  blocks
starting here", not "I have scanned  blocks ending here",
which seems like the interpretation that's being used at the moment.
What that would mean is that successive "LIMIT 1000" calls would in fact
all start at the same place, barring interference from other backends.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Michael Glaesemann


On Jun 4, 2007, at 16:34 , Heikki Linnakangas wrote:

> LIMIT without ORDER BY is worse because it not only returns tuples in
> different order, but it can return different tuples altogether when you
> run it multiple times.


Wouldn't DISTINCT ON suffer from the same issue without ORDER BY?

Michael Glaesemann
grzm seespotcode net





Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 22:57 +0100, Heikki Linnakangas wrote:
> > That's what I thought at first, and why I didn't do it. Right now I'm
> > thinking we could just add the PID to the hint, so that it would only
> > remove its own hint. Would that work?
> 
> Were you thinking of storing the PID of the backend that originally 
> created the hint, or updating the PID every time the hint is updated? In 
> any case, we still wouldn't know if there's other scanners still running.
> 

My thought was that every time the location was reported by a backend,
it would store 3 pieces of information, not 2:
 * relfilenode
 * the PID of the backend that created or updated this particular hint last
 * the location

Then, on heap_endscan() (if that's the right place), we find the hint,
and if the PID matches, we remove it. If not, it does nothing.

This would only matter when there weren't other scans. When concurrent
scans were happening, chances are the PID wouldn't match anyway, and
thus not be removed.
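As a sketch of that scheme (the struct layout and function names here are
illustrative, not taken from the patch):

typedef unsigned int Oid;
typedef unsigned int BlockNumber;

#define InvalidOid	((Oid) 0)

/* One hint per recently-scanned relation, kept in shared memory. */
typedef struct ss_hint
{
	Oid			relfilenode;	/* table this hint belongs to */
	int			pid;			/* backend that last updated the hint */
	BlockNumber	location;		/* latest reported scan position */
} ss_hint;

/* On every location report, the reporting backend takes ownership. */
static void
ss_update_hint(ss_hint *hint, Oid rel, int mypid, BlockNumber blk)
{
	hint->relfilenode = rel;
	hint->pid = mypid;
	hint->location = blk;
}

/* Called from heap_endscan(): only the current owner removes the hint,
 * so with concurrent scans the hint usually survives until the last
 * reporter finishes. */
static void
ss_end_scan(ss_hint *hint, Oid rel, int mypid)
{
	if (hint->relfilenode == rel && hint->pid == mypid)
		hint->relfilenode = InvalidOid;		/* mark the slot unused */
}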

Regards,
Jeff Davis






Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Jeff Davis wrote:
> On Mon, 2007-06-04 at 22:09 +0100, Heikki Linnakangas wrote:
>>> I think the real problem here is that the first scan is leaving state
>>> behind that changes the behavior of the next scan.  Which can have no
>>> positive benefit, since obviously the first scan is not still
>>> proceeding; the best you can hope for is that it's a no-op and the worst
>>> case is that it actively pessimizes things.  Why doesn't the patch
>>> remove the shmem entry at scan termination?
>>
>> Because there's no reason why it should, and it would require a lot more
>> bookkeeping. There can be many scanners on the same table, so we'd need
>> to implement some kind of reference counting, which means having to
>> reliably decrement the counter when a scan terminates.
>
> That's what I thought at first, and why I didn't do it. Right now I'm
> thinking we could just add the PID to the hint, so that it would only
> remove its own hint. Would that work?

Were you thinking of storing the PID of the backend that originally
created the hint, or updating the PID every time the hint is updated? In
any case, we still wouldn't know if there's other scanners still running.

We could just always remove the hint when a scan ends, and rely on the
fact that if there's other scans still running they will put the hint
back very quickly. There would then be a small window where there's no
hint but a scan is active, and a new scan starting during that window
would fail to synchronize with the other scanners.

> It's still vulnerable to a backend being killed and the hint hanging
> around. However, the next scan would clear it and get things back to
> normal.

Oh, did you mean that the PID would be updated whenever a new scan
starts? So that the PID stored would always be the PID of the latest
scanner. That might work pretty well, though a small scan with a LIMIT,
or any other situation where some scans run faster than others, might
clear the hint prematurely while other scans are still running.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 22:09 +0100, Heikki Linnakangas wrote:
> > I think the real problem here is that the first scan is leaving state
> > behind that changes the behavior of the next scan.  Which can have no
> > positive benefit, since obviously the first scan is not still
> > proceeding; the best you can hope for is that it's a no-op and the worst
> > case is that it actively pessimizes things.  Why doesn't the patch
> > remove the shmem entry at scan termination?
> 
> Because there's no reason why it should, and it would require a lot more 
> bookkeeping. There can be many scanners on the same table, so we'd need 
> to implement some kind of reference counting, which means having to 
> reliably decrement the counter when a scan terminates.
> 

That's what I thought at first, and why I didn't do it. Right now I'm
thinking we could just add the PID to the hint, so that it would only
remove its own hint. Would that work?

It's still vulnerable to a backend being killed and the hint hanging
around. However, the next scan would clear it and get things back to normal.
Reference counting would cause weirdness if a backend died or something,
because it would never decrement to 0.

> In any case if there actually is a concurrent scan, you'd still see 
> different ordering. Removing the entry when a scan is over would just 
> make it harder to trigger.
> 

Agreed. I don't know for sure whether that's good or bad, but it would
make the nondeterminism less immediately visible.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Michael Glaesemann wrote:
>> I think the warning on LIMIT without ORDER BY is a good idea,
>> regardless of the synchronized scans patch.
>
> I'm not saying this isn't a good idea, but are there other places where
> there might be gotchas for the unwary, such as DISTINCT without ORDER BY
> or (for an unrelated example) UNION versus UNION ALL? How many of these
> types of messages would be useful?

LIMIT without ORDER BY is worse because it not only returns tuples in
different order, but it can return different tuples altogether when you
run it multiple times.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 16:42 -0400, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > I don't think anyone can reasonably expect to get the same ordering when 
> > the same query is issued twice in general, but within the same transaction 
> > it wouldn't be that unreasonable. If we care about that, we could keep 
> > track of starting locations per transaction, only do the synchronization 
> > on the first scan in a transaction, and start subsequent scans from the 
> > same page as the first one.
> 
> I think the real problem here is that the first scan is leaving state
> behind that changes the behavior of the next scan.  Which can have no
> positive benefit, since obviously the first scan is not still
> proceeding; the best you can hope for is that it's a no-op and the worst
> case is that it actively pessimizes things.  Why doesn't the patch
> remove the shmem entry at scan termination?
> 

Sounds like a reasonable idea to me. We could add the PID to the data
structure so that it would only remove the hint if it's the one that set
the hint. I think we'd just need to call a function to do that from
heap_endscan(), correct?

However, we couldn't 100% guarantee that the state would be cleared. A
backend could be killed in the middle of a scan.

The case you're worried about seems very narrow to me, and I think
"actively pessimizes" is too strong. Unless I misunderstand, the "best
you can hope for" no-op happens in all cases except a most bizarre one:
one in which you're executing repeated identical LIMIT queries with no
ORDER BY; the tuples returned occupy more than 128KB (16 pages is the
reporting period) but few enough that caching them would be effective;
and the table in question is larger than the large-table threshold. I'm
just trying to add some perspective about what we're fixing here.
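(For concreteness, assuming the usual 8KB pages: the reporting period of
16 pages is 128KB, and with the default 32MB shared_buffers the
NBuffers / 4 large-table threshold works out to 1024 buffers, i.e. tables
over about 8MB --- which is why the ~10MB table in the regression test
elsewhere in this thread is large enough to trigger synchronized scans.)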

But it's fair to say that a scan should clear any state when it's done.

Regards,
Jeff Davis






Re: [PATCHES] Synchronized scans

2007-06-04 Thread Michael Glaesemann


On Jun 4, 2007, at 15:24 , Heikki Linnakangas wrote:

> I don't think anyone can reasonably expect to get the same ordering
> when the same query is issued twice in general, but within the same
> transaction it wouldn't be that unreasonable.

The order in which rows are returned without an ORDER BY clause *is*
implementation dependent, and is not guaranteed, at least by the spec.
Granted, LIMIT without ORDER BY (and DISTINCT for that matter) brings
this into sharp relief.

> I think the warning on LIMIT without ORDER BY is a good idea,
> regardless of the synchronized scans patch.

I'm not saying this isn't a good idea, but are there other places where
there might be gotchas for the unwary, such as DISTINCT without ORDER BY
or (for an unrelated example) UNION versus UNION ALL? How many of these
types of messages would be useful?


Michael Glaesemann
grzm seespotcode net





Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> I don't think anyone can reasonably expect to get the same ordering when
>> the same query is issued twice in general, but within the same transaction
>> it wouldn't be that unreasonable. If we care about that, we could keep
>> track of starting locations per transaction, only do the synchronization
>> on the first scan in a transaction, and start subsequent scans from the
>> same page as the first one.
>
> I think the real problem here is that the first scan is leaving state
> behind that changes the behavior of the next scan.  Which can have no
> positive benefit, since obviously the first scan is not still
> proceeding; the best you can hope for is that it's a no-op and the worst
> case is that it actively pessimizes things.  Why doesn't the patch
> remove the shmem entry at scan termination?

Because there's no reason why it should, and it would require a lot more
bookkeeping. There can be many scanners on the same table, so we'd need
to implement some kind of reference counting, which means having to
reliably decrement the counter when a scan terminates.

In any case if there actually is a concurrent scan, you'd still see
different ordering. Removing the entry when a scan is over would just
make it harder to trigger.

>> I think the warning on LIMIT without ORDER BY is a good idea, regardless
>> of the synchronized scans patch.
>
> I seriously doubt that can be done in any way that doesn't both warn
> about perfectly-safe cases and fail to warn about other unsafe ones.
> Furthermore, it's not uncommon for people to do "SELECT * ... LIMIT 1"
> just to remind themselves of column names or whatever.  Do we really
> want the thing to be so nannyish?

It really depends on how many false negatives and positives it gives. If
too many, it's just annoying, but if it's reasonably accurate I think it
would be OK to remind people running queries like that.

> I was envisioning simply a stronger warning in the SELECT reference page ...

I doubt the people that would be bitten by this read the SELECT
reference page.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> I don't think anyone can reasonably expect to get the same ordering when 
> the same query is issued twice in general, but within the same transaction 
> it wouldn't be that unreasonable. If we care about that, we could keep 
> track of starting locations per transaction, only do the synchronization 
> on the first scan in a transaction, and start subsequent scans from the 
> same page as the first one.

I think the real problem here is that the first scan is leaving state
behind that changes the behavior of the next scan.  Which can have no
positive benefit, since obviously the first scan is not still
proceeding; the best you can hope for is that it's a no-op and the worst
case is that it actively pessimizes things.  Why doesn't the patch
remove the shmem entry at scan termination?

> I think the warning on LIMIT without ORDER BY is a good idea, regardless 
> of the synchronized scans patch.

I seriously doubt that can be done in any way that doesn't both warn
about perfectly-safe cases and fail to warn about other unsafe ones.
Furthermore, it's not uncommon for people to do "SELECT * ... LIMIT 1"
just to remind themselves of column names or whatever.  Do we really
want the thing to be so nannyish?  I was envisioning simply a stronger
warning in the SELECT reference page ...

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Jeff Davis wrote:
> No surprise here, as you and Bruce have already pointed out.
>
> If we wanted to reduce the occurrence of this phenomenon, we could
> perhaps "time out" the hints so that it's impossible to pick up a hint
> from a scan that finished 5 minutes ago.
>
> It doesn't seem helpful to further obscure the non-determinism of the
> results, however.

I agree it's probably not a good idea to try masking this. It'll just
make it harder to hit the issue in testing, so that you run into it in
production.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Alvaro Herrera wrote:
> Tom Lane wrote:
>> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>>> For the record, this patch has a small negative impact on scans like
>>> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS
>>> HEAD the first 1000 rows will stay in buffer cache, but with the patch
>>> each scan will start from roughly where the previous one stopped,
>>> requiring more pages to be read from disk each time. I don't think it's
>>> something to worry about in practice, but I thought I'd mention it.
>>
>> Urgh.  The answers change depending on (more or less) the phase of the
>> moon?  I've got a serious problem with that.  You might look back to
>> 1997 when GEQO very nearly got tossed out entirely because it destroyed
>> reproducibility of query results.
>
> What about the simple idea of just disabling the use of a sync scan when
> the query has LIMIT and no ORDER BY, and always starting at block 0 in
> that case?

That handles the LIMIT case, but you would still observe the different
ordering. And some people do LIMIT-like behavior on the client side, by
opening a cursor and only fetching the first n rows.

I don't think anyone can reasonably expect to get the same ordering when
the same query is issued twice in general, but within the same transaction
it wouldn't be that unreasonable. If we care about that, we could keep
track of starting locations per transaction, only do the synchronization
on the first scan in a transaction, and start subsequent scans from the
same page as the first one. That way if you issue the same query twice
in a transaction, or do something like:

BEGIN;
SELECT * FROM queue FOR UPDATE LIMIT 10;
... do stuff ...
DELETE FROM queue LIMIT 10;
COMMIT;

you'd get the expected result.

I think the warning on LIMIT without ORDER BY is a good idea, regardless
of the synchronized scans patch.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Alvaro Herrera
Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > For the record, this patch has a small negative impact on scans like 
> > "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> > HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> > each scan will start from roughly where the previous one stopped, requiring 
> > more pages to be read from disk each time. I don't think it's something 
> > to worry about in practice, but I thought I'd mention it.
> 
> Urgh.  The answers change depending on (more or less) the phase of the
> moon?  I've got a serious problem with that.  You might look back to
> 1997 when GEQO very nearly got tossed out entirely because it destroyed
> reproducibility of query results.

What about the simple idea of just disabling the use of a sync scan when
the query has LIMIT and no ORDER BY, and always starting at block 0 in that
case?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 10:53 +0100, Heikki Linnakangas wrote:
> I'm now done with this patch and testing it.
> 

Great!

> For the record, this patch has a small negative impact on scans like 
> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> each scan will start from roughly where the previous one stopped, requiring 
> more pages to be read from disk each time. I don't think it's something 
> to worry about in practice, but I thought I'd mention it.
> 

No surprise here, as you and Bruce have already pointed out.

If we wanted to reduce the occurrence of this phenomenon, we could
perhaps "time out" the hints so that it's impossible to pick up a hint
from a scan that finished 5 minutes ago.

It doesn't seem helpful to further obscure the non-determinism of the
results, however.
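For concreteness, that time-out idea might look something like this
(a rough sketch; the struct, field, and constant names are assumed, not
taken from the patch):

#include <time.h>

#define HINT_TIMEOUT_SECS 300	/* ignore hints older than 5 minutes */

typedef unsigned int BlockNumber;

typedef struct ss_hint
{
	BlockNumber	location;		/* latest reported scan position */
	time_t		last_update;	/* refreshed on every location report */
} ss_hint;

/* A starting scan would use the hint only if it is fresh enough. */
static BlockNumber
ss_get_start(const ss_hint *hint)
{
	if (time(NULL) - hint->last_update <= HINT_TIMEOUT_SECS)
		return hint->location;
	return 0;					/* stale hint: start at the beginning */
}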

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
>> As I understand it, the problem is that while currently LIMIT without
>> ORDER BY always starts at the beginning of the table, it will not with
>> this patch.  I consider that acceptable.
>
> It's definitely going to require stronger warnings than we have now
> about using LIMIT without ORDER BY.

Along the lines of

NOTICE: LIMIT without ORDER BY returns an arbitrary set of matching rows

perhaps? I wonder how easy it is to detect that in the planner.

Or just a remark in the manual?
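A naive planner check might look like this rough sketch (whether these
are the right fields to test, and where, is exactly the open question;
it would also warn about harmless cases, such as a LIMIT fed by an
ordered subquery):

/* somewhere while planning the top-level Query *parse: */
if (parse->limitCount != NULL && parse->sortClause == NIL)
	ereport(NOTICE,
			(errmsg("LIMIT without ORDER BY returns an arbitrary "
					"set of matching rows")));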

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> As I understand it, the problem is that while currently LIMIT without
> ORDER BY always starts at the beginning of the table, it will not with
> this patch.  I consider that acceptable.

It's definitely going to require stronger warnings than we have now
about using LIMIT without ORDER BY.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Bruce Momjian
Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> >> For the record, this patch has a small negative impact on scans like 
> >> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> >> HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> >> each scan will start from roughly where the previous one stopped, requiring 
> >> more pages to be read from disk each time. I don't think it's something 
> >> to worry about in practice, but I thought I'd mention it.
> > 
> > Urgh.  The answers change depending on (more or less) the phase of the
> > moon?  I've got a serious problem with that.  You might look back to
> > 1997 when GEQO very nearly got tossed out entirely because it destroyed
> > reproducibility of query results.
> 
> That's a very fundamental result of this patch, unfortunately. It only 
> happens on scans on tables larger than the threshold. And because we 
> only report the current scan location every 128KB, if you repeat the 
> same SELECT .. LIMIT X query with no other scanners on that table, 
> you'll get the same results as long as X is smaller than 128KB.
> 
> I thought we've been through this issue already...

Agreed.  I thought we always said that a LIMIT without an ORDER BY was
meaningless, particularly because an intervening UPDATE could have moved
rows to another place in the table.  In fact, at one time we considered
preventing LIMIT without ORDER BY because it was meaningless, but decided
that if people want unstable results, they should be able to get them.

An argument could be made that a LIMIT without ORDER BY on a table
locked read-only should be stable.

As I understand it, the problem is that while currently LIMIT without
ORDER BY always starts at the beginning of the table, it will not with
this patch.  I consider that acceptable.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
>> For the record, this patch has a small negative impact on scans like
>> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS
>> HEAD the first 1000 rows will stay in buffer cache, but with the patch
>> each scan will start from roughly where the previous one stopped,
>> requiring more pages to be read from disk each time. I don't think it's
>> something to worry about in practice, but I thought I'd mention it.
>
> Urgh.  The answers change depending on (more or less) the phase of the
> moon?  I've got a serious problem with that.  You might look back to
> 1997 when GEQO very nearly got tossed out entirely because it destroyed
> reproducibility of query results.

That's a very fundamental result of this patch, unfortunately. It only
happens on scans on tables larger than the threshold. And because we
only report the current scan location every 128KB, if you repeat the
same SELECT .. LIMIT X query with no other scanners on that table,
you'll get the same results as long as X is smaller than 128KB.

I thought we've been through this issue already...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> For the record, this patch has a small negative impact on scans like 
> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> each scan will start from roughly where the previous one stopped, requiring 
> more pages to be read from disk each time. I don't think it's something 
> to worry about in practice, but I thought I'd mention it.

Urgh.  The answers change depending on (more or less) the phase of the
moon?  I've got a serious problem with that.  You might look back to
1997 when GEQO very nearly got tossed out entirely because it destroyed
reproducibility of query results.

regards, tom lane



[PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

I'm now done with this patch and testing it.


I fixed a little off-by-one in the "backward scan, not inited" branch, but I 
was unable to test it. It seems that code is actually never used because 
that case is optimized to a rewind in the executor. I marked those 
seemingly unreachable places in the code with a comment.


I didn't touch the large scan threshold of NBuffers / 4 that Tom
committed as part of the buffer ring patch. IOW I removed the GUC
variable from the patch. I think the jury is still out on this one.


I included a basic regression test as well. It creates a ~10MB table,
which with the default 32MB shared_buffers setting is large enough that
synchronized scans are used. It then runs a query with a LIMIT so that
it scans ~1/2 of the table, and then runs a new seqscan and checks that
it returns rows from the second half of the table. This is a bit flaky,
as the needed table size depends on the large scan threshold, and we
can't test the actual concurrent behavior, but it's better than nothing.
10 MB works for "make check", but isn't enough if one runs "installcheck"
against an existing installation with a larger shared_buffers. I
therefore only added the test case to the parallel_schedule, though it
still breaks "installcheck-parallel". If we later add the GUC variable,
that should be used in the test case.


For the record, this patch has a small negative impact on scans like 
"SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
HEAD the first 1000 rows will stay in buffer cache, but with the patch 
each scan will start from roughly where the previous one stopped, requiring 
more pages to be read from disk each time. I don't think it's something 
to worry about in practice, but I thought I'd mention it.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/heap/Makefile
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/Makefile,v
retrieving revision 1.15
diff -c -r1.15 Makefile
*** src/backend/access/heap/Makefile	8 Apr 2007 01:26:27 -	1.15
--- src/backend/access/heap/Makefile	31 May 2007 10:21:28 -
***
*** 12,18 
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o
  
  all: SUBSYS.o
  
--- 12,18 
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o syncscan.o
  
  all: SUBSYS.o
  
Index: src/backend/access/heap/heapam.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.234
diff -c -r1.234 heapam.c
*** src/backend/access/heap/heapam.c	30 May 2007 20:11:53 -	1.234
--- src/backend/access/heap/heapam.c	4 Jun 2007 08:41:56 -
***
*** 85,104 
  
  	/*
  	 * If the table is large relative to NBuffers, use a bulk-read access
! 	 * strategy, else use the default random-access strategy.  During a
! 	 * rescan, don't make a new strategy object if we don't have to.
  	 */
  	if (scan->rs_nblocks > NBuffers / 4 &&
  		!scan->rs_rd->rd_istemp)
  	{
  		if (scan->rs_strategy == NULL)
  			scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
  	}
  	else
  	{
  		if (scan->rs_strategy != NULL)
  			FreeAccessStrategy(scan->rs_strategy);
  		scan->rs_strategy = NULL;
  	}
  
  	scan->rs_inited = false;
--- 85,108 
  
  	/*
  	 * If the table is large relative to NBuffers, use a bulk-read access
! 	 * strategy and enable synchronized scanning (see syncscan.c). 
! 	 * During a rescan, don't make a new strategy object if we don't have to.
  	 */
  	if (scan->rs_nblocks > NBuffers / 4 &&
  		!scan->rs_rd->rd_istemp)
  	{
  		if (scan->rs_strategy == NULL)
  			scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
+ 
+ 		scan->rs_startpage = ss_get_location(scan);
  	}
  	else
  	{
  		if (scan->rs_strategy != NULL)
  			FreeAccessStrategy(scan->rs_strategy);
  		scan->rs_strategy = NULL;
+ 
+ 		scan->rs_startpage = 0;
  	}
  
  	scan->rs_inited = false;
***
*** 229,234 
--- 233,239 
  	Snapshot	snapshot = scan->rs_snapshot;
  	bool		backward = ScanDirectionIsBackward(dir);
  	BlockNumber page;
+ 	BlockNumber prevpage;
  	Page		dp;
  	int			lines;
  	OffsetNumber lineoff;
***
*** 251,257 
  tuple->t_data = NULL;
  return;
  			}
! 			page = 0;			/* first page */
  			heapgetpage(scan, page);
  			lineoff = FirstOffsetNumber;		/* first offnum */
  			scan->rs_inited = true;
--- 256,262 
  tuple->t_data = NULL;
  return;
  			}
! 			page = scan->rs_startpage;			/* first page */
  			heapgetpage(scan, page);
  			lineoff = FirstOffsetNumber;		/* first offnum */
  			scan->rs_inited = true;
***
*** 276,281 
--- 281,290 
  	{
  		if (!scan->rs_inited)
  		

Re: [PATCHES] Synchronized Scan WIP patch

2007-06-04 Thread Heikki Linnakangas

Jeff Davis wrote:
> On Thu, 2007-05-31 at 09:08 +0100, Heikki Linnakangas wrote:
>> * moved the sync scan stuff to a new file access/heapam/syncscan.c.
>> heapam.c is long enough already, and in theory the same mechanism could
>> be used for large bitmap heap scans in the future.
>
> Good idea, I hadn't thought of that. It seems like the bitmaps in two
> bitmap scans would have to match very closely, but that sounds
> plausible.

Yeah, it's a pretty narrow use case, but plausible in theory.

> This is similar to another idea I had considered (I forget who thought
> of it) to try to have a bitmap of "tuples still needed" and then try to
> optimize based on that information somehow (read the ones in cache
> first, etc). Seems substantially more complex though, more like a
> prefetch system at that point.
>
> I expected the general refactoring. Hopefully my next patch is a little
> closer to the code expectations and places less burden on the reviewers.

No worries, that's the easy part.

>> Testing:
>> * Multiple scans on different tables, causing movement in the LRU list
>> * Measure the CPU overhead for a single scan
>> * Measure the lock contention with multiple scanners
>
> Is there any way to measure the necessity of the hash table? I would
> think the conditions for that would be a large number of tables being
> actively scanned causing a lot of LRU activity such that the locks are
> held too long.

Yep, and it's hard to imagine a system like that.

> I also think the optimization of only reporting when the block is not
> found in cache would be useful to test if the lock contention is a
> problem.

I tried to demonstrate lock contention by running 10 backends all
repeatedly scanning different tables that are bigger than the sync scan
threshold but small enough to all fit in OS cache. The total runtime of
the tests was the same, ~45 s, with and without the patch. That's pretty
much the worst case scenario I could think of, so it seems that
contention of the SyncScanLock is not an issue. There were some "missed
updates" of the scan location, due to the LWLockConditionalAcquire that
I put there to reduce lock contention, but not too much to worry about.
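The reporting path that produces those missed updates looks roughly like
this (a sketch of the approach, not the patch verbatim; ss_update_hint is
a hypothetical helper):

/* Report the current scan location, but never wait for the lock: if
 * another backend is updating the hint right now, skip this report (a
 * "missed update") and let the next report, 128 KB later, catch up. */
static void
ss_report_location(Oid relfilenode, BlockNumber location)
{
	if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
	{
		ss_update_hint(relfilenode, location);
		LWLockRelease(SyncScanLock);
	}
}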


I'll post an updated patch shortly..

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
