Re: [HACKERS] Fwd: patch: format function - fixed oid

2010-11-20 Thread Pavel Stehule
Hello

2010/11/20 Jeff Janes jeff.ja...@gmail.com:
 On Thu, Nov 18, 2010 at 11:54 PM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 -- Forwarded message --
 From: Pavel Stehule pavel.steh...@gmail.com
 Date: 2010/11/18
 Subject: Re: patch: format function, next generation
 To: Jeff Janes jeff.ja...@gmail.com
 Cc: pgsql-hackers-ow...@postgresql.org


 Hello

 somebody took my oid :)

 updated patch is in attachment

 Regards

 Pavel Stehule


 Dear Pavel and Hackers,

 I've reviewed this patch.  It applies, builds, and passes make check.
 It has added regression tests that seem appropriate.  I think the
 feature added matches the consensus that emerged from the very long
 email discussion.  The C code seems fine (to my meager abilities to
 judge that).

 But I think the documentation does need some work.  From func.sgml:


         This functions can be used to create a formated string or
 message. There are allowed
         three types of tags: %s as string, %I as SQL identifiers and
 %L as SQL literals. Attention:
         result for %I and %L must not be same as result of
 <function>quote_ident</function> and
         <function>quote_literal</function> functions, because this
 function doesn't try to coerce
         parameters to <type>text</type> type and directly use a
 type's output functions. The placeholder
         can be related to some explicit parameter with using a
 optional n$ specification inside format.

 Should we make it explicit that this is inspired by C's sprintf?  Do
 we want to call them tags?  This is introducing what seems to be a
 new word to describe what are usually (I think) called conversion
 specifiers.


I am not a native speaker, so I invite any correction to the
documentation - anything I wrote is more or less just a skeleton for
somebody who can write better than me. I am not against a note that
this function is similar to the sprintf function, but then it should be
specified how this function is different - it is designed for the needs
of PL languages and environments. I have no problem with replacing
"tags" with "conversion" or "positional specifiers". Somewhere the
term "placeholder" is used??

 "Must not be the same" should be "Might not be the same".   However,
 it does not appear that quote_ident is willing to use coercion at all,
 and the %L behavior is more comparable to quote_nullable.

 Maybe:

 This function can be used to create a formatted string suitable for
 use as dynamic
 SQL or as a message.  There are three types of conversion specifiers:
 %s for literal strings, %I
 for SQL identifiers, and %L for SQL literals.  Note that the results
 of the %L conversion
 might not be the same as the results of the
 <function>quote_nullable</function> function, as
 the latter coerces its argument to <type>text</type> while
 <function>format</function>
 uses a type's output function.  A conversion can reference an explicit
 parameter position
 by using an optional n$ in the format specification.


I am for it

 Does "type's output function" need to cross-reference someplace?
 Coercion is described elsewhere in this section of the docs, but output
 functions are not.


output functions are probably described in some hacker-oriented parts:
http://www.postgresql.org/docs/9.0/interactive/xtypes.html


 And for the changes to plpgsql.sgml, I would propose:

    <para>
     Building a string for a dynamic SQL statement can be simplified
     by using the <function>format</function> function (see <xref
     linkend="functions-string">):
 <programlisting>
 EXECUTE format('UPDATE tbl SET %I = %L WHERE key = %L', colname,
 newvalue, keyvalue);
 </programlisting>
    The <function>format</function> function can be used together with
     the <literal>USING</literal> clause:
 <programlisting>
 EXECUTE format('UPDATE tbl SET %I = $1 WHERE key = $2', colname)
   USING newvalue, keyvalue;
 </programlisting>
     This form is more efficient because the parameters
     <literal>newvalue</literal> and <literal>keyvalue</literal> are
 not converted to text.
    </para>


+1


 These are mostly grammatical changes, but with the last three lines I
 may have missed the meaning of what you originally intended--I'm not
 sure on that.


I think you are correct. Who will submit this final version? You or me?

Thank you very much

Regards

Pavel Stehule



 Thanks,

 Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Custom code int(32|64) => text conversions out of performance reasons

2010-11-20 Thread Robert Haas
On Fri, Nov 19, 2010 at 10:18 PM, Robert Haas robertmh...@gmail.com wrote:
 Sure thing.  Thanks for taking time to do this - very nice speedup.
 This part now committed, too.

It occurs to me belatedly that there might be a better way to do this.
 Instead of flipping value from negative to positive, with a special
case for the smallest possible integer, we could do it the other way
round.  And actually, I think we can get rid of neg, too.

if (value < 0)
*a++ = '-';
else
value = -value;
start = a;

Then we could just adjust the calculation of the actual digit.

*a++ = '0' + (-remainder);

Good idea?  Bad idea?  Seems cleaner to me, assuming it'll actually work...
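
For concreteness, a standalone sketch of that shape (the function name
and buffer contract here are illustrative, not the actual pg_ltoa API;
the sign of the remainder relies on C99 division semantics):

#include <stdio.h>

/*
 * Work in negative space: -INT_MIN overflows, but INT_MIN itself is
 * representable, so negate *positive* inputs instead and generate
 * digits from a nonpositive value.
 */
static void
itoa_neg_sketch(int value, char *a)
{
	char	   *start;
	char	   *end;

	if (value < 0)
		*a++ = '-';
	else
		value = -value;			/* now value <= 0, no overflow possible */
	start = a;

	do
	{
		int			remainder = value % 10; /* in [-9, 0] under C99 */

		value /= 10;
		*a++ = '0' + (-remainder);
	} while (value != 0);
	*a = '\0';

	/* digits came out in reverse order; swap them in place */
	for (end = a - 1; start < end; start++, end--)
	{
		char		tmp = *start;

		*start = *end;
		*end = tmp;
	}
}

int
main(void)
{
	char		buf[16];

	itoa_neg_sketch(-2147483647 - 1, buf);	/* INT_MIN */
	printf("%s\n", buf);					/* prints -2147483648 */
	return 0;
}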

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: patch: format function - fixed oid

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 2:59 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 I think you are correct. Who will submit this final version? You or me?

I've got it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: patch: format function - fixed oid

2010-11-20 Thread Pavel Stehule
2010/11/20 Robert Haas robertmh...@gmail.com:
 On Sat, Nov 20, 2010 at 2:59 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 I think you are correct. Who will submit this final version? You or
 me?

 I've got it.


Thank you

have a nice day

Pavel

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix for seg picksplit function

2010-11-20 Thread Yeb Havinga

On 2010-11-20 04:46, Robert Haas wrote:

On Tue, Nov 16, 2010 at 6:07 AM, Alexander Korotkov
aekorot...@gmail.com  wrote:

On Tue, Nov 16, 2010 at 3:07 AM, Robert Haasrobertmh...@gmail.com  wrote:

But on a broader note, I'm not very certain the sorting algorithm is
sensible.  For example, suppose you have 10 segments that are exactly
'0' and 20 segments that are exactly '1'.  Maybe I'm misunderstanding,
but it seems like this will result in a 15/15 split when we almost
certainly want a 10/20 split.  I think there will be problems in more
complex cases as well.  The documentation says about the less-than and
greater-than operators that "These operators do not make a lot of
sense for any practical purpose but sorting."

In order to illustrate a real problem we should think about
GiST behavior with a large enough amount of data. For example, I tried to
extrapolate this case to 10 of segs where 40% are (0,1) segs and 60% are
(1,2) segs. And this case doesn't seem to be a problem to me.

Well, the problem with just comparing on < is that it takes very
little account of the upper bounds.  I think the cases where a simple
split would hurt you the most are those where examining the upper
bound is necessary to get a good split.
With the current 8K default blocksize, I put my money on the sorting 
algorithm for any seg case. The r-tree algorithm's performance is 
probably more influenced by large buckets -> low tree depth -> generic 
keys on non-leaf nodes.


regards,
Yeb Havinga



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix for seg picksplit function

2010-11-20 Thread Alexander Korotkov
On Sat, Nov 20, 2010 at 6:46 AM, Robert Haas robertmh...@gmail.com wrote:

 Well, the problem with just comparing on < is that it takes very
 little account of the upper bounds.  I think the cases where a simple
 split would hurt you the most are those where examining the upper
 bound is necessary to get a good split.

Yes, and such an asymmetric solution doesn't seem beautiful enough to me :).
It's easy to sort segs by their center; that way the lower and upper bounds
are used equally. A new patch is attached. I checked it on various data
distributions.
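
The core of the idea as a simplified sketch (SEG here is a reduced
stand-in for the contrib/seg struct; the attached patch contains the
real comparator):

#include <stdlib.h>

typedef struct
{
	float		lower;
	float		upper;
} SEG;

/* order segments by center, so both bounds influence the split equally */
static int
seg_center_cmp(const void *a, const void *b)
{
	float		ca = (((const SEG *) a)->lower + ((const SEG *) a)->upper) / 2.0f;
	float		cb = (((const SEG *) b)->lower + ((const SEG *) b)->upper) / 2.0f;

	if (ca < cb)
		return -1;
	if (ca > cb)
		return 1;
	return 0;
}

/* usage: qsort(items, n, sizeof(SEG), seg_center_cmp); */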

1) Uniform distribution
test=# insert into seg_test (select (a || ' .. ' || a + 0.5*b)::seg from
(select random() as a, random() as b from generate_series(1,100)) x);
INSERT 0 100
Time: 79121,830 ms
test=# create index seg_test_idx on seg_test using gist (a);
CREATE INDEX
Time: 176409,434 ms
test=# explain (buffers, analyze) select * from seg_test where a @ '0.5 ..
0.5'::seg;
QUERY PLAN


--
 Bitmap Heap Scan on seg_test  (cost=28.19..2500.32 rows=1000 width=12)
(actual time=0.251..0.886 rows=27 loops=1)
   Recheck Cond: (a @ '0.5'::seg)
   Buffers: shared hit=3 read=27
   ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..27.94 rows=1000
width=0) (actual time=0.193..0.193 rows=27 loops=1)
 Index Cond: (a @ '0.5'::seg)
 Buffers: shared hit=3
 Total runtime: 1.091 ms
(7 rows)

Time: 41,884 ms

2) Natural distribution (Box–Muller transform was used for data generation)
test=# insert into seg_test (select ( a - 0.5*abs(b) || ' .. ' || a +
0.5*abs(b))::seg from (select
cos(2.0*pi()*random())*sqrt(-2.0*ln(random())) as a,
cos(2.0*pi()*random())*sqrt(-2.0*ln(random())) as b from
generate_series(1,100)) x);
INSERT 0 100
Time: 98614,305 ms
test=# create index seg_test_idx on seg_test using gist(a);
CREATE INDEX
Time: 212513,540 ms
test=# explain (buffers, analyze) select * from seg_test where a @ '0.3 ..
0.3'::seg;
QUERY PLAN


--
 Bitmap Heap Scan on seg_test  (cost=28.18..2500.31 rows=1000 width=12)
(actual time=0.132..0.428 rows=27 loops=1)
   Recheck Cond: (a @ '0.3'::seg)
   Buffers: shared hit=3 read=27
   ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..27.93 rows=1000
width=0) (actual time=0.103..0.103 rows=27 loops=1)
 Index Cond: (a @ '0.3'::seg)
 Buffers: shared hit=3
 Total runtime: 0.504 ms
(7 rows)

Time: 0,967 ms

3) Many distinct values
test=# insert into seg_test (select (a||'..'||(a+1))::seg from (select
(random()*13000)::integer as a from generate_series(1,100)) x);
INSERT 0 100
Time: 90775,952 ms
test=# create index seg_test_idx on seg_test using gist(a);
CREATE INDEX
Time: 200960,758 ms
test=# explain (buffers, analyze) select * from seg_test where a @ '700.0
.. 700.0'::seg;
QUERY PLAN


---
 Bitmap Heap Scan on seg_test  (cost=28.19..2500.33 rows=1000 width=12)
(actual time=0.358..3.531 rows=138 loops=1)
   Recheck Cond: (a @ '700.0'::seg)
   Buffers: shared hit=3 read=135
   ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..27.94 rows=1000
width=0) (actual time=0.270..0.270 rows=138 loops=1)
 Index Cond: (a @ '700.0'::seg)
 Buffers: shared hit=3
 Total runtime: 3.882 ms
(7 rows)

Time: 5,271 ms


With best regards,
Alexander Korotkov.
*** a/contrib/seg/seg.c
--- b/contrib/seg/seg.c
***
*** 292,329  gseg_penalty(GISTENTRY *origentry, GISTENTRY *newentry, float *result)
  	return (result);
  }
  
  
  
  /*
  ** The GiST PickSplit method for segments
! ** We use Guttman's poly time split algorithm
  */
  GIST_SPLITVEC *
  gseg_picksplit(GistEntryVector *entryvec,
  			   GIST_SPLITVEC *v)
  {
! 	OffsetNumber i,
! j;
! 	SEG		   *datum_alpha,
! 			   *datum_beta;
  	SEG		   *datum_l,
! 			   *datum_r;
! 	SEG		   *union_d,
! 			   *union_dl,
! 			   *union_dr;
! 	SEG		   *inter_d;
! 	bool		firsttime;
! 	float		size_alpha,
! size_beta,
! size_union,
! size_inter;
! 	float		size_waste,
! waste;
! 	float		size_l,
! size_r;
  	int			nbytes;
! 	OffsetNumber seed_1 = 1,
! seed_2 = 2;
  	OffsetNumber *left,
  			   *right;
  	OffsetNumber maxoff;
--- 292,340 
  	return (result);
  }
  
+ /*
+  * Auxiliary structure for picksplit method.
+  */
+ typedef struct
+ {
+ 	OffsetNumber index;
+ 	float center;
+ 	SEG *data;
+ } PickSplitSortItem;
  
+ /*
+  * Compare function for PickSplitSortItem based on seg_cmp.
+  */
+ static int
+ sort_item_cmp(const void *a, const void *b)
+ {
+ 	PickSplitSortItem *i1 = 

Re: [HACKERS] [PATCH] Custom code int(32|64) => text conversions out of performance reasons

2010-11-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 It occurs to me belatedly that there might be a better way to do this.
  Instead of flipping value from negative to positive, with a special
 case for the smallest possible integer, we could do it the other way
 round.  And actually, I think we can get rid of neg, too.

The trouble with that approach is that you have to depend on the
direction of rounding for negative quotients.  Which was unspecified
before C99, and it's precisely pre-C99 compilers that are posing a
hazard to the current coding.
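
To make that concrete, a small illustration (not code from the patch):
for value = -7, C99 requires truncation toward zero, while a pre-C99
compiler was also permitted to round toward negative infinity, which
breaks any digit loop that assumes the remainder's sign.

#include <stdio.h>

int
main(void)
{
	int			value = -7;

	/* C99: prints "0 -7"; a pre-C99 compiler could legally print "-1 3" */
	printf("%d %d\n", value / 10, value % 10);
	return 0;
}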

FWIW, I find the code still pretty darn unsightly.  I think this change
is just wrong:

 * Avoid problems with the most negative integer not being representable
 * as a positive integer.
 */
-   if (value == INT32_MIN)
+   if (value == INT_MIN)
{
memcpy(a, "-2147483648", 12);

and even with INT32_MIN it was pretty silly, because there is exactly 0
hope of the code behaving sanely for some other value of the symbolic
constant.  I think it'd be much better to abandon the macros altogether
and write

if (value == (-2147483647-1))
{
memcpy(a, "-2147483648", 12);

Likewise for the int64 case, which BTW is no safer for pre-C99 compilers
than it was yesterday: LL is not the portable way to write int64
constants.
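
A sketch of the portable spellings (INT64CONST is the macro PostgreSQL
provides in c.h; the fallback #define below exists only so this fragment
compiles standalone, and itself uses LL):

#include <stdio.h>
#include <string.h>

#ifndef INT64CONST
#define INT64CONST(x) (x##LL)	/* stand-in; c.h picks the right suffix */
#endif

int
main(void)
{
	int			value = -2147483647 - 1;	/* INT32 min, no overflow */
	long long	value64 = -INT64CONST(9223372036854775807) - 1;
	char		a[32];

	if (value == (-2147483647 - 1))
		memcpy(a, "-2147483648", 12);		/* 11 chars plus NUL */

	printf("%s %lld\n", a, value64);
	return 0;
}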

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Custom code int(32|64) = text conversions out of performance reasons

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 10:38 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 The trouble with that approach is that you have to depend on the
 direction of rounding for negative quotients.  Which was unspecified
 before C99, and it's precisely pre-C99 compilers that are posing a
 hazard to the current coding.

Interesting.  I wondered whether there might be compilers out there
that handled that inconsistently, but then I thought I was probably
being paranoid.

 Likewise for the int64 case, which BTW is no safer for pre-C99 compilers
 than it was yesterday: LL is not the portable way to write int64
 constants.

Gah.  I wish we had some documentation of this stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Custom code int(32|64) = text conversions out of performance reasons

2010-11-20 Thread Tom Lane
BTW, while we're thinking about marginal improvements: instead of
constructing the string backwards and then reversing it in-place,
what about building it working backwards from the end of the buffer
and then memmove'ing it down to the start of the buffer?

I haven't tested this but it seems likely to be roughly a wash
speed-wise.  The reason I find the idea attractive is that it will
immediately expose any caller that is providing a buffer shorter
than the required length, whereas now such callers will appear to
work fine if they're only tested on small values.

A small downside is that pg_itoa would then need its own implementation
instead of just punting to pg_ltoa.
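
A sketch of that variant (buffer-length handling and the function name
are illustrative only):

#include <stdio.h>
#include <string.h>

/*
 * Generate digits right-justified at the end of the caller's buffer,
 * then memmove the result to the front.  A caller passing a too-short
 * buffer now fails visibly for *any* value, not just large ones.
 */
static void
ltoa_backwards_sketch(long value, char *buf, size_t buflen)
{
	char	   *p = buf + buflen - 1;
	int			neg = (value < 0);

	if (!neg)
		value = -value;			/* work in negative space, as upthread */

	*p = '\0';
	do
	{
		*--p = '0' + (char) (-(value % 10));
		value /= 10;
	} while (value != 0);
	if (neg)
		*--p = '-';

	memmove(buf, p, (size_t) (buf + buflen - p));	/* include the NUL */
}

int
main(void)
{
	char		buf[12];

	ltoa_backwards_sketch(-123456789L, buf, sizeof(buf));
	printf("%s\n", buf);		/* -123456789 */
	return 0;
}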

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] match_clause_to_indexcol()

2010-11-20 Thread Robert Haas
I was looking at KNNGIST some more today and found myself trying to
disentangle what match_clause_to_indexcol() is actually doing.  It
appears to me that the opfamily passed to that function is always the
same as index->opfamily[indexcol], which seems like needless
notational complexity.  Maybe I'm missing something, but the attached
patch seems to make things simpler and clearer.  Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 38b0930..7b8fd97 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -83,7 +83,7 @@ static PathClauseUsage *classify_index_clause_usage(Path *path,
 static void find_indexpath_quals(Path *bitmapqual, List **quals, List **preds);
 static int	find_list_position(Node *node, List **nodelist);
 static bool match_clause_to_indexcol(IndexOptInfo *index,
-		 int indexcol, Oid opfamily,
+		 int indexcol,
 		 RestrictInfo *rinfo,
 		 Relids outer_relids,
 		 SaOpControl saop_control);
@@ -1053,7 +1053,6 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 	List	   *clausegroup_list = NIL;
 	bool		found_outer_clause = false;
 	int			indexcol = 0;
-	Oid		   *families = index->opfamily;
 
 	*found_clause = false;		/* default result */
 
@@ -1062,7 +1061,6 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 
 	do
 	{
-		Oid			curFamily = families[0];
 		List	   *clausegroup = NIL;
 		ListCell   *l;
 
@@ -1074,7 +1072,6 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 			Assert(IsA(rinfo, RestrictInfo));
 			if (match_clause_to_indexcol(index,
 		 indexcol,
-		 curFamily,
 		 rinfo,
 		 outer_relids,
 		 saop_control))
@@ -1094,7 +1091,6 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 			Assert(IsA(rinfo, RestrictInfo));
 			if (match_clause_to_indexcol(index,
 		 indexcol,
-		 curFamily,
 		 rinfo,
 		 outer_relids,
 		 saop_control))
@@ -1113,9 +1109,8 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 		clausegroup_list = lappend(clausegroup_list, clausegroup);
 
 		indexcol++;
-		families++;
 
-	} while (!DoneMatchingIndexKeys(families));
+	} while (indexcol < index->ncolumns);
 
 	if (!*found_clause  !found_outer_clause)
 		return NIL;/* no indexable clauses anywhere */
@@ -1185,7 +1180,6 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 static bool
 match_clause_to_indexcol(IndexOptInfo *index,
 		 int indexcol,
-		 Oid opfamily,
 		 RestrictInfo *rinfo,
 		 Relids outer_relids,
 		 SaOpControl saop_control)
@@ -1196,6 +1190,7 @@ match_clause_to_indexcol(IndexOptInfo *index,
 	Relids		left_relids;
 	Relids		right_relids;
 	Oid			expr_op;
+	Oid			opfamily = index->opfamily[indexcol];
 	bool		plain_op;
 
 	/*
@@ -1582,23 +1577,18 @@ matches_any_index(RestrictInfo *rinfo, RelOptInfo *rel, Relids outer_relids)
 	{
 		IndexOptInfo *index = (IndexOptInfo *) lfirst(l);
 		int			indexcol = 0;
-		Oid		   *families = index->opfamily;
 
 		do
 		{
-			Oid			curFamily = families[0];
-
 			if (match_clause_to_indexcol(index,
 		 indexcol,
-		 curFamily,
 		 rinfo,
 		 outer_relids,
 		 SAOP_ALLOW))
 return true;
 
 			indexcol++;
-			families++;
-		} while (!DoneMatchingIndexKeys(families));
+		} while (indexcol < index->ncolumns);
 	}
 
 	return false;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] match_clause_to_indexcol()

2010-11-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I was looking at KNNGIST some more today and found myself trying to
 disentangle what match_clause_to_indexcol() is actually doing.  It
 appears to me that the opfamily passed to that function is always the
 same as index->opfamily[indexcol], which seems like needless
 notational complexity.  Maybe I'm missing something, but the attached
 patch seems to make things simpler and clearer.  Thoughts?

+1.  I think the existing coding dates from a time when we didn't have
IndexOptInfo at all, or at least didn't pass it around to all these
sub-functions, so there was no other path for getting at the info.

But if you're going to do that, get rid of DoneMatchingIndexKeys
altogether, along with the extra zero that plancat.c adds to the
opfamily array.  We don't need to be using more than one way to
iterate over those arrays.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Custom code int(32|64) => text conversions out of performance reasons

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 12:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, while we're thinking about marginal improvements: instead of
 constructing the string backwards and then reversing it in-place,
 what about building it working backwards from the end of the buffer
 and then memmove'ing it down to the start of the buffer?

 I haven't tested this but it seems likely to be roughly a wash
 speed-wise.  The reason I find the idea attractive is that it will
 immediately expose any caller that is providing a buffer shorter
 than the required length, whereas now such callers will appear to
 work fine if they're only tested on small values.

 A small downside is that pg_itoa would then need its own implementation
 instead of just punting to pg_ltoa.

I think that might be more clever than is really warranted.  I get
your point about buffer overrun, but I don't think it's that hard for
callers to do the right thing, so I'm inclined to think that's not
worth much in this case.  Of course, if memmove() can be implemented
as a single assembler instruction or something, that might be
appealing from a speed standpoint, but otherwise I think we may as
well stick with this.  There's less chance of needlessly touching an
extra cache line, less chance of being confused by leftover garbage in
memory after the end of the output string, and less duplicate code.

I had given some thought to whether it might make sense to try to
figure out how long the string will be before we actually start
generating it, so that we can just start in exactly the right space
and not have to clean up afterward.  But the obvious implementation seems
like it could be more expensive than just doing the copy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Custom code int(32|64) => text conversions out of performance reasons

2010-11-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, Nov 20, 2010 at 12:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 what about building it working backwards from the end of the buffer
 and then memmove'ing it down to the start of the buffer?

 I think that might be more clever than is really warranted.  I get
 your point about buffer overrun, but I don't think it's that hard for
 callers to do the right thing, so I'm inclined to think that's not
 worth much in this case.

Fair enough --- it was just a passing thought.

 I had given some thought to whether it might make sense to try to
 figure out how long the string will be before we actually start
 generating it, so that we can just start in exactly the right space
 and not have to clean up afterward.  But the obvious implementation seems
 like it could be more expensive than just doing the copy.

Yeah.  You certainly don't want to do the division sequence twice,
and a log() call wouldn't be cheap either, and there don't seem to
be many other alternatives.  If we were going to get picky about
avoiding the reverse step, I'd go with Andres' idea of changing
the API to pass back an address instead of guaranteeing that the
result begins at the start of the buffer.  But I think that's much
more complicated for callers than it's worth.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] match_clause_to_indexcol()

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 1:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I was looking at KNNGIST some more today and found myself trying to
 disentangle what match_clause_to_indexcol() is actually doing.  It
 appears to me that the opfamily passed to that function is always the
 same as index->opfamily[indexcol], which seems like needless
 notational complexity.  Maybe I'm missing something, but the attached
 patch seems to make things simpler and clearer.  Thoughts?

 +1.  I think the existing coding dates from a time when we didn't have
 IndexOptInfo at all, or at least didn't pass it around to all these
 sub-functions, so there was no other path for getting at the info.

 But if you're going to do that, get rid of DoneMatchingIndexKeys
 altogether,

Sure.  That's a giant crock.

 along with the extra zero that plancat.c adds to the
 opfamily array.  We don't need to be using more than one way to
 iterate over those arrays.

I am slightly worried that this might break third-party code, or even
some part of the core code that's still using the old system.  I don't
mind if you want to rip it out, but personally, I'd rather leave that
part well enough alone.  Revised patch attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 38b0930..482026d 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -37,11 +37,6 @@
 #include utils/selfuncs.h
 
 
-/*
- * DoneMatchingIndexKeys() - MACRO
- */
-#define DoneMatchingIndexKeys(families) (families[0] == InvalidOid)
-
 #define IsBooleanOpfamily(opfamily) \
 	((opfamily) == BOOL_BTREE_FAM_OID || (opfamily) == BOOL_HASH_FAM_OID)
 
@@ -83,7 +78,7 @@ static PathClauseUsage *classify_index_clause_usage(Path *path,
 static void find_indexpath_quals(Path *bitmapqual, List **quals, List **preds);
 static int	find_list_position(Node *node, List **nodelist);
 static bool match_clause_to_indexcol(IndexOptInfo *index,
-		 int indexcol, Oid opfamily,
+		 int indexcol,
 		 RestrictInfo *rinfo,
 		 Relids outer_relids,
 		 SaOpControl saop_control);
@@ -1053,7 +1048,6 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 	List	   *clausegroup_list = NIL;
 	bool		found_outer_clause = false;
 	int			indexcol = 0;
-	Oid		   *families = index->opfamily;
 
 	*found_clause = false;		/* default result */
 
@@ -1062,7 +1056,6 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 
 	do
 	{
-		Oid			curFamily = families[0];
 		List	   *clausegroup = NIL;
 		ListCell   *l;
 
@@ -1074,7 +1067,6 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 			Assert(IsA(rinfo, RestrictInfo));
 			if (match_clause_to_indexcol(index,
 		 indexcol,
-		 curFamily,
 		 rinfo,
 		 outer_relids,
 		 saop_control))
@@ -1094,7 +1086,6 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 			Assert(IsA(rinfo, RestrictInfo));
 			if (match_clause_to_indexcol(index,
 		 indexcol,
-		 curFamily,
 		 rinfo,
 		 outer_relids,
 		 saop_control))
@@ -1113,9 +1104,8 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 		clausegroup_list = lappend(clausegroup_list, clausegroup);
 
 		indexcol++;
-		families++;
 
-	} while (!DoneMatchingIndexKeys(families));
+	} while (indexcol < index->ncolumns);
 
 	if (!*found_clause  !found_outer_clause)
 		return NIL;/* no indexable clauses anywhere */
@@ -1185,7 +1175,6 @@ group_clauses_by_indexkey(IndexOptInfo *index,
 static bool
 match_clause_to_indexcol(IndexOptInfo *index,
 		 int indexcol,
-		 Oid opfamily,
 		 RestrictInfo *rinfo,
 		 Relids outer_relids,
 		 SaOpControl saop_control)
@@ -1196,6 +1185,7 @@ match_clause_to_indexcol(IndexOptInfo *index,
 	Relids		left_relids;
 	Relids		right_relids;
 	Oid			expr_op;
+	Oid			opfamily = index->opfamily[indexcol];
 	bool		plain_op;
 
 	/*
@@ -1582,23 +1572,18 @@ matches_any_index(RestrictInfo *rinfo, RelOptInfo *rel, Relids outer_relids)
 	{
 		IndexOptInfo *index = (IndexOptInfo *) lfirst(l);
 		int			indexcol = 0;
-		Oid		   *families = index->opfamily;
 
 		do
 		{
-			Oid			curFamily = families[0];
-
 			if (match_clause_to_indexcol(index,
 		 indexcol,
-		 curFamily,
 		 rinfo,
 		 outer_relids,
 		 SAOP_ALLOW))
 return true;
 
 			indexcol++;
-			families++;
-		} while (!DoneMatchingIndexKeys(families));
+		} while (indexcol < index->ncolumns);
 	}
 
 	return false;
@@ -1621,11 +1606,10 @@ eclass_matches_any_index(EquivalenceClass *ec, EquivalenceMember *em,
 	{
 		IndexOptInfo *index = (IndexOptInfo *) lfirst(l);
 		int			indexcol = 0;
-		Oid		   *families = index->opfamily;
 
 		do
 		{
-			Oid			curFamily = families[0];
+			Oid			curFamily = index->opfamily[indexcol];
 
 			/*
 			 * If it's a btree index, we can reject it if its opfamily isn't
@@ -1643,8 

Re: [HACKERS] match_clause_to_indexcol()

2010-11-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, Nov 20, 2010 at 1:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 But if you're going to do that, get rid of DoneMatchingIndexKeys
 altogether,

 Sure.  That's a giant crock.

 along with the extra zero that plancat.c adds to the
 opfamily array.  We don't need to be using more than one way to
 iterate over those arrays.

 I am slightly worried that this might break third-party code, or even
 some part of the core code that's still using the old system.  I don't
 mind if you want to rip it out, but personally, I'd rather leave that
 part well enough alone.  Revised patch attached.

I'll take the responsibility if you don't want to ;-)

I think your revised patch is incorrect, or at least not terribly safe,
to just remove the last DoneMatchingIndexKeys test and not replace it
with anything else.  Other than that it looks okay as far as it goes.
Do you want to commit it yourself, or shall I incorporate it in a more
extensive cleanup?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] match_clause_to_indexcol()

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 1:37 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sat, Nov 20, 2010 at 1:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 But if you're going to do that, get rid of DoneMatchingIndexKeys
 altogether,

 Sure.  That's a giant crock.

 along with the extra zero that plancat.c adds to the
 opfamily array.  We don't need to be using more than one way to
 iterate over those arrays.

 I am slightly worried that this might break third-party code, or even
 some part of the core code that's still using the old system.  I don't
 mind if you want to rip it out, but personally, I'd rather leave that
 part well enough alone.  Revised patch attached.

 I'll take the responsibility if you don't want to ;-)

Sold! :-)

 I think your revised patch is incorrect, or at least not terribly safe,
 to just remove the last DoneMatchingIndexKeys test and not replace it
 with anything else.  Other than that it looks okay as far as it goes.
 Do you want to commit it yourself, or shall I incorporate it in a more
 extensive cleanup?

If it's all right with you, I'll go ahead and commit this and then you
can break as much more stuff as you like.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] match_clause_to_indexcol()

2010-11-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, Nov 20, 2010 at 1:37 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think your revised patch is incorrect, or at least not terribly safe,
 to just remove the last DoneMatchingIndexKeys test and not replace it
 with anything else.

Oh, now that I look at it I notice the after-the-fact Assert claiming
that the clausegroup list isn't too long.  That can definitely be done
better.

 Do you want to commit it yourself, or shall I incorporate it in a more
 extensive cleanup?

 If it's all right with you, I'll go ahead and commit this and then you
 can break as much more stuff as you like.  :-)

Go for it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: What do these terms mean in the SOURCE CODE?

2010-11-20 Thread Robert Haas
On Fri, Nov 19, 2010 at 10:51 PM, Vaibhav Kaushal
vaibhavkaushal...@gmail.com wrote:
 Is no one ready to help on this? :(

Apparently, no one is ready to drop what they are doing to immediately
answer your email before all the other ones they need to answer, but I
wouldn't infer from that that no one wants to help.  It doesn't seem
very realistic to me to expect a 12-hour turnaround on free support,
but what do I know?

With regards to your question, for each type of plan node, there is an
associated plan state node.  This is an important distinction
because, IIUC, plans can be reused, so plan state contains the
information that might need to be reset on each run.  Scan state is
just a plan state node for a scan node.  I believe a tuple projection
is what you get when you do something like SELECT
generate_series(1,10) FROM tbl - the set-returning function has to be
projected in multiple tuples.  EState I could use some hints on
myself.  A qual is a filter condition, e.g. in SELECT * FROM tbl WHERE
x = 1, the x = 1 part is a qual.   It's helpful to grep src/include
for the structures in question; the information they contain often
helps to understand their purpose.
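
A grossly simplified sketch of that pairing (the real definitions, with
many more fields, live in src/include/nodes/plannodes.h and
src/include/nodes/execnodes.h):

/* read-only: produced by the planner, may be executed many times */
typedef struct Plan
{
	int			type;			/* seqscan, sort, nestloop, ... */
	struct Plan *lefttree;		/* child plan node */
} Plan;

/* per-execution: mirrors the plan tree's shape, reset on each run */
typedef struct PlanState
{
	Plan	   *plan;			/* back-link to the read-only node */
	struct PlanState *lefttree; /* child state node */
	/* tuple slots, expression state, counters, ... would follow */
} PlanState;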

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] match_clause_to_indexcol()

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 1:51 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 If it's all right with you, I'll go ahead and commit this and then you
 can break as much more stuff as you like.  :-)

 Go for it.

Your turn.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: What do these terms mean in the SOURCE CODE?

2010-11-20 Thread Vaibhav Kaushal
Thanks a lot for the answer. I see a lot of people discussing so many things
so I thought my email would have been ignored by those with a lot coming in
already. Thanks for the enlightenment.

-Vaibhav (*_*)


On Sun, Nov 21, 2010 at 12:25 AM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Nov 19, 2010 at 10:51 PM, Vaibhav Kaushal
 vaibhavkaushal...@gmail.com wrote:
  Is no one ready to help on this? :(

 Apparently, no one is ready to drop what they are doing to immediately
 answer your email before all the other ones they need to answer, but I
 wouldn't infer from that that no one wants to help.  It doesn't seem
 very realistic to me to expect a 12-hour turnaround on free support,
 but what do I know?

 With regards to your question, for each type of plan node, there is an
 associated plan state node.  This is an important distinction
 because, IIUC, plans can be reused, so plan state contains the
 information that might need to be reset on each run.  Scan state is
 just a plan state node for a scan node.  I believe a tuple projection
 is what you get when you do something like SELECT
 generate_series(1,10) FROM tbl - the set-returning function has to be
 projected in multiple tuples.  EState I could use some hints on
 myself.  A qual is a filter condition, e.g. in SELECT * FROM tbl WHERE
 x = 1, the x = 1 part is a qual.   It's helpful to grep src/include
 for the structures in question; the information they contain often
 helps to understand their purpose.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] Fwd: What do these terms mean in the SOURCE CODE?

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 1:59 PM, Vaibhav Kaushal
vaibhavkaushal...@gmail.com wrote:
 Thanks a lot for the answer. I see a lot of people discussing so many things
 so I thought my email would have been ignored by those with a lot coming in
 already. Thanks for the enlightenment.

Sure.  Just FYI, with these kinds of things, I do try to keep track of
them because I like to try to make sure that everyone gets an answer
back.  But, I don't always have time to do it immediately, and
sometimes I wait even if I do, if I think that someone more
knowledgeable on that particular topic might chime in.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)

2010-11-20 Thread Robert Haas
On Fri, Nov 19, 2010 at 5:59 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 But what about timings vs. random other stuff?  Like in this case
 there's a problem if the signal arrives before the memory update to
 latch->is_set becomes visible.  I don't know what we need to do to
 guarantee that.

 I don't believe there's an issue there.  A context swap into the kernel
 is certainly going to include msync.  If you're afraid otherwise, you
 could put an msync before the kill() call, but I think it's a waste of
 effort.

So what DO we need to guard against here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: What do these terms mean in the SOURCE CODE?

2010-11-20 Thread Vaibhav Kaushal
I sure do respect that. Appreciated. :)

-Vaibhav (*_*)

On Sun, Nov 21, 2010 at 12:33 AM, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Nov 20, 2010 at 1:59 PM, Vaibhav Kaushal
 vaibhavkaushal...@gmail.com wrote:
  Thanks a lot for the answer. I see a lot of people discussing so many
 things
  so I thought my email would have been ignored by those with a lot coming
 in
  already. Thanks for the enlightenment.

 Sure.  Just FYI, with these kinds of things, I do try to keep track of
 them because I like to try to make sure that everyone gets an answer
 back.  But, I don't always have time to do it immediately, and
 sometimes I wait even if I do, if I think that someone more
 knowledgeable on that particular topic might chime in.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Custom code int(32|64) => text conversions out of performance reasons

2010-11-20 Thread Andres Freund
On Saturday 20 November 2010 18:34:04 Tom Lane wrote:
 BTW, while we're thinking about marginal improvements: instead of
 constructing the string backwards and then reversing it in-place,
 what about building it working backwards from the end of the buffer
 and then memmove'ing it down to the start of the buffer?
 
 I haven't tested this but it seems likely to be roughly a wash
 speed-wise.  The reason I find the idea attractive is that it will
 immediately expose any caller that is providing a buffer shorter
 than the required length, whereas now such callers will appear to
 work fine if they're only tested on small values.
Tried that, the cost was measurable although not big (~3-5%)...

Greetings,

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Custom code int(32|64) => text conversions out of performance reasons

2010-11-20 Thread Andres Freund
On Saturday 20 November 2010 18:18:32 Robert Haas wrote:
  Likewise for the int64 case, which BTW is no safer for pre-C99 compilers
  than it was yesterday: LL is not the portable way to write int64
  constants.
 Gah.  I wish we had some documentation of this stuff.
Ditto. I started doing C-ish stuff quite a bit *after* C99 was mostly available 
in gcc...

Sorry btw, for not realizing those points (and the regression-expectation file) 
myself...

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [PATCH] Custom code int(32|64) => text conversions out of performance reasons

2010-11-20 Thread Greg Stark
On Sat, Nov 20, 2010 at 6:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sat, Nov 20, 2010 at 12:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 what about building it working backwards from the end of the buffer
 and then memmove'ing it down to the start of the buffer?

 I think that might be more clever than is really warranted.  I get
 your point about buffer overrun, but I don't think it's that hard for
 callers to do the right thing, so I'm inclined to think that's not
 worth much in this case.

It also seems wrong that a caller might happen to know that their
argument will never be more than n digits but still has to allocate a
buffer large enough to hold 2^64.




 Fair enough --- it was just a passing thought.

 I had given some thought to whether it might make sense to try to
 figure out how long the string will be before we actually start
 generating it, so that we can just start in exactly the right space
 and not have to clean up afterward.  But the obvious implementation seems
 like it could be more expensive than just doing the copy.

 Yeah.  You certainly don't want to do the division sequence twice,
 and a log() call wouldn't be cheap either, and there don't seem to
 be many other alternatives.

There are bit-twiddling hacks for computing log base 2. I'm not sure
it's worth worrying about to this degree though.
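
For example, the classic digit-count trick from the "Bit Twiddling
Hacks" collection (a sketch, not code from any patch here): take
floor(log2(v)) cheaply -- below via GCC's __builtin_clz -- scale it by
log10(2) ~= 1233/4096, and correct with one table lookup.

#include <stdint.h>
#include <stdio.h>

static unsigned
decimal_digits(uint32_t v)
{
	static const uint32_t powers[10] = {
		1, 10, 100, 1000, 10000, 100000,
		1000000, 10000000, 100000000, 1000000000
	};
	unsigned	t;

	if (v == 0)
		return 1;
	/* (floor(log2(v)) + 1) * 1233 / 4096 approximates floor(log10(v)) */
	t = (unsigned) (((31 - __builtin_clz(v)) + 1) * 1233) >> 12;
	return t + 1 - (v < powers[t]);
}

int
main(void)
{
	printf("%u %u %u\n",
		   decimal_digits(9), decimal_digits(10), decimal_digits(4294967295u));
	/* prints: 1 2 10 */
	return 0;
}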



-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Custom code int(32|64) => text conversions out of performance reasons

2010-11-20 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 On Sat, Nov 20, 2010 at 6:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 I had given some thought to whether it might make sense to try to
 figure out how long the string will be before we actually start
 generating it, so that we can just start in exactly the right space
 and not have to clean up afterward.  But the obvious implementation seems
 like it could be more expensive than just doing the copy.

 Yeah.  You certainly don't want to do the division sequence twice,
 and a log() call wouldn't be cheap either, and there don't seem to
 be many other alternatives.

 There are bit-twiddling hacks for computing log base 2. I'm not sure
 it's worth worrying about to this degree though.

I think converting log2 to log10 *exactly* would end up being not so
cheap, anyhow.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] duplicate connection failure messages

2010-11-20 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  I was wondering that.  I am unclear if we need it though --- can we not
  assume inet_ntop() exists on all systems?  We assumed inet_ntoa() did. 
 
 The Single Unix Spec includes inet_ntoa but not inet_ntop.
 
  Of course, the buildfarm will tell us.
 
 The buildfarm unfortunately contains only a subset of the platforms
 we care about.  I don't think this problem is large enough to justify
 taking a portability risk by depending on non-SUS library functions.
 
 If you want to do this, please do it as suggested previously, ie depend
 on the copy of the code we have internally.

I assume you are suggesting to use our inet_net_ntop() even if the
system has inet_ntop().

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: What do these terms mean in the SOURCE CODE?

2010-11-20 Thread Tom Lane
Some additional comments ...

Robert Haas robertmh...@gmail.com writes:
 With regards to your question, for each type of plan node, there is an
 associated plan state node.  This is an important distinction
 because, IIUC, plans can be reused, so plan state contains the
 information that might need to be reset on each run.

Yeah.  The plan tree is supposed to be read-only so far as the executor
is concerned, so it needs a parallel plan state tree to hold its
runtime state.

 I believe a tuple projection
 is what you get when you do something like SELECT
 generate_series(1,10) FROM tbl - the set-returning function has to be
 projected in multiple tuples.

I think projection is a term out of relational theory, referring to
any sort of tuple-by-tuple manipulation of the data.  For instance,
if you have a row a, b, c and you compute a, c+1 based on that,
that's a projection.  ExecProject() does this in the general case.
The business with SRFs in a targetlist possibly producing multiple
rows is a PostQUEL-ism that doesn't have any standard technical name
that I know of.

 A qual is a filter condition, e.g. in SELECT * FROM tbl WHERE
 x = 1, the x = 1 part is a qual.

qual = qualifier.  We use that term with various shades of meaning;
notably an indexqual is a qualifier that is useful for searching an
index.  Large parts of the planner also use the term clause to mean
about the same thing.

 It's helpful to grep src/include
 for the structures in question;

Here's my single biggest tip for newcomers to the Postgres source:
if you don't use ctags, glimpse, or some other tool that can quickly
show you all references to a given identifier, go out and get one.
It's one of the easiest ways to learn about things.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix for seg picksplit function

2010-11-20 Thread Yeb Havinga

On 2010-11-20 13:36, Yeb Havinga wrote:

On 2010-11-20 04:46, Robert Haas wrote:

Well, the problem with just comparing on < is that it takes very
little account of the upper bounds.  I think the cases where a simple
split would hurt you the most are those where examining the upper
bound is necessary to get a good split.
With the current 8K default blocksize, I put my money on the sorting 
algorithm for any seg case. The r-tree algorithm's performance is 
probably more influenced by large buckets -> low tree depth -> generic 
keys on non-leaf nodes.
To test this conjecture I compared a default 9.0.1 postgres (with 
debugging) to exactly the same postgres but with a 1K blocksize, using 
the test that Alexander posted upthread.


8K blocksize:
postgres=# create index seg_test_idx on seg_test using gist (a);
CREATE INDEX
Time: 99613.308 ms
SELECT
Total runtime: 81.482 ms

1K blocksize:
CREATE INDEX
Time: 40113.252 ms
SELECT
Total runtime: 3.363 ms

Details of explain analyze are below. The rowcount results are not 
exactly the same because I forgot to back up the first test, so I created 
new random data.
Though I didn't compare the sorting picksplit this way, I suspect that 
that algorithm won't be affected as much by the difference in blocksize.


regards,
Yeb Havinga


**   8K test
postgres=# \timing
Timing is on.
postgres=# create index seg_test_idx on seg_test using gist (a);
CREATE INDEX
Time: 99613.308 ms
postgres=# show block_size ;
 block_size

 8192
(1 row)

Time: 0.313 ms
postgres=# explain (buffers, analyze) select * from seg_test where a @ 
'0.5 .. 0.5'::seg;

 QUERY PLAN

---
-
 Bitmap Heap Scan on seg_test  (cost=44.32..2589.66 rows=1000 width=12) 
(actual time=91.061..91.304 rows=27 loops=1)

   Recheck Cond: (a @ '0.5'::seg)
   Buffers: shared hit=581 read=1729 written=298
   ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..44.07 rows=1000 
width=0) (actual time=91.029..91.029 rows=27 loop

s=1)
 Index Cond: (a @ '0.5'::seg)
 Buffers: shared hit=581 read=1702 written=297
 Total runtime: 91.792 ms
(7 rows)

Time: 309.687 ms
postgres=# explain (buffers, analyze) select * from seg_test where a @ 
'0.5 .. 0.5'::seg;

 QUERY PLAN

---
-
 Bitmap Heap Scan on seg_test  (cost=44.32..2589.66 rows=1000 width=12) 
(actual time=81.357..81.405 rows=27 loops=1)

   Recheck Cond: (a @ '0.5'::seg)
   Buffers: shared hit=1231 read=1079
   ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..44.07 rows=1000 
width=0) (actual time=81.337..81.337 rows=27 loop

s=1)
 Index Cond: (a @ '0.5'::seg)
 Buffers: shared hit=1204 read=1079
 Total runtime: 81.482 ms
(7 rows)

Time: 82.291 ms

**   1K test
postgres=# \timing
Timing is on.
postgres=# create index seg_test_idx on seg_test using gist (a);
CREATE INDEX
Time: 40113.252 ms
postgres=# explain (buffers, analyze) select * from seg_test where a @ 
'0.5 .. 0.5'::seg;

QUERY PLAN

---

 Bitmap Heap Scan on seg_test  (cost=278.66..3812.85 rows=1000 
width=12) (actual time=4.649..4.839 rows=34 loops=1)

   Recheck Cond: (a @ '0.5'::seg)
   Buffers: shared hit=221 read=385
   ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..278.41 rows=1000 
width=0) (actual time=4.620..4.620 rows=34 loops

=1)
 Index Cond: (a @ '0.5'::seg)
 Buffers: shared hit=221 read=351
 Total runtime: 4.979 ms
(7 rows)

Time: 6.217 ms
postgres=# explain (buffers, analyze) select * from seg_test where a @ 
'0.5 .. 0.5'::seg;

QUERY PLAN

---

 Bitmap Heap Scan on seg_test  (cost=278.66..3812.85 rows=1000 
width=12) (actual time=3.239..3.310 rows=34 loops=1)

   Recheck Cond: (a @ '0.5'::seg)
   Buffers: shared hit=606
   ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..278.41 rows=1000 
width=0) (actual time=3.219..3.219 rows=34 loops

=1)
 Index Cond: (a @ '0.5'::seg)
 Buffers: shared hit=572
 Total runtime: 3.363 ms
(7 rows)

Time: 4.063 ms
postgres=# show block_size;
 block_size

 1024
(1 row)

Time: 0.300 ms




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Latches with weak memory ordering (Re: [HACKERS] max_wal_senders must die)

2010-11-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So what DO we need to guard against here?

I think the general problem can be stated as process A changes two or
more values in shared memory in a fairly short span of time, and process
B, which is concurrently examining the same variables, sees those
changes occur in a different order than A thought it made them in.

In practice we do not need to worry about changes made with a kernel
call in between, as any sort of context swap will cause the kernel to
force cache synchronization.

Also, the intention is that the locking primitives will take care of
this for any shared structures that are protected by a lock.  (There
were some comments upthread suggesting maybe our lock code is not
bulletproof; but if so that's something to fix in the lock code, not
a logic error in code using the locks.)

So what this boils down to is being an issue for shared data structures
that we access without using locks.  As, for example, the latch
structures.

The other case that I can think of offhand is the signal multiplexing
flags.  I think we're all right there so far as the flags themselves are
concerned because only one atomic update is involved on each side:
there's no possibility of inconsistency due to cache visibility skew.
But we'd be at some risk if we were using any such flag as a cue to go
look at some other shared-memory state.
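
As a sketch of the lock-free pattern at issue (using GCC's
__sync_synchronize() as a stand-in full barrier, since PostgreSQL had
no barrier primitives of its own at this point; names are illustrative):

#include <signal.h>
#include <sys/types.h>

typedef struct
{
	volatile int is_set;
	volatile int payload;		/* the "other shared-memory state" */
} SharedFlag;

static void
writer(SharedFlag *f, pid_t reader_pid)
{
	f->payload = 42;			/* publish the data first */
	__sync_synchronize();		/* fence: data visible before the flag */
	f->is_set = 1;
	kill(reader_pid, SIGUSR1);	/* kernel entry also synchronizes, per above */
}

static int
reader(SharedFlag *f)
{
	if (f->is_set)
	{
		__sync_synchronize();	/* pair the fence on the read side */
		return f->payload;		/* now guaranteed to see 42 */
	}
	return -1;
}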

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix for seg picksplit function

2010-11-20 Thread Yeb Havinga

On 2010-11-20 21:57, Yeb Havinga wrote:
8K blocksize:

postgres=# create index seg_test_idx on seg_test using gist (a);
CREATE INDEX
Time: 99613.308 ms
SELECT
Total runtime: 81.482 ms

1K blocksize:
CREATE INDEX
Time: 40113.252 ms
SELECT
Total runtime: 3.363 ms

Details of explain analyze are below. The rowcount results are not 
exactly the same because I forgot to back up the first test, so I created 
new random data.
Though I didn't compare the sorting picksplit this way, I suspect that 
that algorithm won't be affected as much by the difference in blocksize.
Here are the results for a 1K blocksize (debug enabled) and Alexander's 
latest (0.5) patch.


postgres=# create index seg_test_idx on seg_test using gist (a);
CREATE INDEX
Time: 37373.398 ms
postgres=# explain (buffers, analyze) select * from seg_test where a @ 
'0.5 .. 0.5'::seg;

QUERY PLAN
---
 Bitmap Heap Scan on seg_test  (cost=209.97..3744.16 rows=1000 
width=12) (actual time=0.091..0.283 rows=34 loops=1)

   Recheck Cond: (a @ '0.5'::seg)
   Buffers: shared hit=6 read=35
   ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..209.72 rows=1000 
width=0) (actual time=0.071..0.071 rows=34 loops=1)

 Index Cond: (a @ '0.5'::seg)
 Buffers: shared hit=6 read=1
 Total runtime: 0.392 ms
(7 rows)

Time: 1.798 ms
postgres=# explain (buffers, analyze) select * from seg_test where a @ 
'0.5 .. 0.5'::seg;

QUERY PLAN
---
 Bitmap Heap Scan on seg_test  (cost=209.97..3744.16 rows=1000 
width=12) (actual time=0.087..0.160 rows=34 loops=1)

   Recheck Cond: (a @ '0.5'::seg)
   Buffers: shared hit=41
   ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..209.72 rows=1000 
width=0) (actual time=0.068..0.068 rows=34 loops=1)

 Index Cond: (a @ '0.5'::seg)
 Buffers: shared hit=7
 Total runtime: 0.213 ms
(7 rows)

Time: 0.827 ms



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] duplicate connection failure messages

2010-11-20 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 I assume you are suggesting to use our inet_net_ntop() even if the
 system has inet_ntop().

If you're going to have code to do the former, it doesn't seem to be
worth the trouble to also have code that does the latter ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: What do these terms mean in the SOURCE CODE?

2010-11-20 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Here's my single biggest tip for newcomers to the Postgres source:
 if you don't use ctags, glimpse, or some other tool that can quickly
 show you all references to a given identifier, go out and get one.
 It's one of the easiest ways to learn about things.

I've been using cscope (on advice from Joshua Tolley), and even
better its integration into Emacs which is called xcscope.el --- I
wouldn't have been able to come up with the extension patch series
without that.

  http://cscope.sourceforge.net/

And as a quick teaser, the keys I mostly use:

  C-c s d   cscope-find-global-definition
  C-c s s   cscope-find-this-symbol
  C-c s f   cscope-find-this-file
  C-c s I   cscope-index-files

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] duplicate connection failure messages

2010-11-20 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  I assume you are suggesting to use our inet_net_ntop() even if the
  system has inet_ntop().
 
 If you're going to have code to do the former, it doesn't seem to be
 worth the trouble to also have code that does the latter ...

OK, we will not call inet_ntop() at all.  I moved the CIDR part of
adt/inet_net_ntop.c into adt/inet_cidr_ntop.c, and moved the remaining
net part to /port/inet_net_ntop.c.

I then changed all uses of inet_ntoa to use inet_net_ntop().  While this
churn would perhaps not be warranted just to allow for better error
messages, I found pg_getaddrinfo_all() being called from
libpq::connectDBStart(), which makes it not thread-safe.  I am not
excited about backpatching it but it is a threading bug.

The output is as expected:

$ psql -h localhost test
psql: could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?
$ psql -h 127.0.0.1 test
psql: could not connect to server: Connection refused
        Is the server running on host "127.0.0.1" and accepting
        TCP/IP connections on port 5432?

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB                           http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index a911c50..814c27a 100644
*** /tmp/pgrevert.8208/Z2NPcb_libpq.sgml	Sat Nov 20 18:06:46 2010
--- doc/src/sgml/libpq.sgml	Sat Nov 20 17:11:04 2010
*** PGconn *PQconnectdbParams(const char **k
*** 170,180 ****
     If <literal>host</> is specified without <literal>hostaddr</>,
     a host name lookup occurs.
     If <literal>hostaddr</> is specified without <literal>host</>,
!    the value for <literal>hostaddr</> gives the server address.
     The connection attempt will fail in any of the cases where a
     host name is required.
     If both <literal>host</> and <literal>hostaddr</> are specified,
!    the value for <literal>hostaddr</> gives the server address.
     The value for <literal>host</> is ignored unless needed for
     authentication or verification purposes, in which case it will be
     used as the host name.  Note that authentication is likely to fail
--- 170,180 ----
     If <literal>host</> is specified without <literal>hostaddr</>,
     a host name lookup occurs.
     If <literal>hostaddr</> is specified without <literal>host</>,
!    the value for <literal>hostaddr</> gives the server network address.
     The connection attempt will fail in any of the cases where a
     host name is required.
     If both <literal>host</> and <literal>hostaddr</> are specified,
!    the value for <literal>hostaddr</> gives the server network address.
     The value for <literal>host</> is ignored unless needed for
     authentication or verification purposes, in which case it will be
     used as the host name.  Note that authentication is likely to fail
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index be272b5..ce28abd 100644
*** /tmp/pgrevert.8208/zxOcga_Makefile	Sat Nov 20 18:06:46 2010
--- src/backend/utils/adt/Makefile	Sat Nov 20 17:11:04 2010
*** OBJS = acl.o arrayfuncs.o array_userfunc
*** 23,29 ****
  	oid.o oracle_compat.o pseudotypes.o rowtypes.o \
  	regexp.o regproc.o ruleutils.o selfuncs.o \
  	tid.o timestamp.o varbit.o varchar.o varlena.o version.o xid.o \
! 	network.o mac.o inet_net_ntop.o inet_net_pton.o \
  	ri_triggers.o pg_lzcompress.o pg_locale.o formatting.o \
  	ascii.o quote.o pgstatfuncs.o encode.o dbsize.o genfile.o trigfuncs.o \
  	tsginidx.o tsgistidx.o tsquery.o tsquery_cleanup.o tsquery_gist.o \
--- 23,29 ----
  	oid.o oracle_compat.o pseudotypes.o rowtypes.o \
  	regexp.o regproc.o ruleutils.o selfuncs.o \
  	tid.o timestamp.o varbit.o varchar.o varlena.o version.o xid.o \
! 	network.o mac.o inet_cidr_ntop.o inet_net_pton.o \
  	ri_triggers.o pg_lzcompress.o pg_locale.o formatting.o \
  	ascii.o quote.o pgstatfuncs.o encode.o dbsize.o genfile.o trigfuncs.o \
  	tsginidx.o tsgistidx.o tsquery.o tsquery_cleanup.o tsquery_gist.o \
diff --git a/src/include/port.h b/src/include/port.h
index 2048768..0cfbed8 100644
*** /tmp/pgrevert.8208/DZNlKd_port.h	Sat Nov 20 18:06:46 2010
--- src/include/port.h	Sat Nov 20 17:11:42 2010
*** extern void qsort_arg(void *base, size_t
*** 437,440 ****
--- 437,444 ----
  /* port/chklocale.c */
  extern int	pg_get_encoding_from_locale(const char *ctype);
  
+ /* port/inet_net_ntop.c */
+ extern char *inet_net_ntop(int af, const void *src, int bits,
+ 			  char *dst, size_t size);
+ 
  #endif   /* PG_PORT_H */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index ae267ab..88eac70 

Re: [HACKERS] Spread checkpoint sync

2010-11-20 Thread Jeff Janes
On Mon, Nov 15, 2010 at 6:15 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Nov 14, 2010 at 6:48 PM, Greg Smith g...@2ndquadrant.com wrote:
 The second issue is that the delay between sync calls is currently
 hard-coded, at 3 seconds.  I believe the right path here is to consider the
 current checkpoint_completion_target to still be valid, then work back from
 there.  That raises the question of what percentage of the time writes
 should now be compressed into relative to that, to leave some time to spread
 the sync calls.  If we're willing to say writes finish in first 1/2 of
 target, syncs execute in second 1/2, then I could implement that here.
  Maybe that ratio needs to be another tunable.  Still thinking about that
 part, and it's certainly open to community debate.

I would speculate that the answer is likely to be nearly binary.  The
best option would either be to do the writes as fast as possible and
spread out the fsyncs, or spread out the writes and do the fsyncs as
fast as possible, depending on how the system is set up.


 The thing to realize, and what complicates the design, is that the actual
 sync execution may take a considerable period of time.  It's much more
 likely for that to happen than in the case of an individual write, as the
 current spread checkpoint does, because writes are usually cached.  In the
 spread sync case, it's easy for one slow sync to make the rest turn into
 ones that fire in quick succession, to make up for lost time.

 I think the behavior of file systems and operating systems is highly
 relevant here.  We seem to have a theory that allowing a delay between
 the write and the fsync should give the OS a chance to start writing
 the data out,

I thought that the theory was that doing too many fsyncs in short order
can lead to some kind of starvation of other IO.

If the theory is that we want to wait between writes and fsyncs, then
the current behavior is probably the best: spreading out the writes
and then doing all the syncs at the end gives the longest delay
between an average write and the sync of the file it was written to.  Or,
spread the writes out over 150 seconds, sleep for 140 seconds, then do
the fsyncs.  But I don't think that that is the theory.


 but do we have any evidence indicating whether and under
 what circumstances that actually occurs?  For example, if we knew that
 it's important to wait at least 30 s but waiting 60 s is no better,
 that would be useful information.

 Another question I have is about how we're actually going to know when
 any given fsync can be performed.  For any given segment, there are a
 certain number of pages A that are already dirty at the start of the
 checkpoint.

Dirty in the shared pool, or dirty in the OS cache?

 Then there are a certain number of additional pages B
 that are going to be written out during the checkpoint.  If it so
 happens that B = 0, we can call fsync() at the beginning of the
 checkpoint without losing anything (in fact, we gain something: any
 pages dirtied by cleaning scans or backend writes during the
 checkpoint won't need to hit the disk;

Aren't those pages written out by cleaning scans and backend writes
while the checkpoint is occurring exactly what you defined to be page
set B, and then to be zero?

 and if the filesystem dumps
 more of its cache than necessary on fsync, we may as well take that
 hit before dirtying a bunch more stuff).  But if B > 0, then we shouldn't
 attempt the fsync() until we've written them all; otherwise we'll end
 up having to fsync() that segment twice.

 Doing all the writes and then all the fsyncs meets this requirement
 trivially, but I'm not so sure that's a good idea.  For example, given
 files F1 ... Fn with dirty pages needing checkpoint writes, we could
 do the following: first, do any pending fsyncs for files not among F1
 .. Fn; then, write all pages for F1 and fsync, write all pages for F2
 and fsync, write all pages for F3 and fsync, etc.  This might seem
 dumb because we're not really giving the OS a chance to write anything
 out before we fsync, but think about the ext3 case where the whole
 filesystem cache gets flushed anyway.  It's much better to dump the
 cache at the beginning of the checkpoint and then again after every
 file than it is to spew many GB of dirty stuff into the cache and then
 drop the hammer.

But the kernel has knobs to prevent that from happening.
dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
is supposed to do a journal commit every 5 seconds under default mount
conditions.

Cheers,

Jeff



Re: [HACKERS] knngist - 0.8

2010-11-20 Thread Robert Haas
2010/11/12 Teodor Sigaev teo...@sigaev.ru:
 Slightly-more-fleshed out proof of concept patch attached, with actual
 syntax, documentation, and pg_dump support added.  This might be
 thought of as a subset of the builtin_knngist_core patch, without the
 parts that make it actually do something useful (which is mostly
 match_pathkey_to_index - which I'm still rather hoping to abstract in
 some way via the access method interface, though I'm currently unsure
 what the best way to do that is).

 I don't see in your patch how to propagate knowledge of the kind of operation
 (AMOP_SEARCH or AMOP_ORDER) to GiST and the consistent method. To both of them
 the two kinds are indistinguishable. That's not acceptable for either, because
 GiST needs to choose the right traversal algorithm, and consistentFn needs that
 role to decide whether to return an infinity or a false (-1) value.

My version of the patch is suffering from a bad case of not being
done.  It sort of started out as a proof-of-concept to show what the
syntax might look like and seems to be gradually growing into
something more real.  I'm not sure what we'll end up with.

 My variant informs GiST via SK_ORDER flags, and consistentFn looks at the
 strategy number (strategy numbers are different for different purposes).

Yeah.  At ten thousand feet, I think the open design question here is
to what extent it's OK to rely on the fact that the ORDER BY clauses
we wish to optimize happen to look a lot like the WHERE clauses we
already know how to optimize: namely, they're both binary opclauses of
the form "indexed-column op constant".  Your patch manages to
reuse a LOT of existing machinery by shoving ordering expressions
through the same code paths that quals take.  Code reuse is generally
a good thing, but here we're forming RestrictInfo and ScanKey
objects out of things that are neither restrictions nor keys, which
might lead to maintainability problems down the road.  I'd like to get
some input from Tom on how he feels about that, and any alternatives
he sees.

It seems to me that our concept of ScanDirection is really woefully
under-expressive.  For example, given:

CREATE TABLE foo (
id integer NOT NULL,
name character varying NOT NULL,
PRIMARY KEY (id)
);

We use the index for the first of these but not the second:

select * from foo order by id nulls last;
select * from foo order by id nulls first;

In an ideal world, we'd like to handle the second one by finding the
first non-NULL entry in the index, scanning away from the NULLs, and
then, when we run off the end, looping back around to spit out the
NULL entries.  Or, similarly, if we have an index on (id, name), we
can handle the first two of the following with the index, but not the
last two:

select * from foo order by id, name;
select * from foo order by id desc, name desc;
select * from foo order by id, name desc;
select * from foo order by id, name nulls last;
(can't handle this last one even if name is marked NOT NULL!)
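
A quick way to check which of these the planner will handle with the
index (a sketch; it assumes the (id, name) index mentioned above
actually exists, and that the table is big enough that an explicit sort
isn't simply cheaper):

  SET enable_seqscan = off;  -- just for illustration
  EXPLAIN SELECT * FROM foo ORDER BY id, name;       -- index scan
  EXPLAIN SELECT * FROM foo ORDER BY id, name DESC;  -- sort atop a scan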

It seems like it might even be possible to handle these (albeit maybe
not real efficiently) via a sufficiently complicated scanning order,
but even if we had the code to do the scan, we have no way of telling
the index what scan direction we want: forward, backward, and no
movement don't really cut it.  The idea that forward and backward mean
anything at all is really pretty btree-centric.

The trick, of course, is to come up with something better.  I have a
vague notion of using something like an array or list of objects of
the form (index column, asc/desc, nulls first/last, value), where the value
is null except for KNNGIST.  That could easily be extended with collation or
whatever as needed.  The trick is - you need to create this object and
then ask the index - hey, is this something you can support?  And the
index needs to then respond by saying whether it can do that (and
maybe doing some sort of translation into what instructions should be
provided at execution time).

Trouble is, that sounds like a lot of work on code I'm not altogether
comfortable with, and I'd really like to get KNNGIST in shape for 9.1
if possible.  I'm not really sure where to draw the line between, on the
one hand, not wanting to commit something that is excessively crufty
and, on the other hand, wanting to finish in finite time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Latches with weak memory ordering (Re: max_wal_senders must die)

2010-11-20 Thread Greg Stark
On Sat, Nov 20, 2010 at 9:07 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In practice we do not need to worry about changes made with a kernel
 call in between, as any sort of context swap will cause the kernel to
 force cache synchronization.


Note that not all kernel calls are equal these days. Some
(gettimeofday on Linux) are implemented as very lightweight calls that
don't do memory map changes and might not trigger the kinds of
synchronization you're talking about. Any IPC kernel calls like kill
will, though, I'm sure.

-- 
greg



Re: [HACKERS] Spread checkpoint sync

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 The thing to realize, and what complicates the design, is that the actual
 sync execution may take a considerable period of time.  It's much more
 likely for that to happen than in the case of an individual write, as the
 current spread checkpoint does, because writes are usually cached.  In the
 spread sync case, it's easy for one slow sync to make the rest turn into
 ones that fire in quick succession, to make up for lost time.

 I think the behavior of file systems and operating systems is highly
 relevant here.  We seem to have a theory that allowing a delay between
 the write and the fsync should give the OS a chance to start writing
 the data out,

 I thought that the theory was that doing too many fsyncs in short order
 can lead to some kind of starvation of other IO.

 If the theory is that we want to wait between writes and fsyncs, then
 the current behavior is probably the best: spreading out the writes
 and then doing all the syncs at the end gives the longest delay
 between an average write and the sync of the file it was written to.  Or,
 spread the writes out over 150 seconds, sleep for 140 seconds, then do
 the fsyncs.  But I don't think that that is the theory.

Well, I've heard Bruce and, I think, possibly also Greg talk about
wanting to wait after doing the writes in the hopes that the kernel
will start to flush the dirty pages, but I'm wondering whether it
wouldn't be better to just give up on that and do: small batch of
writes - fsync those writes - another small batch of writes - fsync
that batch - etc.

 but do we have any evidence indicating whether and under
 what circumstances that actually occurs?  For example, if we knew that
 it's important to wait at least 30 s but waiting 60 s is no better,
 that would be useful information.

 Another question I have is about how we're actually going to know when
 any given fsync can be performed.  For any given segment, there are a
 certain number of pages A that are already dirty at the start of the
 checkpoint.

 Dirty in the shared pool, or dirty in the OS cache?

OS cache, sorry.

 Then there are a certain number of additional pages B
 that are going to be written out during the checkpoint.  If it so
 happens that B = 0, we can call fsync() at the beginning of the
 checkpoint without losing anything (in fact, we gain something: any
 pages dirtied by cleaning scans or backend writes during the
 checkpoint won't need to hit the disk;

 Aren't those pages written out by cleaning scans and backend writes
 while the checkpoint is occurring exactly what you defined to be page
 set B, and then to be zero?

No, sorry, I'm referring to cases where all the dirty pages in a
segment have been written out to the OS but we have not yet issued the
necessary fsync.

 and if the filesystem dumps
 more of its cache than necessary on fsync, we may as well take that
 hit before dirtying a bunch more stuff).  But if B > 0, then we shouldn't
 attempt the fsync() until we've written them all; otherwise we'll end
 up having to fsync() that segment twice.

 Doing all the writes and then all the fsyncs meets this requirement
 trivially, but I'm not so sure that's a good idea.  For example, given
 files F1 ... Fn with dirty pages needing checkpoint writes, we could
 do the following: first, do any pending fsyncs for files not among F1
 .. Fn; then, write all pages for F1 and fsync, write all pages for F2
 and fsync, write all pages for F3 and fsync, etc.  This might seem
 dumb because we're not really giving the OS a chance to write anything
 out before we fsync, but think about the ext3 case where the whole
 filesystem cache gets flushed anyway.  It's much better to dump the
 cache at the beginning of the checkpoint and then again after every
 file than it is to spew many GB of dirty stuff into the cache and then
 drop the hammer.

 But the kernel has knobs to prevent that from happening.
 dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
 kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
 is supposed to do a journal commit every 5 seconds under default mount
 conditions.

I don't know in detail.  dirty_expire_centisecs sounds useful; I think
the problem with dirty_background_ratio and dirty_ratio is that the
default ratios are large enough that on systems with a huge pile of
memory, they allow more dirty data to accumulate than can be flushed
without causing an I/O storm.  I believe Greg Smith made a comment
along the lines of: memory sizes are growing faster than I/O speeds;
therefore a ratio that is OK for a low-end system with a modest amount
of memory causes problems on a high-end system that has faster I/O but
MUCH more memory.

As a kernel developer, I suspect the tendency is to try to set the
ratio so that you keep enough free memory around to service future
allocation requests.  Optimizing for possible future fsyncs is
probably not the top priority...


Re: [HACKERS] Spread checkpoint sync

2010-11-20 Thread Jeff Janes
On Sat, Nov 20, 2010 at 5:17 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 Doing all the writes and then all the fsyncs meets this requirement
 trivially, but I'm not so sure that's a good idea.  For example, given
 files F1 ... Fn with dirty pages needing checkpoint writes, we could
 do the following: first, do any pending fsyncs for files not among F1
 .. Fn; then, write all pages for F1 and fsync, write all pages for F2
 and fsync, write all pages for F3 and fsync, etc.  This might seem
 dumb because we're not really giving the OS a chance to write anything
 out before we fsync, but think about the ext3 case where the whole
 filesystem cache gets flushed anyway.  It's much better to dump the
 cache at the beginning of the checkpoint and then again after every
 file than it is to spew many GB of dirty stuff into the cache and then
 drop the hammer.

 But the kernel has knobs to prevent that from happening.
 dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
 kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
 is supposed to do a journal commit every 5 seconds under default mount
 conditions.

 I don't know in detail.  dirty_expire_centisecs sounds useful; I think
 the problem with dirty_background_ratio and dirty_ratio is that the
 default ratios are large enough that on systems with a huge pile of
 memory, they allow more dirty data to accumulate than can be flushed
 without causing an I/O storm.

True, but I think that changing these from their defaults is not
considered to be a dark art reserved for kernel hackers; i.e., they are
something that sysadmins are expected to tweak to suit their
workload, just like shmmax and such.  And for very large memory
systems, even 1% may be too much to cache (dirty*_ratio can only be
set in integer percent points), so recent kernels introduced
dirty*_bytes parameters.  I like these better because they do what
they say.  With the dirty*_ratio, I could never figure out what it was
a ratio of, and the results were unpredictable without extensive
experimentation.

 I believe Greg Smith made a comment
 along the lines of: memory sizes are growing faster than I/O speeds;
 therefore a ratio that is OK for a low-end system with a modest amount
 of memory causes problems on a high-end system that has faster I/O but
 MUCH more memory.

Yes, but how much work do we want to put into redoing the checkpoint
logic so that the sysadmin on a particular OS and configuration and FS
can avoid having to change the kernel parameters away from their
defaults?  (Assuming of course I am correctly understanding the
problem, always a dangerous assumption.)

Some experiments I have just done show that dirty_expire_centisecs
does not seem reliable on ext3, but the dirty*_ratio and dirty*_bytes
seem reliable on ext2, ext3, and ext4.

But that may not apply to RAID; I don't have one I can test on.


Cheers,

Jeff



Re: [HACKERS] UNNEST ... WITH ORDINALITY (AND POSSIBLY OTHER STUFF)

2010-11-20 Thread David Fetter
On Sat, Nov 20, 2010 at 01:54:32PM +0900, Itagaki Takahiro wrote:
 On Sat, Nov 20, 2010 at 03:48,  caleb.wel...@emc.com wrote:
  Note the standard also supports unnesting multiple arrays concurrently; the
  rule for handling arrays with different lengths is to use null padding of 
  the shorter array.
 
    UNNEST( ARRAY[5,2,3,4],
            ARRAY['hello', 'world'] )
    WITH ORDINALITY AS t(a,b,i);
 
 Hmmm, that means we cannot support multi-array unnest() with our
 generic aggregate functions. The function prototype might look like the
 one below, but we don't support such a definition.
 
   unnest(anyarray1, anyarray2, ...,
          OUT anyelement1, OUT anyelement2, ...)
   RETURNS SETOF record
 
 So, we would need a special representation for multi-array unnest().

Using bits we already have, I came up with a way to do the things
UNNEST(multiple, arrays, here) WITH ORDINALITY does.  At least in
theory, this is a matter of silently engaging the rewrite rule system:

\set foo ARRAY[1,2,4,8]
\set bar ARRAY['Here','is','some','odd','text','of','a','different','length']
\set baz ARRAY['Here','is','yet','more','text']

WITH x AS (
SELECT row_number() OVER () i, foo
FROM UNNEST(:foo) foo
),
y AS (
SELECT row_number() OVER () i, bar
FROM UNNEST(:bar) bar
),
z AS (
SELECT row_number() OVER () i, baz
FROM UNNEST(:baz) baz
)
SELECT * FROM x FULL JOIN y USING(i) FULL JOIN z USING(i);

 i | foo |    bar    | baz
---+-----+-----------+------
 1 |   1 | Here      | Here
 2 |   2 | is        | is
 3 |   4 | some      | yet
 4 |   8 | odd       | more
 5 |     | text      | text
 6 |     | of        |
 7 |     | a         |
 8 |     | different |
 9 |     | length    |
(9 rows)
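
Another spelling of the same idea: plain array subscripting plus
generate_series gives the ordinality and the null padding directly (a
sketch, reusing the :foo/:bar/:baz variables from above; out-of-range
subscripts simply yield NULL, which is exactly the standard's padding
rule for the shorter arrays):

  SELECT i,
         (:foo)[i] AS foo,
         (:bar)[i] AS bar,
         (:baz)[i] AS baz
  FROM generate_series(1,
           greatest(array_upper(:foo, 1),
                    array_upper(:bar, 1),
                    array_upper(:baz, 1))) AS i;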

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fwd: patch: format function - fixed oid

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 7:29 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 2010/11/20 Robert Haas robertmh...@gmail.com:
 On Sat, Nov 20, 2010 at 2:59 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 I think you are correct. Who will submit this final version? You or
 me?

 I've got it.


 Thank you

 have a nice day

OK, I've committed this, after a fairly heavy rewrite.
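
For the archives, the committed behavior in a nutshell (a sketch of
mine, not the commit message; see the docs for the authoritative
description):

  SELECT format('%s curve', 'bell');           -- bell curve
  SELECT format('DROP TABLE %I', 'My Table');  -- DROP TABLE "My Table"
  SELECT format('%L', NULL);                   -- NULL (unquoted, safe for dynamic SQL)
  SELECT format('%1$I.%1$I', 'dup');           -- "dup"."dup" (n$ positional reference)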

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] ALTER OBJECT any_name SET SCHEMA name

2010-11-20 Thread Robert Haas
On Thu, Nov 4, 2010 at 3:39 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Tom Lane t...@sss.pgh.pa.us writes:
 Not having read the patch, but ... the idea that was in the back of
 my mind was to have a generic AlterObjectNamespace function that
 would take parameters approximately like the following:

 Please find attached what I came up with, that's the set_schema patch
 version 6.

Looks good overall, but:

In AlterObjectNamespace(), you reference ACL_KIND_CONVERSION, which I
suspect actually needs to be yet another parameter to that function.
I've had the thought before that we have a deplorable number of
different ways of referring to object types in the back end:
ACL_KIND_*, OCLASS_*, OBJECT_*, and class IDs.  We should maybe look
at unifying some or all of those.

getObjectDescriptionOids() needs a prototype somewhere.

And probably most significantly, you need to add docs and regression tests.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] ALTER OBJECT any_name SET SCHEMA name

2010-11-20 Thread Robert Haas
On Sat, Nov 20, 2010 at 11:22 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Nov 4, 2010 at 3:39 PM, Dimitri Fontaine dimi...@2ndquadrant.fr 
 wrote:
 Tom Lane t...@sss.pgh.pa.us writes:
 Not having read the patch, but ... the idea that was in the back of
 my mind was to have a generic AlterObjectNamespace function that
 would take parameters approximately like the following:

 Please find attached what I came up with, that's the set_schema patch
 version 6.

 Looks good overall, but:

 In AlterObjectNamespace(), you reference ACL_KIND_CONVERSION, which I
 suspect actually needs to be yet another parameter to that function.
 I've had the thought before that we have a deplorable number of
 different ways of referring to object types in the back end:
 ACL_KIND_*, OCLASS_*, OBJECT_*, and class IDs.  We should maybe look
 at unifying some or all of those.

 getObjectDescriptionOids() needs a prototype somewhere.

 And probably most significantly, you need to add docs and regression tests.

Ah, nuts.  I see now there's a v7.  Never mind...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Fwd: What do these terms mean in the SOURCE CODE?

2010-11-20 Thread Vaibhav Kaushal
Thanks for your concern about my question. I am really happy to get the
answers.

@Mr. Tom Lane: Thanks for your explanation, but I am already using Eclipse,
which gets me to the definitions really easily. I do not post my questions on
the Hackers forum with any intention of disturbing anyone; I just ask for help
when I am not able to understand it all. PG is my first attempt at OSS
development, so it is particularly difficult. Of course PG gives us great
support, because almost everything has a comment elaborating its use /
importance. However, since I don't yet know everything about PG, or even about
OSS development, I come here asking. I have never been in so much code with
so many data structures and typedefs; it gets confusing very easily. This is
the reason I came here asking.

I ask here after I have done my homework, and after I get tangled up in the
code. Moreover, the comments are those of the developers, so when someone says
that an expression does not match a qual in the code, it is not as clear to
me as it might be to you, because you know the terms and I don't. Thanks to
all of you for the support I have been getting here. I will try to ask as
little as possible from now on.

@Mr. Dimitri Fontaine: Thanks for telling me the name of the tool, but I
like GUI interfaces more, and Eclipse has been serving me well for browsing
(only) the source code. I do not fear the command line / shell, but I prefer
a GUI. Thanks for your answer.

-Vaibhav (*_*)

On Sun, Nov 21, 2010 at 2:49 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:

 Tom Lane t...@sss.pgh.pa.us writes:
  Here's my single biggest tip for newcomers to the Postgres source:
  if you don't use ctags, glimpse, or some other tool that can quickly
  show you all references to a given identifier, go out and get one.
  It's one of the easiest ways to learn about things.

 I've been using cscope (on the advice of Joshua Tolley), and even
 better its Emacs integration, called xcscope.el --- I
 wouldn't have been able to come up with the extension patch series
 without that.

  http://cscope.sourceforge.net/

 And as a quick teaser, the keys I mostly use:

  C-c s d   cscope-find-global-definition
  C-c s s   cscope-find-this-symbol
  C-c s f   cscope-find-this-file
  C-c s I   cscope-index-files

 Regards,
 --
 Dimitri Fontaine
 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-11-20 Thread Terry Laurenzo
I've got a new stripped-down version of the binary JSON plugin on github:
https://github.com/tlaurenzo/pgjson

With all due warning of contrived benchmarks, I wrote some tests to see
where things stand.  The test script is here:
https://github.com/tlaurenzo/pgjson/blob/master/testdocs/runbenchmark.sh
The results from my laptop (first-gen MacBook Pro, 32-bit, with Snow Leopard
and PostgreSQL 9.0) are here and copied into this email at the end:

https://github.com/tlaurenzo/pgjson/blob/master/testdocs/benchmark-results-2010-11-20-mbp32.txt

And for some commentary...
I copied the 5 sample documents from json.org's example section for these
tests.  These are loaded into a table with a varchar column 1000 times each
(so the test table has 5000 rows in it).  In all situations, the binary
encoding was smaller than the normalized text form (between 9 and 23%
smaller).  I think there are cases where the binary form will be larger than
the corresponding text form, but I don't think they would be very common.

For the timings, various operations are performed on the 5000 row test table
for 100 iterations.  In all situations, the query returns the length of the
transformed document instead of the document itself so as to factor out
variable client IO between the test runs.  The "Null Parse" case is
essentially just a select from the table and therefore represents the
baseline.  The
times varied a little bit between runs but did not change materially.
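
Concretely, each timing step boils down to a query of this shape (a
sketch only; the table and function names here are placeholders rather
than the actual ones from runbenchmark.sh):

  -- baseline (Null Parse): no transformation at all
  SELECT sum(length(doc)) FROM json_test;
  -- transform, then take the length, so client IO stays out of the picture
  SELECT sum(length(roundtrip_via_binary(doc))) FROM json_test;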

What we see from this is that parsing JSON text and generating a binary
representation is cheap, representing approximately 10% of the base case
time.  Conversely, anything that involves generating JSON text is expensive,
accounting for 30-40% of the base case time.  Some incidental profiling
shows that while the entire operation is expensive, the process of
generating string literals dominates this time.  There is likely room for
optimization in this method, but it should be noted that most of these
documents are lightly escaped (if escaped at all) which represents the happy
path through the string literal output function.

While I have not profiled any advanced manipulation of the binary structure
within the server, it stands to reason that manipulating the binary
structure should be significantly faster than an approach that requires
(perhaps multiple) transcoding between text representations in order to
complete a sequence of operations.

Assuming that the JSON datatype (at a minimum) normalizes text for storage,
the text storage option is roughly the most expensive path, but with
none of the benefits of an internal binary form (smaller size, ability
to cheaply perform non-trivial manipulation within the database server).

Of course, just having a JSON datatype that blindly stores text will beat
everything, but I'm getting closer to thinking that the binary option is
worth the tradeoff.

Comments?
Terry

Running benchmark with 100 iterations per step
Loading data...
Data loaded.
=== DOCUMENT SIZE STATISTICS ===
    Document Name     | Original Size | Binary Size | Normalized Size | Percentage Savings
----------------------+---------------+-------------+-----------------+--------------------
 jsonorg_sample1.json |           582 |         311 |             360 |   13.6
 jsonorg_sample2.json |           241 |         146 |             183 |   20.2185792349727
 jsonorg_sample3.json |           601 |         326 |             389 |   16.1953727506427
 jsonorg_sample4.json |          3467 |        2466 |            2710 |   9.00369003690037
 jsonorg_sample5.json |           872 |         469 |             613 |   23.4910277324633
(5 rows)

=== TEST PARSE AND SERIALIZATION ===
Null Parse:
   12.12 real 2.51 user 0.13 sys
Parse to Binary:
   13.38 real 2.51 user 0.14 sys
Serialize from Binary:
   16.65 real 2.51 user 0.13 sys
Normalize Text:
   18.99 real 2.51 user 0.13 sys
Roundtrip (parse to binary and serialize):
   18.58 real 2.51 user 0.14 sys


Re: [HACKERS] Fwd: patch: format function - fixed oid

2010-11-20 Thread Pavel Stehule
2010/11/21 Robert Haas robertmh...@gmail.com:
 On Sat, Nov 20, 2010 at 7:29 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 2010/11/20 Robert Haas robertmh...@gmail.com:
 On Sat, Nov 20, 2010 at 2:59 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 I think you are correct. Who will submit this final version? You or
 me?

 I've got it.


 Thank you

 have a nice day

 OK, I've committed this, after a fairly heavy rewrite.

thank you very much

regards

Pavel Stehule

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

