Re: [HACKERS] strncpy is not a safe version of strcpy

2014-08-13 Thread Tom Lane
Noah Misch  writes:
> On Wed, Aug 13, 2014 at 10:21:50AM -0400, Tom Lane wrote:
>> I believe that we deal with this by the expedient of checking the lengths
>> of tablespace paths in advance, when the tablespace is created.

> The files under scrutiny here are not located in a tablespace.  Even if they
> were, isn't the length of $PGDATA/pg_tblspc the important factor?

The length of $PGDATA is of no relevance whatsoever; we chdir into that
directory at startup, and subsequently all paths are implicitly relative
to there.  If there is any backend code that's prepending $PGDATA to
something else, it's wrong to start with.

regards, tom lane




Re: [HACKERS] [patch] pg_copy - a command for reliable WAL archiving

2014-08-13 Thread MauMau

I fixed some minor mistakes.

Regards
MauMau


pg_copy_v3.patch
Description: Binary data



Compute attr_needed for child relations (was Re: [HACKERS] inherit support for foreign tables)

2014-08-13 Thread Etsuro Fujita
(2014/08/08 18:51), Etsuro Fujita wrote:
>>> (2014/06/30 22:48), Tom Lane wrote:
 I wonder whether it isn't time to change that.  It was coded like that
 originally only because calculating the values would've been a waste of
 cycles at the time.  But this is at least the third place where it'd be
 useful to have attr_needed for child rels.

> I've revised the patch.

There was a problem with the previous patch, which will be described
below.  Attached is the updated version of the patch addressing that.

The previous patch doesn't cope with some UNION ALL cases properly.  So,
e.g., the server will crash for the following query:

postgres=# create table ta1 (f1 int);
CREATE TABLE
postgres=# create table ta2 (f2 int primary key, f3 int);
CREATE TABLE
postgres=# create table tb1 (f1 int);
CREATE TABLE
postgres=# create table tb2 (f2 int primary key, f3 int);
CREATE TABLE
postgres=# explain verbose select f1 from ((select f1, f2 from (select
f1, f2, f3 from ta1 left join ta2 on f1 = f2 limit 1) ssa) union all
(select f1,
f2 from (select f1, f2, f3 from tb1 left join tb2 on f1 = f2 limit 1)
ssb)) ss;

With the updated version, we get the right result:

postgres=# explain verbose select f1 from ((select f1, f2 from (select
f1, f2, f3 from ta1 left join ta2 on f1 = f2 limit 1) ssa) union all
(select f1,
f2 from (select f1, f2, f3 from tb1 left join tb2 on f1 = f2 limit 1)
ssb)) ss;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Append  (cost=0.00..0.05 rows=2 width=4)
   ->  Subquery Scan on ssa  (cost=0.00..0.02 rows=1 width=4)
         Output: ssa.f1
         ->  Limit  (cost=0.00..0.01 rows=1 width=4)
               Output: ta1.f1, (NULL::integer), (NULL::integer)
               ->  Seq Scan on public.ta1  (cost=0.00..34.00 rows=2400 width=4)
                     Output: ta1.f1, NULL::integer, NULL::integer
   ->  Subquery Scan on ssb  (cost=0.00..0.02 rows=1 width=4)
         Output: ssb.f1
         ->  Limit  (cost=0.00..0.01 rows=1 width=4)
               Output: tb1.f1, (NULL::integer), (NULL::integer)
               ->  Seq Scan on public.tb1  (cost=0.00..34.00 rows=2400 width=4)
                     Output: tb1.f1, NULL::integer, NULL::integer
 Planning time: 0.453 ms
(14 rows)

While I was thinking about how to address this problem, Ashutosh also
expressed concern about the UNION ALL handling in the previous patch in a
private email.  Thank you for the review, Ashutosh!

Thanks,

Best regards,
Etsuro Fujita
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***
*** 799,806  check_selective_binary_conversion(RelOptInfo *baserel,
  	}
  
  	/* Collect all the attributes needed for joins or final output. */
! 	pull_varattnos((Node *) baserel->reltargetlist, baserel->relid,
!    &attrs_used);
  
  	/* Add all the attributes used by restriction clauses. */
  	foreach(lc, baserel->baserestrictinfo)
--- 799,810 
  	}
  
  	/* Collect all the attributes needed for joins or final output. */
! 	for (i = baserel->min_attr; i <= baserel->max_attr; i++)
! 	{
! 		if (!bms_is_empty(baserel->attr_needed[i - baserel->min_attr]))
! 			attrs_used = bms_add_member(attrs_used,
! 		i - FirstLowInvalidHeapAttributeNumber);
! 	}
  
  	/* Add all the attributes used by restriction clauses. */
  	foreach(lc, baserel->baserestrictinfo)
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
***
*** 577,588  set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
  		childrel->has_eclass_joins = rel->has_eclass_joins;
  
  		/*
! 		 * Note: we could compute appropriate attr_needed data for the child's
! 		 * variables, by transforming the parent's attr_needed through the
! 		 * translated_vars mapping.  However, currently there's no need
! 		 * because attr_needed is only examined for base relations not
! 		 * otherrels.  So we just leave the child's attr_needed empty.
  		 */
  
  		/*
  		 * Compute the child's size.
--- 577,585 
  		childrel->has_eclass_joins = rel->has_eclass_joins;
  
  		/*
! 		 * Compute the child's attr_needed.
  		 */
+ 		adjust_appendrel_attr_needed(rel, childrel, appinfo);
  
  		/*
  		 * Compute the child's size.
***
*** 2173,2178  remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
--- 2170,2176 
  {
  	Bitmapset  *attrs_used = NULL;
  	ListCell   *lc;
+ 	int			i;
  
  	/*
  	 * Do nothing if subquery has UNION/INTERSECT/EXCEPT: in principle we
***
*** 2193,2204  remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
  	 * Collect a bitmap of all the output column numbers used by the upper
  	 * query.
  	 *
! 	 * Add all the attributes needed for joins or final output.  Note: we must
! 	 * look at reltargetlist, not the attr_needed data, because attr_needed
! 	 * isn't computed for inheritance child rels, cf set_append_rel_size().
! 	 * (XXX might be worth changi

[HACKERS] proposal: allow to specify result tupdesc and mode in SPI API

2014-08-13 Thread Pavel Stehule
Hello

We have no simple way to specify result types in the SPI API functions.
The planner has this functionality - see the transformInsertRow function -
but it is not exposed through SPI.

A new function could look like this:



SPIPlanPtr
SPI_prepare_params_rettupdesc(const char *src,
   ParserSetupHook parserSetup,
   void *parserSetupArg,
   int cursorOptions,
   TupDesc *retTupDesc,
   int CoercionMode)


CoercionMode should be:

COERCION_MODE_SQL  .. same as INSERT or UPDATE does
COERCION_MODE_SQL_NOERROR .. same as above with possible IO cast
COERCION_MODE_EXPLICIT .. same as using explicit casting
COERCION_MODE_EXPLICIT_NOERROR .. same as previous with possible IO cast

Benefits:

* simplifies life for SPI users - no late casting necessary

* possibly a small simplification of plpgsql, with two benefits:

  ** reduces the performance impact of the hidden I/O cast

  ** reduces possible issues with type transformation via the hidden I/O cast
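
For illustration, a caller of the proposed function might look roughly like
this (nothing below exists today; I read TupDesc as the existing TupleDesc
type, and the target row shape is only an example):

TupleDesc   rettupdesc;
SPIPlanPtr  plan;

/* describe the result shape we want back: (id int4, payload text) */
rettupdesc = CreateTemplateTupleDesc(2, false);
TupleDescInitEntry(rettupdesc, (AttrNumber) 1, "id", INT4OID, -1, 0);
TupleDescInitEntry(rettupdesc, (AttrNumber) 2, "payload", TEXTOID, -1, 0);

plan = SPI_prepare_params_rettupdesc("SELECT a, b FROM t",
                                     NULL,   /* parserSetup */
                                     NULL,   /* parserSetupArg */
                                     0,      /* cursorOptions */
                                     &rettupdesc,
                                     COERCION_MODE_SQL);

SPI_execute_plan() would then hand back tuples already coerced to that
descriptor, instead of the caller casting each value afterwards.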

Comments, notes?

Regards

Pavel Stehule


Re: [HACKERS] strncpy is not a safe version of strcpy

2014-08-13 Thread Noah Misch
On Wed, Aug 13, 2014 at 10:21:50AM -0400, Tom Lane wrote:
> Kevin Grittner  writes:
> > I am concerned that failure to check for truncation could allow
> > deletion of unexpected files or directories.
> 
> I believe that we deal with this by the expedient of checking the lengths
> of tablespace paths in advance, when the tablespace is created.

The files under scrutiny here are not located in a tablespace.  Even if they
were, isn't the length of $PGDATA/pg_tblspc the important factor?  $PGDATA can
change between runs if the DBA moves the data directory or reaches it via
different symlinks, so any DDL-time defense would be incomplete.

> > Some might consider it overkill, but I tend to draw a pretty hard
> > line on deleting or executing random files, even if the odds seem
> > to be that the mangled name won't find a match.  Granted, those 
> > problems exist now, but without checking for truncation it seems to 
> > me that we're just deleting *different* incorrect filenames, not 
> > really fixing the problem.

I share your (Kevin's) discomfort with our use of strlcpy().  I wouldn't mind
someone replacing most strlcpy()/snprintf() calls with calls to wrappers that
ereport(ERROR) on truncation.  Though as reliability problems go, this one has
been minor.
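
Such a wrapper could look roughly like this (only a sketch of the idea, not
an existing function; the error code and message are illustrative):

/*
 * Like strlcpy(), but raise an error instead of silently truncating.
 * strlcpy() returns strlen(src), so a result >= dstsize means the copy
 * was truncated.
 */
static void
strlcpy_or_ereport(char *dst, const char *src, size_t dstsize)
{
    if (strlcpy(dst, src, dstsize) >= dstsize)
        ereport(ERROR,
                (errcode(ERRCODE_NAME_TOO_LONG),
                 errmsg("string \"%s\" is too long for a buffer of %d bytes",
                        src, (int) dstsize)));
}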

David's specific patch has no concrete problem:

On Wed, Aug 13, 2014 at 10:26:01PM +1200, David Rowley wrote:
> --- a/contrib/pg_archivecleanup/pg_archivecleanup.c
> +++ b/contrib/pg_archivecleanup/pg_archivecleanup.c
> @@ -108,7 +108,7 @@ CleanupPriorWALFiles(void)
>   {
>   while (errno = 0, (xlde = readdir(xldir)) != NULL)
>   {
> - strncpy(walfile, xlde->d_name, MAXPGPATH);
> + strlcpy(walfile, xlde->d_name, MAXPGPATH);

The code proceeds to check strlen(walfile) == XLOG_DATA_FNAME_LEN, so a long
name can't trick it.

>   TrimExtension(walfile, additional_ext);
>  
>   /*
> diff --git a/src/backend/access/transam/xlogarchive.c 
> b/src/backend/access/transam/xlogarchive.c
> index 37745dc..0c9498a 100644
> --- a/src/backend/access/transam/xlogarchive.c
> +++ b/src/backend/access/transam/xlogarchive.c
> @@ -459,7 +459,7 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
>   xlogfpath, oldpath)));
>   }
>  #else
> - strncpy(oldpath, xlogfpath, MAXPGPATH);
> + strlcpy(oldpath, xlogfpath, MAXPGPATH);

This one never overflows, because it's copying from one MAXPGPATH buffer to
another.  Plain strcpy() would be fine, too.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] replication commands and log_statements

2014-08-13 Thread Amit Kapila
On Thu, Aug 14, 2014 at 5:56 AM, Stephen Frost  wrote:
> * Amit Kapila (amit.kapil...@gmail.com) wrote:
> Oh, to be clear- I agree that logging of queries executed through SPI is
> desirable; I'd certainly like to have that capability without having to
> go through the auto-explain module or write my own module.  That doesn't
> mean we should consider them either "internal" (which I don't really
> agree with- consider anonymous DO blocks...) or somehow related to
> replication logging.
>
> > >Could we enable logging of both with a single GUC?
> > >
> > > I don't see those as the same at all.  I'd like to see improved
> > > flexibility in this area, certainly, but don't want two independent
> > > considerations like these tied to one GUC..
> >
> > Agreed, I also think both are different and shouldn't be logged
> > with one GUC.  Here the important thing to decide is which is
> > the better way to proceed for allowing logging of replication
> > commands such that it can allow us to extend it for more
> > things.  We have discussed below options:
> >
> > a. Make log_statement a list of comma separated options
> > b. Have a separate parameter something like log_replication*
> > c. Log it when user has mentioned option 'all' in log_statement.
>
> Regarding this, I'm generally in the camp that says to just include it
> in 'all' and be done with it- for now.

Okay, but if tomorrow someone wants to implement a patch to log
statements executed through SPI (statements inside functions), what
would your suggestion be: should those also be logged under the 'all'
option, or would you ask them to wait for a proper auditing system?

Wouldn't logging everything under the 'all' option start to confuse
some users, without individual (ddl, mod, replication, ...) options
for the different kinds of statements?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] pg_dump bug in 9.4beta2 and HEAD

2014-08-13 Thread Joachim Wieland
I'm seeing an assertion failure with "pg_dump -c --if-exists", which it
seems is not ready to handle BLOBs:

pg_dump: pg_backup_archiver.c:472: RestoreArchive: Assertion `mark !=
((void *)0)' failed.

To reproduce:

$ createdb test
$ pg_dump -c --if-exists test  (works, dumps empty database)
$ psql test -c "select lo_create(1);"
$ pg_dump -c --if-exists test  (fails, with the above mentioned assertion)




Re: [HACKERS] Scaling shared buffer eviction

2014-08-13 Thread Amit Kapila
On Wed, Aug 13, 2014 at 4:25 PM, Andres Freund 
wrote:
>
> On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:
> > Overall, the main changes required in patch as per above feedback
> > are:
> > 1. add an additional counter for the number of those
> > allocations not satisfied from the free list, with a
> > name like buffers_alloc_clocksweep.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist. In the patch, I have kept them as hard-coded
> > values.
> > 3. For populating freelist, have a separate process (bgreclaim)
> > instead of doing it by bgwriter.
>
> I'm not convinced that 3) is the right way to go to be honest. Seems
> like a huge bandaid to me.

Doing both jobs (populating the freelist and flushing dirty buffers) in
bgwriter isn't the best way either, because it might not be able to keep
up with both as needed.
One example: flushing a dirty buffer can take much longer than moving a
buffer onto the freelist, so even if only a few buffers need flushing,
the task of keeping the freelist populated will suffer, even though
there are non-dirty buffers that could be moved onto it.
Another is preserving the current behaviour of bgwriter, which is to scan
the entire buffer pool every few minutes (assuming the default
configuration).  We could attempt to solve that as Robert suggested
upthread, but I am not completely sure it can guarantee that the current
behaviour is retained as-is.

I am not saying that having a separate process won't have any issues,
but I think we can tackle those without changing or complicating the
current bgwriter behaviour.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-13 Thread Tom Lane
Andrew Dunstan  writes:
> On 08/13/2014 09:01 PM, Tom Lane wrote:
>>> That's a fair question.  I did a very very simple hack to replace the item
>>> offsets with item lengths -- turns out that that mostly requires removing
>>> some code that changes lengths to offsets ;-).

> What does changing to lengths do to the speed of other operations?

This was explicitly *not* an attempt to measure the speed issue.  To do
a fair trial of that, you'd have to work a good bit harder, methinks.
Examining each of N items would involve O(N^2) work with the patch as
posted, but presumably you could get it down to less in most or all
cases --- in particular, sequential traversal could be done with little
added cost.  But it'd take a lot more hacking.

regards, tom lane




Re: [HACKERS] replication commands and log_statements

2014-08-13 Thread Michael Paquier
On Thu, Aug 14, 2014 at 10:40 AM, Fujii Masao  wrote:
> On Thu, Aug 14, 2014 at 9:26 AM, Stephen Frost  wrote:
>> * Amit Kapila (amit.kapil...@gmail.com) wrote:
>>> On Wed, Aug 13, 2014 at 4:24 AM, Stephen Frost  wrote:
>>> > Not entirely sure what you're referring to as 'internally generated'
>>> > here..
>>>
>>> Here 'internally generated' means that the user doesn't execute those
>>> statements; rather, the replication/backup code forms these statements
>>> (IDENTIFY_SYSTEM, TIMELINE_HISTORY, BASE_BACKUP, ...)
>>> and sends them to the server to get the appropriate results.
>>
>> You could argue the same about pg_dump..  I'd not thought of it before,
>> but it might be kind of neat to have psql support "connect in
>> replication mode" and then allow the user to run replication commands.
>
> You can do that by specifying "replication=1" as the conninfo when
> executing psql. For example,
>
> $ psql -d "replication=1"
> psql (9.5devel)
> Type "help" for help.
>
> postgres=# IDENTIFY_SYSTEM;
>       systemid       | timeline |  xlogpos  | dbname
> ---------------------+----------+-----------+--------
>  6047222920639525794 |        1 | 0/1711678 |
> (1 row)
More details here:
http://www.postgresql.org/docs/9.4/static/protocol-replication.html

Note as well that not all the commands work though:
=# START_REPLICATION PHYSICAL 0/0;
unexpected PQresultStatus: 8
=# START_REPLICATION PHYSICAL 0/0;
PQexec not allowed during COPY BOTH
We may do something to improve that, but I am not sure it is worth it.
Regards,
-- 
Michael




Re: [HACKERS] option -T in pg_basebackup doesn't work on windows

2014-08-13 Thread Michael Paquier
On Thu, Aug 14, 2014 at 12:50 AM, MauMau  wrote:
> The code change appears correct, but the patch application failed against
> the latest source code.  I don't know why.  Could you confirm this?
>
> patching file src/bin/pg_basebackup/pg_basebackup.c
> Hunk #1 FAILED at 1119.
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/bin/pg_basebackup/pg_basebackup.c.rej
This conflict is caused by f25e0bf.

> On the following line, I think %d must be %u, because Oid is an unsigned
> integer.
>  char*linkloc = psprintf("%s/pg_tblspc/%d", basedir, oid);
That's correct.
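
That is, the line would presumably become:

 char    *linkloc = psprintf("%s/pg_tblspc/%u", basedir, oid);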
-- 
Michael




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-13 Thread Andrew Dunstan


On 08/13/2014 09:01 PM, Tom Lane wrote:

I wrote:

That's a fair question.  I did a very very simple hack to replace the item
offsets with item lengths -- turns out that that mostly requires removing
some code that changes lengths to offsets ;-).  I then loaded up Larry's
example of a noncompressible JSON value, and compared pg_column_size()
which is just about the right thing here since it reports datum size after
compression.  Remembering that the textual representation is 12353 bytes:
json:   382 bytes
jsonb, using offsets:   12593 bytes
jsonb, using lengths:   406 bytes

Oh, one more result: if I leave the representation alone, but change
the compression parameters to set first_success_by to INT_MAX, this
value takes up 1397 bytes.  So that's better, but still more than a
3X penalty compared to using lengths.  (Admittedly, this test value
probably is an outlier compared to normal practice, since it's a hundred
or so repetitions of the same two strings.)

What does changing to lengths do to the speed of other operations?

cheers

andrew





Re: [HACKERS] replication commands and log_statements

2014-08-13 Thread Fujii Masao
On Thu, Aug 14, 2014 at 9:26 AM, Stephen Frost  wrote:
> * Amit Kapila (amit.kapil...@gmail.com) wrote:
>> On Wed, Aug 13, 2014 at 4:24 AM, Stephen Frost  wrote:
>> > Not entirely sure what you're referring to as 'internally generated'
>> > here..
>>
>> Here 'internally generated' means that the user doesn't execute those
>> statements; rather, the replication/backup code forms these statements
>> (IDENTIFY_SYSTEM, TIMELINE_HISTORY, BASE_BACKUP, ...)
>> and sends them to the server to get the appropriate results.
>
> You could argue the same about pg_dump..  I'd not thought of it before,
> but it might be kind of neat to have psql support "connect in
> replication mode" and then allow the user to run replication commands.

You can do that by specifying "replication=1" as the conninfo when
executing psql. For example,

$ psql -d "replication=1"
psql (9.5devel)
Type "help" for help.

postgres=# IDENTIFY_SYSTEM;
      systemid       | timeline |  xlogpos  | dbname
---------------------+----------+-----------+--------
 6047222920639525794 |        1 | 0/1711678 |
(1 row)


>> Agreed, I also think both are different and shouldn't be logged
>> with one GUC.  Here the important thing to decide is which is
>> the better way to proceed for allowing logging of replication
>> commands such that it can allow us to extend it for more
>> things.  We have discussed below options:
>>
>> a. Make log_statement a list of comma separated options
>> b. Have a separate parameter something like log_replication*
>> c. Log it when user has mentioned option 'all' in log_statement.
>
> Regarding this, I'm generally in the camp that says to just include it
> in 'all' and be done with it- for now.  This is just another example
> of where we really need a much more granular and configurable framework
> for auditing and we're not going to get through GUCs, so let's keep the
> GUC based approach simple and familiar to our users and build a proper
> auditing system.

+1

Regards,

-- 
Fujii Masao




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-13 Thread Tom Lane
I wrote:
> That's a fair question.  I did a very very simple hack to replace the item
> offsets with item lengths -- turns out that that mostly requires removing
> some code that changes lengths to offsets ;-).  I then loaded up Larry's
> example of a noncompressible JSON value, and compared pg_column_size()
> which is just about the right thing here since it reports datum size after
> compression.  Remembering that the textual representation is 12353 bytes:

> json: 382 bytes
> jsonb, using offsets: 12593 bytes
> jsonb, using lengths: 406 bytes

Oh, one more result: if I leave the representation alone, but change
the compression parameters to set first_success_by to INT_MAX, this
value takes up 1397 bytes.  So that's better, but still more than a
3X penalty compared to using lengths.  (Admittedly, this test value
probably is an outlier compared to normal practice, since it's a hundred
or so repetitions of the same two strings.)

regards, tom lane




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-13 Thread Tom Lane
Bruce Momjian  writes:
> Seems we have two issues:
> 1)  the header makes testing for compression likely to fail
> 2)  use of pointers rather than offsets reduces compression potential

> I understand we are focusing on #1, but how much does compression reduce
> the storage size with and without #2?  Seems we need to know that answer
> before deciding if it is worth reducing the ability to do fast lookups
> with #2.

That's a fair question.  I did a very very simple hack to replace the item
offsets with item lengths -- turns out that that mostly requires removing
some code that changes lengths to offsets ;-).  I then loaded up Larry's
example of a noncompressible JSON value, and compared pg_column_size()
which is just about the right thing here since it reports datum size after
compression.  Remembering that the textual representation is 12353 bytes:

json:   382 bytes
jsonb, using offsets:   12593 bytes
jsonb, using lengths:   406 bytes

So this confirms my suspicion that the choice of offsets not lengths
is what's killing compressibility.  If it used lengths, jsonb would be
very nearly as compressible as the original text.

Hack attached in case anyone wants to collect more thorough statistics.
We'd not actually want to do it like this because of the large expense
of recomputing the offsets on-demand all the time.  (It does pass the
regression tests, for what that's worth.)

regards, tom lane

diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 04f35bf..2297504 100644
*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
*** convertJsonbArray(StringInfo buffer, JEn
*** 1378,1385 
  	 errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
  			JENTRY_POSMASK)));
  
- 		if (i > 0)
- 			meta = (meta & ~JENTRY_POSMASK) | totallen;
  		copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
  		metaoffset += sizeof(JEntry);
  	}
--- 1378,1383 
*** convertJsonbObject(StringInfo buffer, JE
*** 1430,1437 
  	 errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
  			JENTRY_POSMASK)));
  
- 		if (i > 0)
- 			meta = (meta & ~JENTRY_POSMASK) | totallen;
  		copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
  		metaoffset += sizeof(JEntry);
  
--- 1428,1433 
*** convertJsonbObject(StringInfo buffer, JE
*** 1445,1451 
  	 errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
  			JENTRY_POSMASK)));
  
- 		meta = (meta & ~JENTRY_POSMASK) | totallen;
  		copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
  		metaoffset += sizeof(JEntry);
  	}
--- 1441,1446 
*** uniqueifyJsonbObject(JsonbValue *object)
*** 1592,1594 
--- 1587,1600 
  		object->val.object.nPairs = res + 1 - object->val.object.pairs;
  	}
  }
+ 
+ uint32
+ jsonb_get_offset(const JEntry *ja, int index)
+ {
+ 	uint32	off = 0;
+ 	int i;
+ 
+ 	for (i = 0; i < index; i++)
+ 		off += JBE_LEN(ja, i);
+ 	return off;
+ }
diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h
index 5f2594b..c9b18e1 100644
*** a/src/include/utils/jsonb.h
--- b/src/include/utils/jsonb.h
*** typedef uint32 JEntry;
*** 153,162 
   * Macros for getting the offset and length of an element. Note multiple
   * evaluations and access to prior array element.
   */
! #define JBE_ENDPOS(je_)			((je_) & JENTRY_POSMASK)
! #define JBE_OFF(ja, i)			((i) == 0 ? 0 : JBE_ENDPOS((ja)[i - 1]))
! #define JBE_LEN(ja, i)			((i) == 0 ? JBE_ENDPOS((ja)[i]) \
!  : JBE_ENDPOS((ja)[i]) - JBE_ENDPOS((ja)[i - 1]))
  
  /*
   * A jsonb array or object node, within a Jsonb Datum.
--- 153,163 
   * Macros for getting the offset and length of an element. Note multiple
   * evaluations and access to prior array element.
   */
! #define JBE_LENFLD(je_)			((je_) & JENTRY_POSMASK)
! #define JBE_OFF(ja, i)			jsonb_get_offset(ja, i)
! #define JBE_LEN(ja, i)			JBE_LENFLD((ja)[i])
! 
! extern uint32 jsonb_get_offset(const JEntry *ja, int index);
  
  /*
   * A jsonb array or object node, within a Jsonb Datum.



Re: [HACKERS] replication commands and log_statements

2014-08-13 Thread Stephen Frost
* Amit Kapila (amit.kapil...@gmail.com) wrote:
> On Wed, Aug 13, 2014 at 4:24 AM, Stephen Frost  wrote:
> > Not entirely sure what you're referring to as 'internally generated'
> > here..
> 
> Here 'internally generated' means that the user doesn't execute those
> statements; rather, the replication/backup code forms these statements
> (IDENTIFY_SYSTEM, TIMELINE_HISTORY, BASE_BACKUP, ...)
> and sends them to the server to get the appropriate results.

You could argue the same about pg_dump..  I'd not thought of it before,
but it might be kind of neat to have psql support "connect in
replication mode" and then allow the user to run replication commands.
Still, this isn't the same.

> Yes, a few days back I saw that one user expected
> such functionality. Refer to the mail below:
> http://www.postgresql.org/message-id/caa4ek1lvuirqnmjv1vtmrg_f+1f9e9-8rdgnwidscqvtps1...@mail.gmail.com

Oh, to be clear- I agree that logging of queries executed through SPI is
desirable; I'd certainly like to have that capability without having to
go through the auto-explain module or write my own module.  That doesn't
mean we should consider them either "internal" (which I don't really
agree with- consider anonymous DO blocks...) or somehow related to
replication logging.

> >Could we enable logging of both with a single GUC?
> >
> > I don't see those as the same at all.  I'd like to see improved
> > flexibility in this area, certainly, but don't want two independent
> > considerations like these tied to one GUC..
> 
> Agreed, I also think both are different and shouldn't be logged
> with one GUC.  Here the important thing to decide is which is
> the better way to proceed for allowing logging of replication
> commands such that it can allow us to extend it for more
> things.  We have discussed below options:
> 
> a. Make log_statement a list of comma separated options
> b. Have a separate parameter something like log_replication*
> c. Log it when user has mentioned option 'all' in log_statement.

Regarding this, I'm generally in the camp that says to just include it
in 'all' and be done with it- for now.  This is just another example
of where we really need a much more granular and configurable framework
for auditing and we're not going to get through GUCs, so let's keep the
GUC based approach simple and familiar to our users and build a proper
auditing system. 

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-13 Thread
Tom,

I appreciate your patience and explanation. (I am new to PostgreSQL hacking.  I 
have read many old posts but not all of it sticks, sorry).
I know QNX support is not high on your TODO list, so I am trying to keep the 
effort moving without being a distraction.

Couldn't backend "random code" corrupt any file in the PGDATA dir?
Perhaps the new fcntl lock file could be kept outside the PGDATA directory
tree to make the likelihood of backend "random code" interference remote.
This could be present and used only on systems without System V shared memory 
(QNX), leaving existing platforms unaffected.

I know this falls short of perfect, but perhaps it is good enough to get the QNX
port off the ground.
I would rather have a QNX port with reasonable restrictions than no port at all.

Also, I will try to experiment with named pipe locking as Robert had suggested.
Thanks again for your feedback, I really do appreciate it.

-Keith Baker

> -Original Message-
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Sent: Wednesday, August 13, 2014 7:05 PM
> To: Baker, Keith [OCDUS Non-J&J]
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
> 
> "Baker, Keith [OCDUS Non-J&J]"  writes:
> > I assume you guys are working on other priorities, so I did some locking
> experiments on QNX.
> 
> > I know fcntl() locking has downsides, but I think it deserves a second look:
> > - it is POSIX, so should be fairly consistent across platforms (at
> > least more consistent than lockf and flock)
> > - the "accidental" open/close lock release can be easily avoided
> > (simply don't add new code which touches the new, unique lock file)
> 
> I guess you didn't read the previous discussion.  Asserting that it's "easy to
> avoid" an accidental unlock doesn't make it true.  In the case of a PG
> backend, we have to expect that people will run random code inside, say,
> plperlu or plpythonu functions.  And it doesn't seem unlikely that someone
> might scan the entire PGDATA directory tree as part of, for example, a
> backup or archiving operation.  If we had full control of everything that ever
> happens in a PG backend process then *maybe* we could have adequate
> confidence that we'd never lose the lock, but we don't.
> 
>   regards, tom lane




Re: [HACKERS] What happened to jsonb's JENTRY_ISFIRST?

2014-08-13 Thread Peter Geoghegan
On Wed, Aug 13, 2014 at 3:47 PM, Tom Lane  wrote:
> If the bit is unused now, should we be worrying about reclaiming it for
> better use?  Like say allowing jsonb's to be larger than just a quarter
> of the maximum datum size?

Commit d9daff0e0cb15221789e6c50d9733c8754c054fb removed it. This is an
obsolete comment.


-- 
Peter Geoghegan




Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-13 Thread Tom Lane
"Baker, Keith [OCDUS Non-J&J]"  writes:
> I assume you guys are working on other priorities, so I did some locking 
> experiments on QNX.

> I know fcntl() locking has downsides, but I think it deserves a second look:
> - it is POSIX, so should be fairly consistent across platforms (at least more 
> consistent than lockf and flock)
> - the "accidental" open/close lock release can be easily avoided (simply 
> don't add new code which touches the new, unique lock file)

I guess you didn't read the previous discussion.  Asserting that it's
"easy to avoid" an accidental unlock doesn't make it true.  In the case of
a PG backend, we have to expect that people will run random code inside,
say, plperlu or plpythonu functions.  And it doesn't seem unlikely that
someone might scan the entire PGDATA directory tree as part of, for
example, a backup or archiving operation.  If we had full control of
everything that ever happens in a PG backend process then *maybe* we could
have adequate confidence that we'd never lose the lock, but we don't.

regards, tom lane




[HACKERS] What happened to jsonb's JENTRY_ISFIRST?

2014-08-13 Thread Tom Lane
jsonb.h claims that the high order bit of a JEntry word is set on the
first element of a JEntry array.  However, AFAICS, JBE_ISFIRST() is
not used anywhere, which is a good thing because it refers to a constant
JENTRY_ISFIRST that's not defined anywhere.  Is this comment just a leftover
from a convention that's been dropped, or is it still implemented but not
via this macro?

If the bit is unused now, should we be worrying about reclaiming it for
better use?  Like say allowing jsonb's to be larger than just a quarter
of the maximum datum size?

regards, tom lane




Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-13 Thread
Robert and Tom,

I assume you guys are working on other priorities, so I did some locking 
experiments on QNX.

I know fcntl() locking has downsides, but I think it deserves a second look:
- it is POSIX, so should be fairly consistent across platforms (at least more 
consistent than lockf and flock)
- the "accidental" open/close lock release can be easily avoided (simply don't 
add new code which touches the new, unique lock file)
- don't know if it will work on NFS, but that is not a priority for me (is that 
really a requirement for a QNX port?)

Existing System V shared memory locking can be left in place for all existing 
platforms (so nothing lost), while fcntl()-style locking could be limited to 
platforms which lack System V shared memory (like QNX).

Experimental patch is attached, but the logic is basically this (a rough
sketch of the locking call follows the list):
a. postmaster obtains exclusive lock on data dir file "postmaster.fcntl" (or FATAL)
b. postmaster then downgrades to shared lock (or FATAL)
c. all other backend processes obtain shared lock on this file (or FATAL)
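
A rough sketch of the locking call both the postmaster and the backends
would use (simplified; the headers, names and error handling here are
illustrative, not the attached patch):

#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Take an fcntl() record lock on lockpath, exclusive or shared.  Returns
 * the open fd on success, or -1 on failure (caller reports FATAL).  The fd
 * must stay open for the life of the process: closing ANY descriptor on the
 * file drops the lock, which is exactly the hazard discussed in this thread.
 */
static int
acquire_fcntl_lock(const char *lockpath, bool exclusive)
{
    int         fd = open(lockpath, O_RDWR | O_CREAT, 0600);
    struct flock lk;

    if (fd < 0)
        return -1;

    lk.l_type = exclusive ? F_WRLCK : F_RDLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;               /* lock the whole file */

    if (fcntl(fd, F_SETLK, &lk) < 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

The postmaster would call this with exclusive = true (step a) and then issue
a second F_SETLK with F_RDLCK on the same fd to downgrade (step b); backends
call it with exclusive = false (step c).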

A quick test on QNX 6.5 appeared to behave well (orphan backends left behind 
after kill -9 of postmaster held their locks, thus database restart was 
prevented as desired).
Let me know if there are other test scenarios to consider.

Thanks!

-Keith Baker


> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Thursday, July 31, 2014 12:58 PM
> To: Tom Lane
> Cc: Baker, Keith [OCDUS Non-J&J]; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL
> 
> On Wed, Jul 30, 2014 at 11:02 AM, Tom Lane  wrote:
> > So it seems like we could possibly go this route, assuming we can
> > think of a variant of your proposal that's race-condition-free.  A
> > disadvantage compared to a true file lock is that it would not protect
> > against people trying to start postmasters from two different NFS
> > client machines --- but we don't have protection against that now.
> > (Maybe we could do this *and* do a regular file lock to offer some
> > protection against that case, even if it's not bulletproof?)
> 
> That's not a bad idea.  By the way, it also wouldn't be too hard to test at
> runtime whether or not flock() has first-close semantics.  Not that we'd want
> this exact design, but suppose you configure shmem_interlock=flock in
> postgresql.conf.  On startup, we test whether flock is reliable, determine
> that it is, and proceed accordingly.
> Now, you move your database onto an NFS volume and the semantics
> change (because, hey, breaking userspace assumptions is fun) and try to
> restart up your database, and it says FATAL: flock() is broken.
> Now you can either move the database back, or set shmem_interlock to
> some other value.
> 
> Now maybe, as you say, it's best to use multiple locking protocols and hope
> that at least one will catch whatever the dangerous situation is.
> I'm just trying to point out that we need not blindly assume the semantics we
> want are there (or that they are not); we can check.
> 
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL
> Company


fcntl_lock_20140813.patch
Description: fcntl_lock_20140813.patch



Re: [HACKERS] how to implement selectivity injection in postgresql

2014-08-13 Thread Tom Lane
Jeff Janes  writes:
> On Wed, Aug 13, 2014 at 9:33 AM, Rajmohan C  wrote:
>> I need to implement Selectivity injection as shown in above query in
>> PostgreSQL by which we can inject selectivity of each predicate or at least
>> selectivity at relation level directly as part of query.

> My plan was to create a boolean operator which always returns true, but
> estimates its own selectivity as 0.001 (or better yet, parameterize that
> selectivity estimate, if that is possible) which can be inserted into the
> place where lower selectivity estimate is needed with an "AND".

That doesn't seem especially helpful/convenient, especially not if you're
trying to affect the estimation of a join clause.  The last discussion
I remember on this subject was to invent a special dummy function that
would be understood by the planner and would work sort of like
__builtin_expect() in gcc:

selectivity(condition bool, probability float8) returns bool

Semantically the function would just return its first argument (and
the function itself would disappear at runtime) but the planner would
take the value of the second argument as a selectivity estimate overriding
whatever it might've otherwise deduced about the "condition".  So
you'd use it like

  SELECT ... WHERE selectivity(id = 42, 0.0001)

and get functionally the same results as for

  SELECT ... WHERE id = 42

but with a different selectivity estimate for that WHERE condition.

regards, tom lane




Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Alvaro Herrera
Heikki Linnakangas wrote:

> Here's a full version of this refactoring patch, all the rmgr's have
> now been updated to use XLogReplayBuffer(). I think this is a
> worthwhile change on its own, even if we drop the ball on the rest
> of the WAL format patch, because it makes the redo-routines more
> readable. I propose to commit this as soon as someone has reviewed
> it, and we agree on a name for what's currently called
> XLogReplayBuffer(). I have tested this with my page-image comparison
> tool.

What's with XLogReplayLSN and XLogReplayRecord?  IMO the xlog code has
more than enough global variables already, it'd be good to avoid two
more if possible.  Is there no other good way to get this info down to
whatever it is that needs them?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Alvaro Herrera
Heikki Linnakangas wrote:

> Here's a full version of this refactoring patch, all the rmgr's have
> now been updated to use XLogReplayBuffer(). I think this is a
> worthwhile change on its own, even if we drop the ball on the rest
> of the WAL format patch, because it makes the redo-routines more
> readable. I propose to commit this as soon as someone has reviewed
> it, and we agree on a name for what's currently called
> XLogReplayBuffer(). I have tested this with my page-image comparison
> tool.

XLogReadBufferForReplay ?
ReadBufferForXLogReplay ?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] how to implement selectivity injection in postgresql

2014-08-13 Thread Jeff Janes
On Wed, Aug 13, 2014 at 9:33 AM, Rajmohan C  wrote:

> SELECT c1, c2, c3, FROM T1, T2, T3
> WHERE T1.x = T2.x AND
> T2.y=T3.y AND
> T1.x >= ? selectivity 0.1 AND
> T2.y > ? selectivity 0.5 AND
> T3.z = ? selectivity 0.2 AND
> T3.w = ?
>
>  I need to implement selectivity injection, as shown in the above query, in
> PostgreSQL, by which we can inject the selectivity of each predicate, or at
> least the selectivity at the relation level, directly as part of the query.
> Is there any ongoing work on this front? If there is none, how should I
> start implementing this feature?
>

My plan was to create a boolean operator which always returns true, but
estimates its own selectivity as 0.001 (or better yet, parameterize that
selectivity estimate, if that is possible) which can be inserted into the
place where lower selectivity estimate is needed with an "AND".

And another one that always returns false, but has a selectivity estimate
near 1, for use in OR conditions when the opposite change is needed.

I think that will be much easier to do than to extend the grammar.  And
probably more acceptable to the core team.

I think this could be done simply in an extension module without even
needing to change the core code, but I never got around to investigating
exactly how.
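
The C side of that extension-module idea could be as small as this (a
sketch; the function name is mine, and it would be wired up with CREATE
FUNCTION plus CREATE OPERATOR ( ..., RESTRICT = fixed_sel ) around a
procedure that simply returns true):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/*
 * Restriction-selectivity estimator that ignores the clause entirely and
 * reports a fixed selectivity.  The planner calls it with (root, operator
 * oid, args, varRelid), but none of that matters here.
 */
PG_FUNCTION_INFO_V1(fixed_sel);

Datum
fixed_sel(PG_FUNCTION_ARGS)
{
    PG_RETURN_FLOAT8(0.001);
}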

Cheers,

Jeff


Re: [HACKERS] PostrgeSQL vs oracle doing 1 million sqrts am I doing it wrong?

2014-08-13 Thread Pavel Stehule
2014-08-08 2:13 GMT+02:00 Josh Berkus :

> On 08/07/2014 04:48 PM, Tom Lane wrote:
> > plpgsql is not efficient at all about coercions performed as a side
> > effect of assignments; if memory serves, it always handles them by
> > converting to text and back.  So basically the added cost here came
> > from float8out() and float4in().  There has been some talk of trying
> > to do such coercions via SQL casts, but nothing's been done for fear
> > of compatibility problems.
>
> Yeah, that's a weeks-long project for someone.  And would require a lot
> of tests ...
>

It is not a trivial task. There are two possible directions, and neither is
trivial (and I am not sure about the practical benefit for users - maybe for
some PostGIS users - beyond some very synthetic benchmarks):

a) we can enhance plpgsql's exec_assign_value to accept a pointer to a cached
tupmap - this is relatively invasive in plpgsql, and brings no benefit to the
other PLs

b) we can enhance the SPI API to accept a target TupleDesc (reusing
transformInsertRow) - this should be a little less invasive in plpgsql, but
requires a change to the SPI API. This path seems much preferable - it can
be used in SQL/PSM and in lots of C extensions, and it is simpler to specify
the expected TupleDesc than to enforce casting by manipulating the SQL
string. I have missed this functionality several times; I designed PL/pgPSM
with the same type-strictness level as PostgreSQL, and this functionality
would simplify its code.

PL/pgSQL uses spi_prepare_params; we can either enhance this function or
introduce a new spi_prepare_params_enforce_result_type.
Regards

Pavel


>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] how to implement selectivity injection in postgresql

2014-08-13 Thread Euler Taveira
On 13-08-2014 15:28, Rajmohan C wrote:
> Yeah. I have to do it for my academic research. Is it available in
> catalogs? It is to be computed at run time from the predicates in the query
> right?
> 
The selectivity information is available at runtime. See
backend/optimizer/path/costsize.c.


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento




Re: [HACKERS] how to implement selectivity injection in postgresql

2014-08-13 Thread Euler Taveira
On 13-08-2014 13:33, Rajmohan C wrote:
> SELECT c1, c2, c3, FROM T1, T2, T3
> WHERE T1.x = T2.x AND
> T2.y=T3.y AND
> T1.x >= ? selectivity 0.1 AND
> T2.y > ? selectivity 0.5 AND
> T3.z = ? selectivity 0.2 AND
> T3.w = ?
> 
> I need to implement selectivity injection, as shown in the above query, in
> PostgreSQL, by which we can inject the selectivity of each predicate, or at
> least the selectivity at the relation level, directly as part of the query.
> Is there any ongoing work on this front? If there is none, how should I
> start implementing this feature?
> 
Do you want to force a selectivity? Why don't you let the optimizer do
it for you? Trust me, it can do it better than you. If you want to force
those selectivities for an academic exercise, that information belongs
in the catalog or could be SET before the query starts.

Start reading backend/optimizer/README.


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento




Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-13 Thread Tomas Vondra
On 13.8.2014 12:31, Tomas Vondra wrote:
> On 13 Srpen 2014, 7:02, Jeff Davis wrote:
>> On Tue, 2014-08-12 at 14:58 +0200, Tomas Vondra wrote:
>>>
>>>(b) bad estimate of required memory -> this is common for aggregates
>>>passing 'internal' state (planner uses some quite high defaults)
>>
>> Maybe some planner hooks? Ideas?
> 
> My plan is to add this to the CREATE AGGREGATE somehow - either as a
> constant parameter (allowing to set a custom constant size) or a callback
> to a 'sizing' function (estimating the size based on number of items,
> average width and ndistinct in the group). In any case, this is
> independent of this patch.

FWIW, the constant parameter is already implemented for 9.4. Adding the
function seems possible - the most difficult part seems to be getting
all the necessary info before count_agg_clauses() is called. For example
now dNumGroups is evaluated after the call (and tuples/group seems like
useful info for sizing).

While this seems unrelated to the patch discussed here, it's true that:

  (a) good estimate of the memory is important for initial estimate of
  batch count

  (b) dynamic increase of batch count alleviates issues from
  underestimating the amount of memory necessary for states


But let's leave this out of scope for the current patch.


regards
Tomas




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-13 Thread Claudio Freire
On Tue, Aug 12, 2014 at 8:00 PM, Bruce Momjian  wrote:
> On Mon, Aug 11, 2014 at 01:44:05PM -0700, Peter Geoghegan wrote:
>> On Mon, Aug 11, 2014 at 1:01 PM, Stephen Frost  wrote:
>> > We've got a clear example of someone, quite reasonably, expecting their
>> > JSONB object to be compressed using the normal TOAST mechanism, and
>> > we're failing to do that in cases where it's actually a win to do so.
>> > That's the focus of this discussion and what needs to be addressed
>> > before 9.4 goes out.
>>
>> Sure. I'm not trying to minimize that. We should fix it, certainly.
>> However, it does bear considering that JSON data, with each document
>> stored in a row is not an effective target for TOAST compression in
>> general, even as text.
>
> Seems we have two issues:
>
> 1)  the header makes testing for compression likely to fail
> 2)  use of pointers rather than offsets reduces compression potential

I do think the best solution for 2 is what's been proposed already, to
do delta-coding of the pointers in chunks (ie, 1 pointer, 15 deltas,
repeat).

But it does make binary search quite more complex.

Alternatively, it could be somewhat compressed as follows:

Segment = 1 pointer head, 15 deltas
Pointer head = pointers[0]
delta[i] = pointers[i] - pointers[0] for i in 1..15

(delta to segment head, not previous value)

Now, you can have 4 types of segments: 8, 16, 32, or 64 bits, which is
the size of the deltas. You achieve between 8x and 1x compression, and
even when 1x (no compression), you make it easier for pglz to find
something compressible.

Accessing it is also simple, if you have a segment index (tough part here).

Replace the 15 for something that makes such segment index very compact ;)
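
A tiny sketch of that layout and lookup, with made-up names and a fixed
32-bit delta just to show the access pattern (a real version would pick the
8/16/32/64-bit width per segment):

#include <stdint.h>

#define SEG_DELTAS 15           /* the "15" above; tune for a compact index */

typedef struct OffsetSegment
{
    uint32_t    head;               /* absolute offset of the first element */
    uint32_t    delta[SEG_DELTAS];  /* delta[i] = offset(i + 1) - head */
} OffsetSegment;

/* Absolute offset of element i, given the array of segments. */
static uint32_t
element_offset(const OffsetSegment *segments, int i)
{
    const OffsetSegment *seg = &segments[i / (SEG_DELTAS + 1)];
    int         k = i % (SEG_DELTAS + 1);

    return (k == 0) ? seg->head : seg->head + seg->delta[k - 1];
}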




Re: [HACKERS] Proposal: Incremental Backup

2014-08-13 Thread Claudio Freire
On Tue, Aug 12, 2014 at 8:26 PM, Stephen Frost  wrote:
> * Claudio Freire (klaussfre...@gmail.com) wrote:
>> I'm not talking about malicious attacks, with big enough data sets,
>> checksum collisions are much more likely to happen than with smaller
>> ones, and incremental backups are supposed to work for the big sets.
>
> This is an issue when you're talking about de-duplication, not when
> you're talking about testing if two files are the same or not for
> incremental backup purposes.  The size of the overall data set in this
> case is not relevant as you're only ever looking at the same (at most
> 1G) specific file in the PostgreSQL data directory.  Were you able to
> actually produce a file with a colliding checksum as an existing PG
> file, the chance that you'd be able to construct one which *also* has
> a valid page layout sufficient that it wouldn't be obviously massivly
> corrupted is very quickly approaching zero.

True, but only with a strong hash, not an adler32 or something like that.




Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

2014-08-13 Thread Tomas Vondra
On 13.8.2014 17:52, Tom Lane wrote:
> "Tomas Vondra"  writes:
>> So after 83 days, the regression tests on barnacle completed, and it
>> smells like another memory leak in CacheMemoryContext, similar to those
>> fixed in 078b2ed on May 18.
> 
> I've pushed fixes for the issues I was able to identify by running the
> create_view test.  I definitely don't have the patience to run all the
> tests this way, but if you do, please start a new run.
> 
> A couple thoughts about barnacle's configuration:
> 
> * It's not necessary to set -DCLOBBER_FREED_MEMORY or 
> -DMEMORY_CONTEXT_CHECKING
> in CPPFLAGS; those are already selected by --enable-cassert.  Removing
> those -D switches would suppress a whole boatload of compiler warnings.

I know - I already fixed this two months ago, but barnacle was still
running with the old settings (and I decided to let it run).

> * I'm a bit dubious about testing -DRANDOMIZE_ALLOCATED_MEMORY in the
> same build as -DCLOBBER_CACHE_RECURSIVELY, because each of these is
> darned expensive and it's not clear you'd learn anything by running
> them both together.  I think you might be better advised to run two
> separate buildfarm critters with those two options, and thereby perhaps
> get turnaround in something less than 80 days.

OK, I removed this for barnacle/addax/mite, let's see what's the impact.

FWIW We have three other animals running with CLOBBER_CACHE_ALWAYS +
RANDOMIZE_ALLOCATED_MEMORY, and it takes ~20h per branch. But maybe the
price when combined with CLOBBER_CACHE_RECURSIVELY is much higher.

> * It'd likely be a good idea to take out the TestUpgrade and TestDecoding
> modules from the config too.  Otherwise, we won't be seeing barnacle's
> next report before 2015, judging from the runtime of the check step
> compared to some of the other slow buildfarm machines.  (I wonder whether
> there's an easy way to skip the installcheck step, as that's going to
> require a much longer run than it can possibly be worth too.)

OK, I did this too.

I stopped the already running test on addax and started the test on
barnacle again. Let's see in a few days/weeks/months what is the result.

regards
Tomas




Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Heikki Linnakangas

On 08/13/2014 04:15 PM, Heikki Linnakangas wrote:

Hmm, thinking about this some more, there is one sensible way to split
this patch: We can add the XLogReplayBuffer() function and rewrite all
the redo routines to use it, without changing any WAL record formats or
anything in the way the WAL records are constructed. In the patch,
XLogReplayBuffer() takes one input argument, the block reference ID, and
it fetches the RelFileNode and BlockNumber of the block based on that.
Without the WAL format changes, the information isn't there in the
record, but we can require the callers to pass the RelFileNode and
BlockNumber. The final patch will remove those arguments from every
caller, but that's a very mechanical change.

As in the attached patch. I only modified the heapam redo routines to
use the new XLogReplayBuffer() idiom; the idea is to do that for every
redo routine.

After applying such a patch, the main WAL format changing patch becomes
much smaller, and makes it easier to see from the redo routines where
significant changes to the WAL record formats have been made. This also
allows us to split the bikeshedding; we can discuss the name of
XLogReplayBuffer() first :-).


Here's a full version of this refactoring patch, all the rmgr's have now 
been updated to use XLogReplayBuffer(). I think this is a worthwhile 
change on its own, even if we drop the ball on the rest of the WAL 
format patch, because it makes the redo-routines more readable. I 
propose to commit this as soon as someone has reviewed it, and we agree 
on a name for what's currently called XLogReplayBuffer(). I have tested this
with my page-image comparison tool.


- Heikki

commit 4c6c7571e91f34919aff2c2981fe474537dc557b
Author: Heikki Linnakangas 
Date:   Wed Aug 13 15:39:08 2014 +0300

Refactor redo routines to use XLogReplayBuffer

diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index ffabf4e..acbb840 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -20,25 +20,23 @@
 static MemoryContext opCtx;		/* working memory for operations */
 
 static void
-ginRedoClearIncompleteSplit(XLogRecPtr lsn, RelFileNode node, BlockNumber blkno)
+ginRedoClearIncompleteSplit(XLogRecPtr lsn, int block_index,
+			RelFileNode node, BlockNumber blkno)
 {
 	Buffer		buffer;
 	Page		page;
 
-	buffer = XLogReadBuffer(node, blkno, false);
-	if (!BufferIsValid(buffer))
-		return;	/* page was deleted, nothing to do */
-	page = (Page) BufferGetPage(buffer);
-
-	if (lsn > PageGetLSN(page))
+	if (XLogReplayBuffer(block_index, node, blkno, &buffer) == BLK_NEEDS_REDO)
 	{
+		page = (Page) BufferGetPage(buffer);
+
 		GinPageGetOpaque(page)->flags &= ~GIN_INCOMPLETE_SPLIT;
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
 	}
-
-	UnlockReleaseBuffer(buffer);
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
 }
 
 static void
@@ -351,26 +349,14 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record)
 		rightChildBlkno = BlockIdGetBlockNumber((BlockId) payload);
 		payload += sizeof(BlockIdData);
 
-		if (record->xl_info & XLR_BKP_BLOCK(0))
-			(void) RestoreBackupBlock(lsn, record, 0, false, false);
-		else
-			ginRedoClearIncompleteSplit(lsn, data->node, leftChildBlkno);
+		ginRedoClearIncompleteSplit(lsn, 0, data->node, leftChildBlkno);
 	}
 
-	/* If we have a full-page image, restore it and we're done */
-	if (record->xl_info & XLR_BKP_BLOCK(isLeaf ? 0 : 1))
+	if (XLogReplayBuffer(isLeaf ? 0 : 1,
+		 data->node, data->blkno, &buffer) == BLK_NEEDS_REDO)
 	{
-		(void) RestoreBackupBlock(lsn, record, isLeaf ? 0 : 1, false, false);
-		return;
-	}
-
-	buffer = XLogReadBuffer(data->node, data->blkno, false);
-	if (!BufferIsValid(buffer))
-		return;	/* page was deleted, nothing to do */
-	page = (Page) BufferGetPage(buffer);
+		page = BufferGetPage(buffer);
 
-	if (lsn > PageGetLSN(page))
-	{
 		/* How to insert the payload is tree-type specific */
 		if (data->flags & GIN_INSERT_ISDATA)
 		{
@@ -386,8 +372,8 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record)
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
 	}
-
-	UnlockReleaseBuffer(buffer);
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
 }
 
 static void
@@ -476,12 +462,7 @@ ginRedoSplit(XLogRecPtr lsn, XLogRecord *record)
 	 * split
 	 */
 	if (!isLeaf)
-	{
-		if (record->xl_info & XLR_BKP_BLOCK(0))
-			(void) RestoreBackupBlock(lsn, record, 0, false, false);
-		else
-			ginRedoClearIncompleteSplit(lsn, data->node, data->leftChildBlkno);
-	}
+		ginRedoClearIncompleteSplit(lsn, 0, data->node, data->leftChildBlkno);
 
 	flags = 0;
 	if (isLeaf)
@@ -607,29 +588,19 @@ ginRedoVacuumDataLeafPage(XLogRecPtr lsn, XLogRecord *record)
 	Buffer		buffer;
 	Page		page;
 
-	/* If we have a full-page image, restore it and we're done */
-	if (record->xl_info & XLR_BKP_BLOCK(0))
+	if (XLogReplayBuffer(0, xlrec->node, xlrec->blkno, &buffer) == BLK_NEEDS_REDO)
 	{
-		(void) RestoreBackupBlock(lsn, record, 0, false, false);
-	

[HACKERS] how to implement selectivity injection in postgresql

2014-08-13 Thread Rajmohan C
SELECT c1, c2, c3 FROM T1, T2, T3
WHERE T1.x = T2.x AND
T2.y=T3.y AND
T1.x >= ? selectivity 0.1 AND
T2.y > ? selectivity 0.5 AND
T3.z = ? selectivity 0.2 AND
T3.w = ?

I need to implement selectivity injection in PostgreSQL, as shown in the
above query, by which we can inject the selectivity of each predicate, or at
least the selectivity at the relation level, directly as part of the query.
Is there any ongoing work on this front? If there is no ongoing work on this,
how should I start implementing this feature?


Re: [HACKERS] failures on barnacle (CLOBBER_CACHE_RECURSIVELY) because of memory leaks

2014-08-13 Thread Tom Lane
"Tomas Vondra"  writes:
> So after 83 days, the regression tests on barnacle completed, and it
> smells like another memory leak in CacheMemoryContext, similar to those
> fixed in 078b2ed on May 18.

I've pushed fixes for the issues I was able to identify by running the
create_view test.  I definitely don't have the patience to run all the
tests this way, but if you do, please start a new run.

A couple thoughts about barnacle's configuration:

* It's not necessary to set -DCLOBBER_FREED_MEMORY or -DMEMORY_CONTEXT_CHECKING
in CPPFLAGS; those are already selected by --enable-cassert.  Removing
those -D switches would suppress a whole boatload of compiler warnings.

* I'm a bit dubious about testing -DRANDOMIZE_ALLOCATED_MEMORY in the
same build as -DCLOBBER_CACHE_RECURSIVELY, because each of these is
darned expensive and it's not clear you'd learn anything by running
them both together.  I think you might be better advised to run two
separate buildfarm critters with those two options, and thereby perhaps
get turnaround in something less than 80 days.

* It'd likely be a good idea to take out the TestUpgrade and TestDecoding
modules from the config too.  Otherwise, we won't be seeing barnacle's
next report before 2015, judging from the runtime of the check step
compared to some of the other slow buildfarm machines.  (I wonder whether
there's an easy way to skip the installcheck step, as that's going to
require a much longer run than it can possibly be worth too.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] option -T in pg_basebackup doesn't work on windows

2014-08-13 Thread MauMau

From: "Amit Kapila" 

During my recent work on pg_basebackup, I noticed that
the -T option doesn't seem to work on Windows.
The reason is that, while updating symlinks, it doesn't
consider that on Windows junction points can be
directories, due to which it is not able to update the
symlink location.
The fix is to make the code work like the symlink removal
code in destroy_tablespace_directories.  The attached
patch fixes the problem.


I could reproduce the problem on my Windows machine.

The code change appears correct, but the patch application failed against 
the latest source code.  I don't know why.  Could you confirm this?


patching file src/bin/pg_basebackup/pg_basebackup.c
Hunk #1 FAILED at 1119.
1 out of 1 hunk FAILED -- saving rejects to file 
src/bin/pg_basebackup/pg_basebackup.c.rej



On the following line, I think %d must be %u, because Oid is an unsigned 
integer.


 char*linkloc = psprintf("%s/pg_tblspc/%d", basedir, oid);
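That is, presumably the line should become:

 char *linkloc = psprintf("%s/pg_tblspc/%u", basedir, oid);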

Regards
MauMau




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strncpy is not a safe version of strcpy

2014-08-13 Thread Kevin Grittner
Tom Lane  wrote:
> Kevin Grittner  writes:
>
>> I am concerned that failure to check for truncation could allow
>> deletion of unexpected files or directories.
>
> I believe that we deal with this by the expedient of checking the
> lengths of tablespace paths in advance, when the tablespace is 
> created.

As long as it is covered.

I would point out that when strlcpy is used it returns a size_t
which can be directly compared to one of the arguments passed in
(in this case MAXPGPATH) to detect whether the name was truncated,
for the cost of an integer compare (probably in registers).  No
additional scan of the data is needed.
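For example, something along these lines (variable names here are only
illustrative):

    if (strlcpy(linkloc, location, MAXPGPATH) >= MAXPGPATH)
        ereport(ERROR,
                (errmsg("path \"%s\" is too long", location)));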

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strncpy is not a safe version of strcpy

2014-08-13 Thread Tom Lane
Kevin Grittner  writes:
> I am concerned that failure to check for truncation could allow
> deletion of unexpected files or directories.

I believe that we deal with this by the expedient of checking the lengths
of tablespace paths in advance, when the tablespace is created.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strncpy is not a safe version of strcpy

2014-08-13 Thread Heikki Linnakangas

On 08/13/2014 04:31 PM, Kevin Grittner wrote:

David Rowley  wrote:


I had a quick look at the usages of strncpy in master tonight and
I've really just picked out the obviously broken ones for now.
The other ones, on first look, either look safe, or require some
more analysis to see what's actually done with the string.

Does anyone disagree with the 2 changes in the attached?


I am concerned that failure to check for truncation could allow
deletion of unexpected files or directories.  While this is
probably not as dangerous as *executing* unexpected files, it seems
potentially problematic.  At the very least, a code comment
explaining why calling unlink on something which is not what
appears to be expected is not a problem there.

Some might consider it overkill, but I tend to draw a pretty hard
line on deleting or executing random files, even if the odds seem
to be that the mangled name won't find a match.  Granted, those
problems exist now, but without checking for truncation it seems to
me that we're just deleting *different* incorrect filenames, not
really fixing the problem.


strlcpy is clearly better than strncpy here, but I wonder if we should 
have yet another string copying function that throws an error instead of 
truncating, if the buffer is too small. What you really want in these 
cases is a "path too long" error.
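Something like this, perhaps (the name and error code are only
illustrative):

    /*
     * Like strlcpy(), but raise an error instead of silently truncating
     * when src does not fit into a buffer of the given size.
     */
    static void
    strlcpy_noloss(char *dst, const char *src, size_t size)
    {
        if (strlcpy(dst, src, size) >= size)
            ereport(ERROR,
                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                     errmsg("path too long: \"%s\"", src)));
    }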


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries

2014-08-13 Thread Pavel Stehule
2014-08-13 15:22 GMT+02:00 MauMau :

> From: "Pavel Stehule" 
>
>> 2014-08-13 13:59 GMT+02:00 MauMau :
>>
>>> Are you concerned about the impact of collection overhead on the queries
>>>
>>> diagnosed?  Maybe not light, but I'm optimistic.  Oracle has the track
>>> record of long use, and MySQL provides performance schema starting from
>>> 5.6.
>>>
>>
>>
>> Partially; I'm afraid about total performance (the impact on I/O). When we
>> use ordinary tables, any analysis without indexes is slow, so you need
>> indexes, and we cannot defer index updates. You also have to think about a
>> retention policy, and without partitioning you get massive deletes. So I
>> cannot imagine a table-based solution being used together with some higher
>> load. Our MVCC storage is not practical for storing insert-only data, and
>> a custom storage with no indexes would be a relatively big project.
>>
>> I would prefer the possibility to read the log via SQL (maybe some FDW)
>> rather than using tables for storing the log. These tables can become very
>> large in a few days, and we cannot simply write a specialized engine like
>> MySQL did.
>>
>
> I didn't mean performance statistics data to be stored in database tables.
> I just meant:
>
> * pg_stat_system_events is a view to show data on memory, which returns
> one row for each event across the system.  This is similar to
> V$SYSTEM_EVENT in Oracle.
>
> * pg_stat_session_events is a view to show data on memory, which returns
> one row for each event on one session.  This is similar to V$SESSION_EVENT
> in Oracle.
>
> * The above views represent the current accumulated data like other
> pg_stat_xxx views.
>
> * EXPLAIN ANALYZE and auto_explain shows all events for one query.  The
> lock waits you are trying to record in the server log is one of the events.
>

I am a little bit skeptical about a memory-only based structure. Is this
concept acceptable to committers?

Pavel


>
> Regards
> MauMau
>
>


Re: [HACKERS] ALTER TABLESPACE MOVE command tag tweak

2014-08-13 Thread Alvaro Herrera
Stephen Frost wrote:

> * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> > Stephen Frost wrote:

> > > Alright, sounds like this is more-or-less the concensus.  I'll see about
> > > making it happen shortly.
> > 
> > Were you able to work on this?
> 
> Apologies, I've been gone more-or-less all of July.  I'm back now and
> have time to spend on this.

Ping?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strncpy is not a safe version of strcpy

2014-08-13 Thread Kevin Grittner
David Rowley  wrote:

> I had a quick look at the usages of strncpy in master tonight and
> I've really just picked out the obviously broken ones for now.
> The other ones, on first look, either look safe, or require some
> more analysis to see what's actually done with the string.
>
> Does anyone disagree with the 2 changes in the attached?

I am concerned that failure to check for truncation could allow
deletion of unexpected files or directories.  While this is
probably not as dangerous as *executing* unexpected files, it seems
potentially problematic.  At the very least, a code comment
explaining why calling unlink on something which is not what
appears to be expected is not a problem there.

Some might consider it overkill, but I tend to draw a pretty hard
line on deleting or executing random files, even if the odds seem
to be that the mangled name won't find a match.  Granted, those 
problems exist now, but without checking for truncation it seems to 
me that we're just deleting *different* incorrect filenames, not 
really fixing the problem.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries

2014-08-13 Thread MauMau

From: "Pavel Stehule" 

2014-08-13 13:59 GMT+02:00 MauMau :

Are you concerned about the impact of collection overhead on the queries
diagnosed?  Maybe not light, but I'm optimistic.  Oracle has the track
record of long use, and MySQL provides performance schema starting from 
5.6.



Partially; I'm afraid about total performance (the impact on I/O). When we
use ordinary tables, any analysis without indexes is slow, so you need
indexes, and we cannot defer index updates. You also have to think about a
retention policy, and without partitioning you get massive deletes. So I
cannot imagine a table-based solution being used together with some higher
load. Our MVCC storage is not practical for storing insert-only data, and
a custom storage with no indexes would be a relatively big project.


I would prefer the possibility to read the log via SQL (maybe some FDW)
rather than using tables for storing the log. These tables can become very
large in a few days, and we cannot simply write a specialized engine like
MySQL did.


I didn't mean performance statistics data to be stored in database tables. 
I just meant:


* pg_stat_system_events is a view showing in-memory data, which returns one 
row for each event across the system.  This is similar to V$SYSTEM_EVENT in 
Oracle.


* pg_stat_session_events is a view showing in-memory data, which returns one 
row for each event on one session.  This is similar to V$SESSION_EVENT in 
Oracle.


* The above views represent the current accumulated data like other 
pg_stat_xxx views.


* EXPLAIN ANALYZE and auto_explain shows all events for one query.  The lock 
waits you are trying to record in the server log is one of the events.


Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Heikki Linnakangas

On 08/13/2014 02:07 PM, Andres Freund wrote:

On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote:

On 08/13/2014 01:04 AM, Andres Freund wrote:

* The patch mixes the API changes around WAL records with changes of how
   individual actions are logged. That makes it rather hard to review -
   and it's a 500kb patch already.

   I realize it's hard to avoid because the new API changes which
   information about blocks is available, but it does make it really hard
   to see whether the old/new code is doing something
   equivalent. It's hard to find more critical code than this :/


Yeah, I hear you. I considered doing this piecemeal, just adding the new
functions first so that you could still use the old XLogRecData API, until
all the functions have been converted. But in the end, I figured it's not
worth it, as sooner or later we'd want to convert all the functions anyway.


I think it might be worthwhile anyway. I'd be very surprised if there
aren't several significant bugs in the conversion. Your full page
checking tool surely helps to reduce the number, but it's not
foolproof. I can understand not wanting to do it though, it's a
significant amount of work.

Would you ask somebody else to do it in two steps?


Hmm, thinking about this some more, there is one sensible way to split 
this patch: We can add the XLogReplayBuffer() function and rewrite all 
the redo routines to use it, without changing any WAL record formats or 
anything in the way the WAL records are constructed. In the patch, 
XLogReplayBuffer() takes one input argument, the block reference ID, and 
it fetches the RelFileNode and BlockNumber of the block based on that. 
Without the WAL format changes, the information isn't there in the 
record, but we can require the callers to pass the RelFileNode and 
BlockNumber. The final patch will remove those arguments from every 
caller, but that's a very mechanical change.


As in the attached patch. I only modified the heapam redo routines to 
use the new XLogReplayBuffer() idiom; the idea is to do that for every 
redo routine.


After applying such a patch, the main WAL format changing patch becomes 
much smaller, and makes it easier to see from the redo routines where 
significant changes to the WAL record formats have been made. This also 
allows us to split the bikeshedding; we can discuss the name of 
XLogReplayBuffer() first :-).


- Heikki

commit 1a770baa3a3f293e8c592f0419d279b7b8bf7b66
Author: Heikki Linnakangas 
Date:   Wed Aug 13 15:39:08 2014 +0300

Refactor heapam.c redo routines to use XLogReplayBuffer

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d731f98..bf863af 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7137,15 +7137,13 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
 {
 	xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record);
 	Buffer		buffer;
-	Page		page;
-	OffsetNumber *end;
-	OffsetNumber *redirected;
-	OffsetNumber *nowdead;
-	OffsetNumber *nowunused;
-	int			nredirected;
-	int			ndead;
-	int			nunused;
-	Size		freespace;
+	Size		freespace = 0;
+	RelFileNode	rnode;
+	BlockNumber	blkno;
+	XLogReplayResult rc;
+
+	rnode = xlrec->node;
+	blkno = xlrec->block;
 
 	/*
 	 * We're about to remove tuples. In Hot Standby mode, ensure that there's
@@ -7156,65 +7154,62 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
 	 * latestRemovedXid is invalid, skip conflict processing.
 	 */
 	if (InHotStandby && TransactionIdIsValid(xlrec->latestRemovedXid))
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
-			xlrec->node);
+		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);
 
 	/*
 	 * If we have a full-page image, restore it (using a cleanup lock) and
 	 * we're done.
 	 */
-	if (record->xl_info & XLR_BKP_BLOCK(0))
-	{
-		(void) RestoreBackupBlock(lsn, record, 0, true, false);
-		return;
-	}
+	rc = XLogReplayBufferExtended(0, rnode, MAIN_FORKNUM, blkno,
+  RBM_NORMAL, true, &buffer);
+	if (rc == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(buffer);
+		OffsetNumber *end;
+		OffsetNumber *redirected;
+		OffsetNumber *nowdead;
+		OffsetNumber *nowunused;
+		int			nredirected;
+		int			ndead;
+		int			nunused;
+
+		nredirected = xlrec->nredirected;
+		ndead = xlrec->ndead;
+		end = (OffsetNumber *) ((char *) xlrec + record->xl_len);
+		redirected = (OffsetNumber *) ((char *) xlrec + SizeOfHeapClean);
+		nowdead = redirected + (nredirected * 2);
+		nowunused = nowdead + ndead;
+		nunused = (end - nowunused);
+		Assert(nunused >= 0);
+
+		/* Update all item pointers per the record, and repair fragmentation */
+		heap_page_prune_execute(buffer,
+redirected, nredirected,
+nowdead, ndead,
+nowunused, nunused);
+
+		freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
 
-	buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL);
-	if (!BufferIsValid(buffer

Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries

2014-08-13 Thread Pavel Stehule
2014-08-13 13:59 GMT+02:00 MauMau :

> From: "Pavel Stehule" 
>
>  isn't it too heavy?
>>
>
> Are you concerned about the impact of collection overhead on the queries
> diagnosed?  Maybe not light, but I'm optimistic.  Oracle has the track
> record of long use, and MySQL provides performance schema starting from 5.6.


Partially; I'm afraid about total performance (the impact on I/O). When we
use ordinary tables, any analysis without indexes is slow, so you need
indexes, and we cannot defer index updates. You also have to think about a
retention policy, and without partitioning you get massive deletes. So I
cannot imagine a table-based solution being used together with some higher
load. Our MVCC storage is not practical for storing insert-only data, and
a custom storage with no indexes would be a relatively big project.


>
>
>  I have just terrible negative experience with Vertica, where this design
>> is
>> used - almost all information about queries are available, but any query
>> to
>> related tables are terrible slow, so I am inclined to more simple design
>> oriented to log based solution. Table based solutions is not practical
>> when
>> you exec billions queries per day. I understand to motivation, but I
>> afraid
>> so it can be very expensive and slow on highly load servers.
>>
>
> Which do you mean by "query related to tables", the queries from
> applications being diagnosed, or the queries that diagnose the performance
> using statistics views?
>
> Could you elaborate on your experience with Vertica?  That trouble may be
> just because Vertica's implementation is not refined.
>
>
Sure - Vertica is not a mature database. Moreover, it has only one storage
type, optimized for OLAP, which is wrong for a large catalog and for working
with performance events too.


> I understand the feeling of being inclined to log based solution for its
> implementation simplicity.  However, the server log is difficult or
> impossible to access using SQL queries.  This prevents the development of
> performance diagnostics functionality in GUI administration tools.  Also,
> statistics views allow for easy access on PAAS like Amazon RDS and Heroku.
>

I would prefer the possibility to read the log via SQL (maybe some FDW)
rather than using tables for storing the log. These tables can become very
large in a few days, and we cannot simply write a specialized engine like
MySQL did.

Pavel


>
> Regards
> MauMau
>
>


Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries

2014-08-13 Thread MauMau

From: "Pavel Stehule" 

isn't it too heavy?


Are you concerned about the impact of collection overhead on the queries 
diagnosed?  Maybe not light, but I'm optimistic.  Oracle has the track 
record of long use, and MySQL provides performance schema starting from 5.6.


I have had a terrible negative experience with Vertica, where this design is
used: almost all information about queries is available, but any query
against the related tables is terribly slow, so I am inclined toward a
simpler design oriented to a log-based solution. Table-based solutions are
not practical when you execute billions of queries per day. I understand the
motivation, but I'm afraid it can be very expensive and slow on highly loaded
servers.


Which do you mean by "query related to tables", the queries from 
applications being diagnosed, or the queries that diagnose the performance 
using statistics views?


Could you elaborate on your experience with Vertica?  That trouble may be 
just because Vertica's implementation is not refined.


I understand the feeling of being inclined toward a log-based solution for 
its implementation simplicity.  However, the server log is difficult or 
impossible to access using SQL queries.  This prevents the development of 
performance diagnostics functionality in GUI administration tools.  Also, 
statistics views allow for easy access on PaaS like Amazon RDS and Heroku.


Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Andres Freund
On 2014-08-13 02:36:59 +0300, Heikki Linnakangas wrote:
> On 08/13/2014 01:04 AM, Andres Freund wrote:
> >* I'm distinctly not a fan of passing around both the LSN and the struct
> >   XLogRecord to functions. We should bit the bullet and use something
> >   encapsulating both.
> 
> You mean, the redo functions? Seems fine to me, and always it's been like
> that...

Meh. By that argument we could just not do this patch :)

> >* XLogReplayBuffer imo isn't a very good name - the buffer isn't
> >   replayed. If anything it's sometimes replaced by the backup block, but
> >   the real replay still happens outside of that function.
> 
> I agree, but haven't come up with a better name.

XLogRestoreBuffer(), XLogRecoverBuffer(), XLogReadBufferForReplay(), ...?

> >* Why do we need XLogBeginInsert(). Right now this leads to uglyness
> >   like duplicated if (RelationNeedsWAL()) wal checks and
> >   XLogCancelInsert() of inserts that haven't been started.
> 
> I like clearly marking the beginning of the insert, with XLogBeginInsert().
> We certainly could design it so that it's not needed, and you could just
> start registering stuff without calling XLogBeginInsert() first. But I
> believe it's more robust this way. For example, you will get an error at the
> next XLogBeginInsert(), if a previous operation had already registered some
> data without calling XLogInsert().

I'm coming around to that view - especially as accidentally nesting xlog
insertions is a real concern with the new API.

> >And if we have Begin, why do we also need Cancel?
> 
> We could just automatically "cancel" any previous insertion when
> XLogBeginInsert() is called, but again it seems like bad form to leave
> references to buffers and data just lying there, if an operation bails out
> after registering some stuff and doesn't finish the insertion.
> 
> >* "XXX: do we need to do this for every page?" - yes, and it's every
> >   tuple, not every page... And It needs to be done *after* the tuple has
> >   been put on the page, not before. Otherwise the location will be wrong.
> 
> That comment is in heap_multi_insert(). All the inserted tuples have the
> same command id, seems redundant to log multiple NEW_CID records for them.

Yea, I know it's in heap_multi_insert(). The tuples have the same cid,
but they don't have the same tid. Or rather, they wouldn't if
log_heap_new_cid() were called after the PutHeapTuple(). Note that this
won't trigger for any normal user defined relations.

> >* The patch mixes the API changes around WAL records with changes of how
> >   individual actions are logged. That makes it rather hard to review -
> >   and it's a 500kb patch already.
> >
> >   I realize it's hard to avoid because the new API changes which
> >   information about blocks is available, but it does make it really hard
> >   to see whether the old/new code is doing something
> >   equivalent. It's hard to find more critical code than this :/
> 
> Yeah, I hear you. I considered doing this piecemeal, just adding the new
> functions first so that you could still use the old XLogRecData API, until
> all the functions have been converted. But in the end, I figured it's not
> worth it, as sooner or later we'd want to convert all the functions anyway.

I think it might be worthwhile anyway. I'd be very surprised if there
aren't several significant bugs in the conversion. Your full page
checking tool surely helps to reduce the number, but it's not
foolproof. I can understand not wanting to do it though, it's a
significant amount of work.

Would you ask somebody else to do it in two steps?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-08-13 Thread Andres Freund
On 2014-08-13 09:51:58 +0530, Amit Kapila wrote:
> Overall, the main changes required in patch as per above feedback
> are:
> 1. add an additional counter for the number of those
> allocations not satisfied from the free list, with a
> name like buffers_alloc_clocksweep.
> 2. Autotune the low and high threshold values for buffers
> in freelist. In the patch, I have kept them as hard-coded
> values.
> 3. For populating freelist, have a separate process (bgreclaim)
> instead of doing it by bgwriter.

I'm not convinced that 3) is the right way to go to be honest. Seems
like a huge bandaid to me.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Enable WAL archiving even in standby

2014-08-13 Thread Fujii Masao
Hi,

I'd propose the attached WIP patch which allows us to enable WAL archiving
even in standby. The patch adds "always" as the valid value of archive_mode.
If it's set to "always", the archiver is started when the server is in standby
mode and all the WAL files that walreceiver wrote to the disk are archived by
using archive_command. Then, even after the server is promoted to master,
the archiver keeps archiving WAL files. The patch doesn't change the meanings
of the setting values "on" and "off" of archive_mode.

I think that this feature is useful for the case where, e.g., a large database
needs to be replicated between remote servers. Imagine the situation where
the replicated database gets corrupted completely in the remote standby.
How should we address this problematic situation and restart the standby?

One approach is to take a fresh backup from the master and restore it onto
the standby. But since the database is large and there is a long distance
between the two servers, this approach might take a surprisingly long time.

Another approach is to restore a backup which was taken from the standby
before. But most of the WAL files which that backup needs might exist only
in the master (because WAL archiving cannot be enabled in the standby), and
they would need to be transferred from the master to the standby via the
long-distance network. So I think that this approach would also take a
fairly long time. To shorten that time, you may think that archive_command
in the master can be set so that it transfers WAL files from the master to
the standby's archival storage. I agree that this setting can accelerate the
database restore process. But it causes every WAL file to be transferred
between the remote servers twice (once by streaming replication and once by
archive_command), which is a waste of network bandwidth.

Enabling WAL archiving in the standby is one solution for this situation. We
can expect that most of the WAL files that are required for the backup taken
from the standby would exist in the standby's archival storage.
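With the patch, I imagine the standby-side setup would look roughly like
this (the archive_command here is only an example):

    # postgresql.conf on the standby
    archive_mode = always
    archive_command = 'cp %p /path/to/standby_archive/%f'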

Back to the patch. If archive_mode is set to "always", archive_command is
always used to archive WAL files, even during recovery. Do we need to separate
the command into two, for master and standby respectively? We could add
something like a standby_archive_command parameter which is used to archive
only the WAL files that walreceiver writes; the other WAL files would be
archived by archive_command. I'm not sure whether it's really worth separating
the command that way. Is there any use case?

The patch doesn't allow us to enable WAL archiving *only* during recovery.
Should we support yet another archive_mode like "standby", which allows
the archiver to run only during recovery but makes it end just after
the server is promoted to master? I'm not sure whether there is really a use
case for that.

I've not included the update of document in the patch yet. If we agree to
support this feature, I will do the remaining work.

Regards,

-- 
Fujii Masao
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 81,87  int			CheckPointSegments = 3;
  int			wal_keep_segments = 0;
  int			XLOGbuffers = -1;
  int			XLogArchiveTimeout = 0;
! bool		XLogArchiveMode = false;
  char	   *XLogArchiveCommand = NULL;
  bool		EnableHotStandby = false;
  bool		fullPageWrites = true;
--- 81,87 
  int			wal_keep_segments = 0;
  int			XLOGbuffers = -1;
  int			XLogArchiveTimeout = 0;
! int			XLogArchiveMode = ARCHIVE_MODE_OFF;
  char	   *XLogArchiveCommand = NULL;
  bool		EnableHotStandby = false;
  bool		fullPageWrites = true;
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***
*** 89,94 
--- 89,95 
  
  #include "access/transam.h"
  #include "access/xlog.h"
+ #include "access/xlogarchive.h"
  #include "bootstrap/bootstrap.h"
  #include "catalog/pg_control.h"
  #include "lib/ilist.h"
***
*** 823,831  PostmasterMain(int argc, char *argv[])
  		write_stderr("%s: max_wal_senders must be less than max_connections\n", progname);
  		ExitPostmaster(1);
  	}
! 	if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
  		ereport(ERROR,
! (errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
  	if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
  		ereport(ERROR,
  (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
--- 824,832 
  		write_stderr("%s: max_wal_senders must be less than max_connections\n", progname);
  		ExitPostmaster(1);
  	}
! 	if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL)
  		ereport(ERROR,
! (errmsg("WAL archival (archive_mode=on/always) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
  	if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
  		ereport(ERROR,
  (errmsg("WAL streaming (max_wal_senders > 0) requires w

Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-08-13 Thread Dilip kumar
On 11 August 2014 10:29, Amit kapila wrote,


1. I have fixed all the review comments except a few; the modified patch is 
attached.

2. For the comments that are not fixed, please find my inline replies in the mail.

>1.
>+Number of parallel connections to perform the operation. This option 
>will enable the vacuum
>+operation to run on parallel connections, at a time one table will be 
>operated on one connection.
>
>a. How about describing w.r.t asynchronous connections
>instead of parallel connections?
>b. It is better to have line length as lesser than 80.
>c. As you are using multiple connections to achieve parallelism,
>I suggest you add a line in your description indicating user should
>verify max_connections parameters.  Something similar to pg_dump:
>
>2.
>+So at one time as many tables will be vacuumed parallely as number of 
>jobs.
>
>can you briefly mention about the case when number of jobs
>is more than number of tables?


1 and 2 ARE FIXED in DOC.


>3.
>+ /* When user is giving the table list, and list is smaller then
>+ * number of tables
>+ */
>+ if (tbl_count && (parallel > tbl_count))
>+ parallel = tbl_count;
>+
>
>Again here multiline comments are wrong.
>
>Some other instances are as below:
>a.
>/* This will give the free connection slot, if no slot is free it will
>* wait for atleast one slot to get free.
>*/
>b.
>/* if table list is not provided then we need to do vaccum for whole DB
>* get the list of all tables and prpare the list
>*/
>c.
>/* Some of the slot are free, Process the results for slots whichever
>*  are free
>*/

COMMENTS are FIXED


>4.
>src/bin/scripts/vacuumdb.c:51: indent with spaces.
>+ bool analyze_only, bool freeze, PQExpBuffer sql);
>src/bin/scripts/vacuumdb.c:116: indent with spaces.
>+int parallel = 0;
>src/bin/scripts/vacuumdb.c:198: indent with spaces.
>+optind++;
>src/bin/scripts/vacuumdb.c:299: space before tab in indent.
>+   vacuum_one_database(dbname, full, verbose, and_analyze,
>
>There are lot of redundant whitespaces, check with
>git diff --check and fix them.

ALL are FIXED

>
>5.
>res = executeQuery(conn,
>"select relname, nspname from pg_class c, pg_namespace ns"
>" where (relkind = \'r\' or relkind = \'m\')"
>" and c.relnamespace = ns.oid order by relpages desc",
>progname, echo);
>
>a. Here you need to use SQL keywords in capital letters, refer one

>of the other caller of executeQuery() in vacuumdb.c
>b. Why do you need this condition c.relnamespace = ns.oid in above
>query?

IT IS POSSIBLE THAT TWO NAMESPACES HAVE THE SAME TABLE NAME, SO WHEN WE SEND 
THE COMMAND FROM THE CLIENT WE NEED TO QUALIFY TABLES BY NAMESPACE, BECAUSE WE 
NEED TO VACUUM ALL THE TABLES (OTHERWISE TWO TABLES WITH THE SAME NAME IN 
DIFFERENT NAMESPACES WOULD BE TREATED AS THE SAME TABLE).

>I think to get the list of required objects from pg_class, you don't
>need to have a join with pg_namespace.

DONE

>6.
>vacuum_parallel()
>{
>..
>if (!tables || !tables->head)
>{
>..
>tbl_count++;
>}
>..
>}
>
>a. Here why you need a separate variable (tbl_count) to verify number
>asynchronous/parallel connections you want, why can't we use ntuple?
>b. there is a warning in code (I have compiled it on windows) as well
>related to this variable.
>vacuumdb.c(695): warning C4700: uninitialized local variable 'tbl_count' used

Variable REMOVED

>
>7.
>Fix for one of my previous comment is as below:
>GetQueryResult()
>{
>..
>if (!r && !completedb)
>..
>}
>
>Here I think some generic errors like connection broken or others
>will also get ignored.  Is it possible that we can ignore particular
>error which we want to ignore without complicating the code?
>
>Also in anycase add comments to explain why you are ignoring
>error for particular case.

Here we only get the message string; I think if we need to find the error code 
then we need to parse the string and then compare it with the error codes.
Is there any other way to do this?
Comments are added.


>8.
>+ fprintf(stderr, _("%s: Number of parallel \"jobs\" should be at least 1\n"),
>+ progname);
>formatting of 2nd line progname is not as per standard (you can refer other 
>fprintf in the same file).

DONE

>9. + int parallel = 0;
>I think it is better to name it as numAsyncCons or something similar.

CHANGED as PER SUGGESTION

>10. It is better if you can add function header for newly added
>functions.

ADDED
>>
>> Test2: (as per your scenario, where actual vacuum time is very less)
>>
>> Vacuum done for complete DB
>>
>> 8 tables all with 1 records and few dead tuples
>
>I think this test is missing in attached file.  Few means?
>Can you try with 0.1%, 1% of dead tuples in table and try to
>take time in milliseconds if it is taking less time to complete
>the test.

TESTED with 1%, 0.1% and 0.01% dead tuples; the results are as follows

1.   With 1% (file test1%.sql)


Base Code  :  22.26 s
2 Threads   : 12.82 s
4 Threads  : 9.

Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-13 Thread Tomas Vondra
On 13 Srpen 2014, 7:02, Jeff Davis wrote:
> On Tue, 2014-08-12 at 14:58 +0200, Tomas Vondra wrote:
>> CREATE AGGREGATE myaggregate (
>> ...
>> SERIALIZE_FUNC = 'dump_data',
>> DESERIALIZE_FUNC = 'read_data',
>> ...
>> );
>
> Seems reasonable.
>
>> I don't see why it should get messy? In the end, you have a chunk of
>> data and a hash for it.
>
> Perhaps it's fine; I'd have to see the approach.
>
>> It just means you need to walk through the hash table, look at the
>> hashes and dump ~50% of the groups to a file.
>
> If you have fixed-size states, why would you *want* to remove the group?
> What is gained?

You're right that for your batching algorithm (based on lookups), that's
not really needed, and keeping everything in memory is a good initial
approach.

My understanding of the batching algorithm (and I may be wrong on this
one) is that once you choose the number of batches, it's pretty much
fixed. Is that the case?

But what will happen in case of significant cardinality underestimate?
I.e. what will happen if you decide to use 16 batches, and then find
out 256 would be more appropriate? I believe you'll end up with batches
16x the size you'd want, most likely exceeding work_mem.

Do I understand that correctly?

But back to the removal of aggregate states from memory (irrespectedly
of the size) - this is what makes the hashjoin-style batching possible,
because it:

   (a) makes the batching decision simple (peeking at hash)
   (b) makes it possible to repeatedly increase the number of batches
   (c) provides a simple trigger for the increase of batch count

Some of this might be achievable even with keeping the states in memory.
I mean, you can add more batches on the fly, and handle this similarly
to hash join, while reading tuples from the batch (moving the tuples to
the proper batch, if needed).
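To illustrate what I mean by peeking at the hash, the hashjoin-style batching
decision is roughly this (only a sketch, not the actual executor code; the
names are made up):

    /*
     * Sketch: choose the batch for a group purely from its hash value.
     * nbatch is a power of two, so doubling it only splits each batch in
     * two, and entries can later be moved to the proper batch by looking
     * at one more bit of the hash.
     */
    static int
    batch_for_hash(uint32 hashvalue, int nbatch, int log2_nbuckets)
    {
        if (nbatch == 1)
            return 0;
        /* low bits select the bucket, higher bits select the batch */
        return (hashvalue >> log2_nbuckets) & (nbatch - 1);
    }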

The problem is that once you have the memory full, there's no trigger
to alert you that you should increase the number of batches again.

> One thing I like about my simple approach is that it returns a good
> number of groups after each pass, and then those are completely finished
> (returned to the operator above, even). That's impossible with HashJoin
> because the hashing all needs to be done before the probe phase begins.

The hash-join approach returns ~1/N groups after each pass, so I fail to
see how this is better?

> The weakness of my approach is the array_agg case that you mention,
> because this approach doesn't offer a way to dump out transition states.
> It seems like that could be added later, but let me know if you see a
> problem there.

Right. Let's not solve this in the first version of the patch.

>> I think you're missing the point, here. You need to compute the hash in
>> both cases. And then you either can do a lookup or just peek at the
>> first
>> few bits of the hash to see whether it's in the current batch or not.
>
> I understood that. The point I was trying to make (which might or might
> not be true) was that: (a) this only matters for a failed lookup,
> because a successful lookup would just go in the hash table anyway; and
> (b) a failed lookup probably doesn't cost much compared to all of the
> other things that need to happen along that path.

OK. I don't have numbers proving otherwise at hand, and you're probably
right that once the batching kicks in, the other parts are likely more
expensive than this.

> I should have chosen a better example though. For instance: if the
> lookup fails, we need to write the tuple, and writing the tuple is sure
> to swamp the cost of a failed hash lookup.
>
>> is much faster than a lookup. Also, as the hash table grows (beyond L3
>> cache size, which is a few MBs today), it becomes much slower in my
>> experience - that's one of the lessons I learnt while hacking on the
>> hashjoin. And we're dealing with hashagg not fitting into work_mem, so
>> this seems to be relevant.
>
> Could be, but this is also the path that goes to disk, so I'm not sure
> how significant it is.

It may or may not go to the disk, actually. The fact that you're doing
batching means it's written to a temporary file, but with large amounts
of RAM it may not get written to disk.

That's because the work_mem is only a very soft guarantee - a query may
use multiple work_mem buffers in a perfectly legal way. So users tend
to set this rather conservatively. For example we have >256GB of RAM in
each machine, usually <24 queries running at the same time and yet we
have only work_mem=800MB. On the few occasions when a hash join is
batched, it usually remains in the page cache and never actually gets written
to disk. Or maybe it gets written, but it's still in the page cache so
the backend never notices that.

It's true there are other costs though - I/O calls, etc. So it's not free.

>
>> > Because I suspect there are costs in having an extra file around that
>> > I'm not accounting for directly. We are implicitly assuming that the
>> OS
>> > will keep around e

Re: [HACKERS] strncpy is not a safe version of strcpy

2014-08-13 Thread David Rowley
On Wed, Aug 13, 2014 at 3:19 PM, Noah Misch  wrote:

> On Sat, Nov 16, 2013 at 12:53:10PM +1300, David Rowley wrote:
> > I went on a bit of a strncpy cleanup rampage this morning and ended up
> > finding quite a few places where strncpy is used wrongly.
> > I'm not quite sure if I have got them all in this patch, but I' think
> I've
> > got the obvious ones at least.
> >
> > For the hash_search in jsconfuncs.c after thinking about it a bit more...
> > Can we not just pass the attname without making a copy of it? I see
> keyPtr
> > in hash_search is const void * so it shouldn't get modified in there. I
> > can't quite see the reason for making the copy.
>
> +1 for the goal of this patch.  Another commit took care of your
> jsonfuncs.c
> concerns, and the patch for CVE-2014-0065 fixed several of the others.
>  Plenty
> remain, though.
>
>
Thanks for taking interest in this.
I had a quick look at the usages of strncpy in master tonight and I've
really just picked out the obviously broken ones for now. The other ones,
on first look, either look safe, or require some more analysis to see
what's actually done with the string.

I think this is likely best tackled in small increments anyway.

Does anyone disagree with the 2 changes in the attached?

Regards

David Rowley


strncpy_fixes_pass1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries

2014-08-13 Thread Pavel Stehule
2014-08-13 11:14 GMT+02:00 MauMau :

> From: "Pavel Stehule" 
>
>  There are two relative independent tasks
>>
>> a) monitor and show total lock time of living queries
>>
>> b) monitor and log total lock time of executed queries.
>>
>> I am interested by @b now. When we work with slow query log, then we would
>> to identify reason for long duration. Locks are important source of these
>> queries on some systems.
>>
>
> I'm interested in b, too.  I was thinking of proposing a performance
> diagnostics feature like Oracle's wait events (V$SYSTEM_EVENT and
> V$SESSION_EVENT).  So, if you do this, I'd like to contribute to the
> functional design, code and doc review, and testing.
>

Isn't it too heavy?

I have had a terrible negative experience with Vertica, where this design is
used: almost all information about queries is available, but any query
against the related tables is terribly slow, so I am inclined toward a
simpler design oriented to a log-based solution. Table-based solutions are
not practical when you execute billions of queries per day. I understand the
motivation, but I'm afraid it can be very expensive and slow on highly loaded
servers.


>
> The point is to collect as much information about bottlenecks as possible,
> including lock waits.  The rough sketch is:
>
> What info to collect:
> * heavyweight lock waits shown by pg_locks
> * lightweight lock waits
> * latch waits
> * socket waits (mainly for client input)
>




>
> How the info is delivered:
> * pg_stat_system_events shows the accumulated total across the server
> instance
> * pg_stat_session_events shows the accumulated total for each session
> * EXPLAIN ANALYZE and auto_explain shows the accumulated total for each
> query
>
> We need to describe in the manual how to diagnose and tune the system with
> this event info.
>
> Regards
> MauMau
>
>


Re: [HACKERS] proposal for 9.5: monitoring lock time for slow queries

2014-08-13 Thread MauMau

From: "Pavel Stehule" 

There are two relative independent tasks

a) monitor and show total lock time of living queries

b) monitor and log total lock time of executed queries.

I am interested by @b now. When we work with slow query log, then we would
to identify reason for long duration. Locks are important source of these
queries on some systems.


I'm interested in b, too.  I was thinking of proposing a performance 
diagnostics feature like Oracle's wait events (V$SYSTEM_EVENT and 
V$SESSION_EVENT).  So, if you do this, I'd like to contribute to the 
functional design, code and doc review, and testing.


The point is to collect as much information about bottlenecks as possible, 
including lock waits.  The rough sketch is:


What info to collect:
* heavyweight lock waits shown by pg_locks
* lightweight lock waits
* latch waits
* socket waits (mainly for client input)

How the info is delivered:
* pg_stat_system_events shows the accumulated total across the server 
instance

* pg_stat_session_events shows the accumulated total for each session
* EXPLAIN ANALYZE and auto_explain shows the accumulated total for each 
query


We need to describe in the manual how to diagnose and tune the system with 
this event info.
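For example, I imagine a diagnosis session would start with a query like the
following (the view and column names here are only a rough idea, not a final
design):

    SELECT event_name, total_waits, total_wait_time
      FROM pg_stat_system_events
     ORDER BY total_wait_time DESC
     LIMIT 10;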


Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback

2014-08-13 Thread furuyao
> I don't think that it's good idea to control that behavior by using
> --status-interval. I'm sure that there are some users who both want that
> behavior and want set the maximum interval between a feedback is sent
> back to the server because these settings are available in walreceiver.
> But your version of --status-interval doesn't allow those settings at
> all. That is, if set to -1, the maximum interval between each feedback
> cannot be set. OTOH, if set to the positive number,
> feedback-as-soon-as-fsync behavior cannot be enabled. Therefore, I'm
> thinking that it's better to introduce new separate option for that
> behavior.

Thanks for the review!
This patch splits the behavior into a separate option, as you pointed out.

If the -r option is specified, a status packet is sent to the server as soon as the fsync completes.
Otherwise, status packets are sent to the server at the spacing of --status-interval.
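For example, the intended invocation would presumably look something like this
(the directory path is just an example; the option letter is as described above):

    pg_receivexlog -D /path/to/walarchive -r --status-interval=10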

Regards,

--
Furuya Osamu


pg_receivexlog-fsync-feedback-v2.patch
Description: pg_receivexlog-fsync-feedback-v2.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Michael Paquier
On Wed, Aug 13, 2014 at 4:49 PM, Michael Paquier
 wrote:
> Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c,
> it is not weird. This approach has the merit to make XLogRecData
> completely bundled with the insertion and construction of the WAL
> records. Then for the name xloginsert.c is fine for me too.
At the same time, renaming XLogInsert to XLogCompleteInsert or
XLogFinishInsert would be nice to make the difference from
XLogBeginInsert clear. This could include XLogCancel being renamed to
XLogCancelInsert.

Using an XLogInsert* prefix for those functions would make things
more consistent as well.

But feel free to discard those ideas if you do not like that.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL format and API changes (9.5)

2014-08-13 Thread Michael Paquier
On Tue, Aug 12, 2014 at 6:44 PM, Heikki Linnakangas
 wrote:
> On 08/05/2014 03:50 PM, Michael Paquier wrote:
>>
>> - When testing pg_xlogdump I found an assertion failure for record
>> XLOG_GIN_INSERT:
>> Assertion failed: (((bool) (((const void*)(&insertData->newitem.key) !=
>> ((void*)0)) && ((&insertData->newitem.key)->ip_posid != 0, function
>> gin_desc, file gindesc.c, line 127.
>
>
> I could not reproduce this. Do you have a test case?
Indeed. I am not seeing anything now for this record. Perhaps I was
using an incorrect version of pg_xlogdump.

>> - Don't we need a call to XLogBeginInsert() in XLogSaveBufferForHint():
>> +   int flags = REGBUF_FORCE_IMAGE;
>> +   if (buffer_std)
>> +   flags |= REGBUF_STANDARD;
>> +   XLogRegisterBuffer(0, buffer, flags);
>> [...]
>> -   recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI, rdata);
>> +   recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
>
>
> Indeed, fixed. With that, "initdb -k" works, I had not tested the patch with
> page checksums before.
Thanks for the fix. Yes now this works.

>> - XLogRecGetBlockRefIds needing an already-allocated array for *out is not
>> user-friendly. Cannot we just do all the work inside this function?
>
>
> I agree it's not a nice API. Returning a palloc'd array would be nicer, but
> this needs to work outside the backend too (e.g. pg_xlogdump). It could
> return a malloc'd array in frontend code, but it's not as nice. On the
> whole, do you think that be better than the current approach?

A returned palloc'd array is nicer for the user, and I would rather
rewrite the API like that, having it return the array instead:
int *XLogRecGetBlockRefIds(XLogRecord *record, int *num_array)
Alvaro has also outlined that in the FRONTEND context pfree and palloc
would work as pg_free and pg_malloc (didn't recall it before he
mentioned it btw).
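A caller (e.g. pg_xlogdump) could then do something like this (just a sketch
of the calling convention):

    int         nrefs;
    int         i;
    int        *blockrefs = XLogRecGetBlockRefIds(record, &nrefs);

    for (i = 0; i < nrefs; i++)
        printf(" block ref %d", blockrefs[i]);
    pfree(blockrefs);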

>> - All the functions in xlogconstruct.c could be defined in a separate
>> header xlogconstruct.h. What they do is rather independent from xlog.h.
>> The
>> same applies to all the structures and functions of xlogconstruct.c in
>> xlog_internal.h: XLogRecordAssemble, XLogRecordAssemble,
>> InitXLogRecordConstruct. (worth noticing as well that the only reason
>> XLogRecData is defined externally of xlogconstruct.c is that it is being
>> used in XLogInsert()).
>
>
> Hmm. I left the defines for xlogconstruct.c functions that are used to build
> a WAL record in xlog.h, so that it's not necessary to include both xlog.h
> and xlogconstruct.h in all files that write a WAL record (XLogInsert() is
> defined in xlog.h).
>
> But perhaps it would be better to move the prototype for XLogInsert to
> xlogconstruct.h too? That might be a better division; everything needed to
> insert a WAL record would be in xlogconstruct.h, and xlog.h would left for
> more internal functions related to WAL. And perhaps rename the files to
> xloginsert.c and xloginsert.h.

Yes indeed. As XLogBeginInsert() is already part of xlogconstruct.c,
it is not weird. This approach has the merit to make XLogRecData
completely bundled with the insertion and construction of the WAL
records. Then for the name xloginsert.c is fine for me too.

Among the things changed since v5, v6 is introducing a safety limit to
the maximum number of backup blocks within a single WAL record:
#define XLR_MAX_BKP_BLOCKS 256
XLogRecordBlockData is updated to use uint8 instead of char to
identify the block id, and XLogEnsureRecordSpace fails if more than
XLR_MAX_BKP_BLOCKS are wanted by the caller. I am fine with this
change, just mentioning it.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers

2014-08-13 Thread Michael Paquier
On Wed, Aug 13, 2014 at 2:10 PM, Fujii Masao  wrote:
> I sent the SIGSTOP signal to the walreceiver process in one of sync standbys,
> and then ran write transactions again. In this case, they must not be 
> completed
> because their WAL cannot be replicated to the standby that its walreceiver
> was stopped. But they were successfully completed.

At the end of SyncRepReleaseWaiters, SYNC_REP_WAIT_WRITE and
SYNC_REP_WAIT_FLUSH in walsndctl could be updated with only one WAL
sender in sync, making backends wake up even if the other standbys had not
caught up. But we need to scan all the synchronous WAL senders, find
the minimum write and flush positions, and update walsndctl with those
values. Well, that's a code path I forgot to cover.

Attached is an updated patch fixing the problem you reported.
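For reference, with the patch a setup with two synchronous standbys would be
configured along these lines (parameter names as introduced in the patch):

    synchronous_standby_names = 'node_a, node_b, node_c'
    synchronous_standby_num = 2    # the first two connected standbys are synchronous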

Regards,
-- 
Michael
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 2586,2597  include_dir 'conf.d'
  Specifies a comma-separated list of standby names that can support
  synchronous replication, as described in
  .
! At any one time there will be at most one active synchronous standby;
! transactions waiting for commit will be allowed to proceed after
! this standby server confirms receipt of their data.
! The synchronous standby will be the first standby named in this list
! that is both currently connected and streaming data in real-time
! (as shown by a state of streaming in the
  
  pg_stat_replication view).
  Other standby servers appearing later in this list represent potential
--- 2586,2598 
  Specifies a comma-separated list of standby names that can support
  synchronous replication, as described in
  .
! At any one time there will be at a number of active synchronous standbys
! defined by synchronous_standby_num; transactions waiting
! for commit will be allowed to proceed after those standby servers
! confirms receipt of their data. The synchronous standbys will be
! the first entries named in this list that are both currently connected
! and streaming data in real-time (as shown by a state of
! streaming in the
  
  pg_stat_replication view).
  Other standby servers appearing later in this list represent potential
***
*** 2627,2632  include_dir 'conf.d'
--- 2628,2652 

   
  
+  
+   synchronous_standby_num (integer)
+   
+synchronous_standby_num configuration parameter
+   
+   
+   
+
+ Specifies the number of standbys that support
+ synchronous replication, as described in
+ , and listed as the first
+ elements of .
+
+
+ Default value is 1.
+
+   
+  
+ 
   
vacuum_defer_cleanup_age (integer)

*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***
*** 1081,1092  primary_slot_name = 'node_a_slot'
  WAL record is then sent to the standby. The standby sends reply
  messages each time a new batch of WAL data is written to disk, unless
  wal_receiver_status_interval is set to zero on the standby.
! If the standby is the first matching standby, as specified in
! synchronous_standby_names on the primary, the reply
! messages from that standby will be used to wake users waiting for
! confirmation that the commit record has been received. These parameters
! allow the administrator to specify which standby servers should be
! synchronous standbys. Note that the configuration of synchronous
  replication is mainly on the master. Named standbys must be directly
  connected to the master; the master knows nothing about downstream
  standby servers using cascaded replication.
--- 1081,1092 
  WAL record is then sent to the standby. The standby sends reply
  messages each time a new batch of WAL data is written to disk, unless
  wal_receiver_status_interval is set to zero on the standby.
! If the standby is the first synchronous_standby_num matching
! standbys, as specified in synchronous_standby_names on the
! primary, the reply messages from that standby will be used to wake users
! waiting for confirmation that the commit record has been received. These
! parameters allow the administrator to specify which standby servers should
! be synchronous standbys. Note that the configuration of synchronous
  replication is mainly on the master. Named standbys must be directly
  connected to the master; the master knows nothing about downstream
  standby servers using cascaded replication.
***
*** 1169,1177  primary_slot_name = 'node_a_slot'
  The best solution for avoiding data loss is to ensure you don't lose
  your last remaining synchronous stan