Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-16 Thread Arthur Silva
On Fri, Aug 15, 2014 at 8:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Arthur Silva arthur...@gmail.com writes:
  We should add some sort of versioning to the jsonb format. This can be
  explored in the future in many ways.

 If we end up making an incompatible change to the jsonb format, I would
 support taking the opportunity to stick a version ID in there.  But
 I don't want to force a dump/reload cycle *only* to do that.

  As for the current problem, we should explore the directory-at-the-end
  option. It should improve compression and keep good access performance.

 Meh.  Pushing the directory to the end is just a band-aid, and since it
 would still force a dump/reload, it's not a very enticing band-aid.
 The only thing it'd really fix is the first_success_by issue, which
 we could fix *without* a dump/reload by using different compression
 parameters for jsonb.  Moving the directory to the end, by itself,
 does nothing to fix the problem that the directory contents aren't
 compressible --- and we now have pretty clear evidence that that is a
 significant issue.  (See for instance Josh's results that increasing
 first_success_by did very little for the size of his dataset.)

 I think the realistic alternatives at this point are either to
 switch to all-lengths as in my test patch, or to use the hybrid approach
 of Heikki's test patch.  IMO the major attraction of Heikki's patch
 is that it'd be upward compatible with existing beta installations,
 ie no initdb required (but thus, no opportunity to squeeze in a version
 identifier either).  It's not showing up terribly well in the performance
 tests I've been doing --- it's about halfway between HEAD and my patch on
 that extract-a-key-from-a-PLAIN-stored-column test.  But, just as with my
 patch, there are things that could be done to micro-optimize it by
 touching a bit more code.

 I did some quick stats comparing compressed sizes for the delicio.us
 data, printing quartiles as per Josh's lead:

 all-lengths {440,569,609,655,1257}
 Heikki's patch  {456,582,624,671,1274}
 HEAD{493,636,684,744,1485}

 (As before, this is pg_column_size of the jsonb within a table whose rows
 are wide enough to force tuptoaster.c to try to compress the jsonb;
 otherwise many of these values wouldn't get compressed.)  These documents
 don't have enough keys to trigger the first_success_by issue, so that
 HEAD doesn't look too awful, but still there's about an 11% gain from
 switching from offsets to lengths.  Heikki's method captures much of
 that but not all.

 Personally I'd prefer to go to the all-lengths approach, but a large
 part of that comes from a subjective assessment that the hybrid approach
 is too messy.  Others might well disagree.

 In case anyone else wants to do measurements on some more data sets,
 attached is a copy of Heikki's patch updated to apply against git tip.

 regards, tom lane


I agree that versioning might sound silly at this point, but let's keep it
in mind.
Row-level compression is quite slow by itself, so it sounds odd to me to pay
a 25% performance penalty everywhere for the sake of a better
compression ratio in the directory area.
Consider, for example, an optimization that stuffs integers (up to 28 bits)
inside the JEntry itself. That alone would save 8 bytes for each integer.
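
As an illustration only, here is a minimal sketch of that kind of inline
packing, assuming a hypothetical 32-bit entry word with a 4-bit type tag
(all names here are made up for the example and do not come from the jsonb
code):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout: high 4 bits = type tag, low 28 bits = payload. */
#define ENTRY_TYPE_INLINE_INT  0x1
#define ENTRY_PAYLOAD_BITS     28
#define ENTRY_PAYLOAD_MASK     ((1u << ENTRY_PAYLOAD_BITS) - 1)

/*
 * Pack a small non-negative integer directly into the entry word,
 * instead of storing an offset/length pointing at a separate numeric datum.
 */
static int
pack_inline_int(uint32_t *entry, uint32_t value)
{
	if (value > ENTRY_PAYLOAD_MASK)
		return 0;				/* too wide, must be stored out-of-line */
	*entry = ((uint32_t) ENTRY_TYPE_INLINE_INT << ENTRY_PAYLOAD_BITS) | value;
	return 1;
}

static uint32_t
unpack_inline_int(uint32_t entry)
{
	return entry & ENTRY_PAYLOAD_MASK;
}

int
main(void)
{
	uint32_t	e;

	if (pack_inline_int(&e, 123456))
		printf("stored inline, decoded back: %u\n", unpack_inline_int(e));
	return 0;
}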


Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback

2014-08-16 Thread Sawada Masahiko
On Wed, Aug 13, 2014 at 5:55 PM,  furu...@pm.nttdata.co.jp wrote:
 I don't think that it's good idea to control that behavior by using
 --status-interval. I'm sure that there are some users who both want that
 behavior and want set the maximum interval between a feedback is sent
 back to the server because these settings are available in walreceiver.
 But your version of --status-interval doesn't allow those settings at
 all. That is, if set to -1, the maximum interval between each feedback
 cannot be set. OTOH, if set to the positive number,
 feedback-as-soon-as-fsync behavior cannot be enabled. Therefore, I'm
 thinking that it's better to introduce new separate option for that
 behavior.

 Thanks for the review!
 In this patch the option was split as you pointed out.

 If the -r option is specified, status packets are sent to the server as soon
 as fsync completes.
 Otherwise, status packets are sent to the server at the spacing of
 --status-interval.


Hi,

I took a look at this patch.
I applied the patch to master successfully and compiled it without errors.
It also works fine.

One question: why is reply_fsync defined as a volatile variable?
Sorry, I could not understand the reason for that.

Currently the patch modifies the arguments of some functions (e.g.,
HandleCopyStream, ProcessLogDataMsg) and adds similar code to each
function.
I don't think that is a good approach.
For example, I think we should gather this code into one function, as in
the sketch below.
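
As an illustration of what that consolidation could look like, here is a
hedged sketch; the names (maybe_send_feedback, reply_after_fsync, and so on)
are placeholders and are not taken from the patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef int64_t XLogRecPtr;
typedef int64_t TimestampTz;

/* Placeholder for the real status-packet sender in receivelog.c. */
static bool
send_status_packet(XLogRecPtr flushed, TimestampTz now)
{
	printf("feedback: flushed=%lld at %lld\n",
		   (long long) flushed, (long long) now);
	return true;
}

/*
 * One place that decides whether a feedback message is due: either the
 * fsync-triggered path (-r style behaviour) or the interval-based path.
 */
static bool
maybe_send_feedback(XLogRecPtr flushed, TimestampTz now,
					TimestampTz *last_status, int status_interval_ms,
					bool reply_after_fsync, bool just_fsynced)
{
	bool		due = false;

	if (reply_after_fsync && just_fsynced)
		due = true;
	else if (status_interval_ms > 0 &&
			 now - *last_status >= (TimestampTz) status_interval_ms * 1000)
		due = true;

	if (!due)
		return true;			/* nothing to send yet */

	if (!send_status_packet(flushed, now))
		return false;
	*last_status = now;
	return true;
}

int
main(void)
{
	TimestampTz last = 0;

	/* fsync just happened and -r behaviour is on: feedback goes out at once */
	maybe_send_feedback(1000, 5000, &last, 10000, true, true);
	return 0;
}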

Thought?

Regards,

---
Sawada Masahiko




Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index

2014-08-16 Thread Sawada Masahiko
On Fri, Aug 8, 2014 at 11:45 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Fri, Aug 8, 2014 at 11:43 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Apr 14, 2014 at 11:40 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Fujii Masao masao.fu...@gmail.com writes:
 On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas robertmh...@gmail.com wrote:
 Should we try to install some hack around fastupdate for 9.4?  I fear
 the divergence between reasonable values of work_mem and reasonable
 sizes for that list is only going to continue to get bigger.  I'm sure
 there's somebody out there who has work_mem = 16GB, and stuff like
 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
 appeal of large values.

 Controlling the threshold of the size of the pending list only by a GUC
 doesn't seem reasonable. Users may want to increase the threshold only for
 GIN indexes which are updated heavily, and decrease it otherwise. So
 I think that it's better to add a new storage parameter for GIN indexes to
 control the threshold, or both a storage parameter and a GUC.

 Yeah, -1 for a GUC.  A GIN-index-specific storage parameter seems more
 appropriate.

 The attached patch introduces the GIN index storage parameter
 PENDING_LIST_CLEANUP_SIZE, which specifies the maximum size of the
 GIN pending list. If it's not set, work_mem is used as that maximum size
 instead. So this patch doesn't break existing applications that currently
 use work_mem as the threshold for the cleanup operation of the pending
 list; they simply need not set PENDING_LIST_CLEANUP_SIZE.

 This is an index storage parameter, and it allows us to specify the
 threshold per GIN index. So, for example, we can increase the threshold
 only for a GIN index which is updated heavily, and decrease it
 otherwise.
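
In other words, the threshold selection reduces to something like the
following sketch (names and units are illustrative, not the patch's actual
reloption accessors):

#include <stdio.h>

/*
 * -1 means "not set on this index", mirroring how an unset reloption
 * typically falls back to a GUC; both values are in kilobytes here.
 */
static int
pending_list_cleanup_threshold_kb(int reloption_kb, int work_mem_kb)
{
	return (reloption_kb >= 0) ? reloption_kb : work_mem_kb;
}

int
main(void)
{
	/* index with the storage parameter set to 4MB, work_mem at 16MB */
	printf("%d kB\n", pending_list_cleanup_threshold_kb(4096, 16384));
	/* index without the parameter: falls back to work_mem */
	printf("%d kB\n", pending_list_cleanup_threshold_kb(-1, 16384));
	return 0;
}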

 This patch uses another patch that I proposed (*1) as an infrastructure.
 Please apply that infrastructure patch first if you apply this patch.

 (*1)
 http://www.postgresql.org/message-id/CAHGQGwEanQ_e8WLHL25=bm_8z5zkyzw0k0yir+kdmv2hgne...@mail.gmail.com

 Regards,

 --
 Fujii Masao

 Sorry, I forgot to attach the patch. This time, attached.


I think that this patch should be rebased.
It contains garbage code.

Regards,

---
Sawada Masahiko




Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-16 Thread Noah Misch
Nice algorithm.

On Fri, Aug 15, 2014 at 02:16:08PM -0400, Robert Haas wrote:
 On Fri, Aug 15, 2014 at 12:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  * We talked about combining this technique with a plain file lock
  so that we would have belt-and-suspenders protection, in particular
  something that would have a chance of working across NFS clients.
  This would suggest leaving the fcntl lock in place, ie, don't do
  step 11, and also that the file-to-be-locked *not* have any other
  purpose (which would only increase the risk of losing the lock
  through careless open/close).
 
 I'd be afraid that a secondary mechanism that mostly-but-not-really
 works could do more harm by allowing us to miss bugs in the primary,
 pipe-based locking mechanism than the good it would accomplish.

Users do corrupt their NFS- and GFS2-hosted databases today.  I would rather
have each process hold only an fcntl() lock than hold only the FIFO file
descriptor.  There's no such dichotomy, so let's have both.
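
A minimal illustration of holding both at once (made-up paths, no real error
handling, and not the proposed postmaster code):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
	const char *lockpath = "/tmp/pg_demo.lock";	/* illustrative paths only */
	const char *fifopath = "/tmp/pg_demo.fifo";
	struct flock fl = {0};
	int			lockfd, fifofd;

	/* Belt: a plain fcntl() write lock on a file used for nothing else. */
	lockfd = open(lockpath, O_RDWR | O_CREAT, 0600);
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;		/* l_start = l_len = 0: lock the whole file */
	if (lockfd < 0 || fcntl(lockfd, F_SETLK, &fl) < 0)
	{
		perror("fcntl lock");
		return 1;
	}

	/*
	 * Suspenders: also keep a FIFO descriptor open, as in the pipe-based
	 * scheme; children inheriting it keep the "lock" alive until they exit.
	 */
	if (mkfifo(fifopath, 0600) < 0 && errno != EEXIST)
	{
		perror("mkfifo");
		return 1;
	}
	fifofd = open(fifopath, O_RDONLY | O_NONBLOCK);
	if (fifofd < 0)
	{
		perror("open fifo");
		return 1;
	}

	/* ... run; both lockfd and fifofd are held for the process lifetime ... */
	pause();
	return 0;
}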

  Meh.  Do we really want to allow a new postmaster to start if there
  are any processes remaining that were launched by backends?  I'd
  be inclined to just suppress close-on-exec, period.
 
 Seems like a pretty weird and artificial restriction.  Anything that
 has done exec() will not be connected to shared memory, so it really
 doesn't matter whether it's still alive or not.  People can and do
 write extensions that launch processes from PostgreSQL backends via
 fork()+exec(), and we've taken pains in the past not to break such
 cases.  I don't see a reason to impose now (for no
 data-integrity-related reason) the rule that any such processes must
 not be daemons.

+1

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] pgbench --tuple-size option

2014-08-16 Thread Fabien COELHO



The custom initialization is to run a manual ALTER after the
initialization.


Sure, it can be done this way.

I'm not sure about the implication of ALTER on the table storage,


Should be fine in this case.


After some testing and laughing, my conclusion is: not fine at all. The 
filler attributes in pgbench are EXTENDED by default, which means 
possibly compressed... As the default value is '', the compression, 
when tried for large sizes, performs very well, and the performance is the 
same as with a (declared) smaller tuple :-) Probably not the intention of 
the benchmark designer. Conclusion: I need an ALTER TABLE anyway to change 
the STORAGE. Or maybe pgbench should always do it anyway...


Conclusion 2: I've marked the submission as rejected, as both you and 
Fujii don't like it; although I found it useful, I can do without 
it quite easily.


--
Fabien.




Re: [HACKERS] option -T in pg_basebackup doesn't work on windows

2014-08-16 Thread Amit Kapila
On Fri, Aug 15, 2014 at 1:03 PM, MauMau maumau...@gmail.com wrote:

 Thank you.  The code looks correct.  I confirmed that pg_basebackup
could relocate the tablespace directory on Windows.

 I marked this patch as ready for committer.

Thanks for the review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-08-16 Thread Tomas Vondra


On 12.8.2014 00:30, Tomas Vondra wrote:
 On 11.8.2014 20:25, Robert Haas wrote:
 It also strikes me that when there's only 1 batch, the set of bits
 that map onto the batch number is zero-width, and one zero-width bit
 range is as good as another.  In other words, if you're only planning
 to do one batch, you can easily grow the number of buckets on the fly.
 Growing the number of buckets only becomes difficult once you have
 more than one batch.

...

 I was considering reversing the bits of the hash, because that's
 pretty much the simplest solution. But I think you're right that it might
 actually work like this:

   * Are more batches needed?

   (yes) => just use nbuckets = my_log2(work_mem / tuple_size)

   (no) => go ahead, until processing all tuples or hitting work_mem

   (work_mem) => meh, use the same nbuckets as above

   (all tuples) => compute optimal nbuckets / resize


 But I need to think about this a bit. So far it seems to me there's no
 way additional batches might benefit from increasing nbuckets further.

I think this is a simple and solid solution, solving the batchno
computation issues quite nicely. Attached is the v10 patch (bare and
combined with the dense allocation), which does this (a condensed sketch
follows the list below):

1) when we know we'll need batching, buckets are sized for full work_mem
   (using the estimated tuple width, etc.)

2) without the batching, we estimate the 'right number of buckets' for
   the estimated number of tuples, and keep track of the optimal number
   as tuples are added to the hash table

   - if we discover we need to start batching, we keep the current
 optimal value (which should be the same as the max number of
 buckets) and don't mess with it anymore (making it possible to
 compute batch IDs just like before)

   - also, on the first rebatch (nbatch=1 => nbatch=2) the hash table
 is resized as part of the rebatch

   - if the hash build completes without batching, we do the resize

I believe the patch is pretty much perfect. I plan to do more thorough
testing on a wide range of queries in the next few days.

I also removed the 'enable_hash_resize' GUC, because it would be more
complex to implement it properly after doing the resize as part of the
rebatch, etc. So either it would make the patch more complex, or it
wouldn't do what the name promises.

regards
Tomas

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ee3fffb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..a5d4812 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,8 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
-
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 								  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -47,6 +47,10 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 									int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
+
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
 
 /* ----------------------------------------------------------------
  *		ExecHash
@@ -116,6 +120,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,10 +131,30 @@ 

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-16 Thread Tomas Vondra
On 10.8.2014 22:50, Jeff Davis wrote:
 On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote:
 Either way, it's better to be conservative. Attached is a version
 of the patch with opt-in memory usage tracking. Child contexts
 inherit the setting from their parent.
 
 There was a problem with the previous patch: the parent was unlinked 
 before the delete_context method was called, which meant that the 
 parent's memory was not being decremented.
 
 Attached is a fix. It would be simpler to just reorder the operations
 in MemoryContextDelete, but there is a comment warning against that.
 So, I pass the old parent as an argument to the delete_context
 method.

Hi,

I did a quick check of the patch, mostly because I wasn't sure whether
it allows accounting for a selected subtree only (e.g. aggcontext and
its children). And yes, it does exactly that, which is great.

Also, the code seems fine to me, except maybe for this piece in
AllocSetDelete:

/*
 * Parent is already unlinked from context, so can't use
 * update_allocation().
 */
while (parent != NULL)
{
    parent->total_allocated -= context->total_allocated;
    parent = parent->parent;
}

I believe this should check parent->track_mem, just like
update_allocation does, because as it is it walks all the parent contexts
up to TopMemoryContext.

It does not really break anything (because parents without tracking
enabled don't report allocated space), but for plans that create and
destroy a lot of contexts (say, GroupAggregate with a lot of groups)
this might add unnecessary overhead.
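
In other words, something along these lines, using stand-in types and the
same field names as the snippet above (a sketch of the suggested change, not
a tested patch):

#include <stddef.h>

typedef struct MemoryContextSketch
{
	struct MemoryContextSketch *parent;
	int			track_mem;			/* tracking enabled for this context? */
	size_t		total_allocated;
} MemoryContextSketch;

/*
 * Sketch of the suggested change: only adjust parents that actually track
 * memory, mirroring what update_allocation does, instead of updating every
 * ancestor all the way up to TopMemoryContext.
 */
static void
propagate_delete(MemoryContextSketch *context)
{
	MemoryContextSketch *parent = context->parent;

	while (parent != NULL)
	{
		if (parent->track_mem)
			parent->total_allocated -= context->total_allocated;
		parent = parent->parent;
	}
}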

regards
Tomas




Re: [HACKERS] replication commands and log_statements

2014-08-16 Thread Amit Kapila
On Thu, Aug 14, 2014 at 7:19 PM, Stephen Frost sfr...@snowman.net wrote:
 * Amit Kapila (amit.kapil...@gmail.com) wrote:
  On Thu, Aug 14, 2014 at 5:56 AM, Stephen Frost sfr...@snowman.net
wrote:
   Regarding this, I'm generally in the camp that says to just include it
   in 'all' and be done with it- for now.
 
  Okay, but if tomorrow someone wants to implement a patch to log
  statements executed through SPI (statements inside functions), then
  what will be your suggestion: can those also be logged with the 'all'
  option, or would you suggest waiting for a proper
  auditing system?

 No, I'd suggest having a different option for it as that would be a huge
 change for people who are already doing 'all', potentially.

Not only that; I think another important reason is that those represent
a separate set of functionality, and some users could be interested in
logging those statements alone or along with other sets of commands.

  Adding the
 replication commands is extremely unlikely to cause individuals who are
 already logging 'all' any problems, as far as I can tell.

As 'all' is a superset of all commands, they would have been logged
under 'all' anyway; the point is whether replication commands
deserve a separate log level parameter.  To me, it appears that they
do, even though the number of commands which fall under this category
is small today; it can increase tomorrow.  Also, if tomorrow there is
consensus on how to extend the log_statement parameter, it is quite
likely that someone will come up with a patch to treat replication
commands as a separate log level.

I think ideally it would have been better if we could have logged
replication commands under a separate log level, but since there
is still no consensus on extending log_statement and nobody is even
willing to pursue it, it seems okay to go ahead and log these under
the 'all' level.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-16 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 I believe this should check parent->track_mem, just like
 update_allocation does, because as it is it walks all the parent contexts
 up to TopMemoryContext.

TBH, I don't think this opt-in tracking business has been thought
through even a little bit; for example, what happens if there is a
declared-to-not-track context in between levels that are declared
to track?  And it certainly has not been proven that the extra
complexity actually saves cycles rather than the reverse.

regards, tom lane




Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-16 Thread Tomas Vondra
On 16.8.2014 20:00, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 I believe this should check parent->track_mem, just like
 update_allocation does, because as it is it walks all the parent contexts
 up to TopMemoryContext.
 
 TBH, I don't think this opt-in tracking business has been thought 
 through even a little bit; for example, what happens if there is a 
 declared-to-not-track context in between levels that are declared to
 track? And it certainly has not been proven that the extra complexity
 actually saves cycles rather than the reverse.


AFAIK there's no other built-in accounting mechanism at the moment.
That's one of the reasons why some of the nodes had to invent their own
custom accounting - hashjoin is an example of that.

We do have MemoryContextStats, which essentially walks the memory
context hierarchy and collects all the important information. While
working on the hashjoin patch, I was thinking that maybe we could create
a MemoryContextStats clone that computes the accounting info on request.

So I did an experiment, writing this new MemoryContextStats clone and
rewriting nodeHash.c to use it instead of the custom tracking. The
impact was consistently ~2-3% for the hashjoin queries I measured (e.g.
9100 ms instead of 8900 ms, etc.).
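
For reference, the kind of on-request walker I mean is roughly this shape
(a simplified sketch with made-up struct names, not the actual
MemoryContextStats code):

#include <stddef.h>

typedef struct BlockNode
{
	struct BlockNode *next;
	size_t		size;
} BlockNode;

typedef struct MemoryContextNode
{
	struct MemoryContextNode *firstchild;
	struct MemoryContextNode *nextchild;
	BlockNode  *blocks;
} MemoryContextNode;

/*
 * Walk a context subtree on request and sum the sizes of all its blocks.
 * This per-block, per-call walk is what a running total maintained at
 * allocation time avoids, which is where the overhead difference comes from.
 */
static size_t
collect_allocated(const MemoryContextNode *ctx)
{
	size_t		total = 0;

	for (const BlockNode *blk = ctx->blocks; blk != NULL; blk = blk->next)
		total += blk->size;

	for (const MemoryContextNode *child = ctx->firstchild;
		 child != NULL;
		 child = child->nextchild)
		total += collect_allocated(child);

	return total;
}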

I just tried the same with the hashjoin patch applied. With the built-in
accounting the queries now run ~2.5x faster - around 3500 ms. With the
MemoryContextStats clone, it takes a whopping 18000 ms.

I don't know what causes such an increase (I assume I use the accounting
info within a loop or something). The point is that with Jeff's patch,
it goes down to 3500 ms, with pretty much zero overhead.

I'll check why the patched hashjoin is so much slower, but even if I can
optimize it somehow, I don't think I'll get it to zero overhead. So I'd say
this accounting actually saves cycles.


Of course, that does not prove it's a well thought-out design,
although it seems sensible to me. It's the least complex accounting
implementation I can think of.

For example, I don't think that being able to create nonsensical hierarchies
is a major flaw. We certainly should not do that in our own code, and we
should take reasonable precautions to prevent it in user code (as,
for example, inheriting the tracking flag in MemoryContextCreate, as Jeff
does). Can this be seen as a foot gun? Maybe, but we hand
ammunition to users on a daily basis - for example they can switch to
TopMemoryContext, and that does not make memory contexts bad.

But maybe the inheritance really is not necessary - maybe it would be
enough to track this per context, and then just walk through the
contexts and collect the totals. My observation is that most of the
time is actually spent walking through blocks and freelists.

Then the issues with nonsensical hierarchies would disappear, and the
possible overhead of updating all parent hierarchies would also go away
(not sure it's even measurable, though).

regards
Tomas




