Re: [HACKERS] 9.5 CF1

2014-07-13 Thread Abhijit Menon-Sen
Hi.

We're now at the "nearly done" stage of the first 9.5 CF.

Twenty-four patches have been committed so far, and five more are ready
for committer (and of those, four are small). Thirty patches are still
marked "needs review", and twenty-one are waiting on author.

I'll go through the patches and follow up or move them to the 2014-08
CF as appropriate, so that this one can be closed out as scheduled.

Many thanks to everyone for their participation in this CF.

-- Abhijit




Re: [HACKERS] 9.4 documentation: duplicate paragraph in logical decoding example

2014-07-13 Thread Christoph Moench-Tegeder
## Andres Freund (and...@2ndquadrant.com):

> Care to submit a patch for it Christoph?

There it is.

Regards,
Christoph

-- 
Spare Space
diff --git a/doc/src/sgml/logicaldecoding.sgml b/doc/src/sgml/logicaldecoding.sgml
index 41b63b4..6c3707c 100644
--- a/doc/src/sgml/logicaldecoding.sgml
+++ b/doc/src/sgml/logicaldecoding.sgml
@@ -114,7 +114,7 @@ postgres=# SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, N
  0/16E0B90 | 690 | COMMIT 690
 (3 rows)
 
-postgres=# -- You can also peek ahead in the change stream without consuming changes
+postgres=# -- The next call to pg_logical_slot_peek_changes() returns the same changes again
 postgres=# SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL);
  location  | xid | data
 ---+-+---



Re: [HACKERS] Use unique index for longer pathkeys.

2014-07-13 Thread Amit Kapila
On Fri, Jun 13, 2014 at 1:11 PM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:
>
> Hello, This is the continuation from the last CF.
>
> This patch intends to make PG to use index for longer pathkeys
> than index columns when,
>
>  - The index is a unique index.
>  - All index columns are NOT NULL.
>  - The index column list is a subset of query_pathkeys.
>
> 
>
> This patch consists of mainly three parts.
>
>  1. Mark index as 'Uniquely orders the table' if the conditions
> are satisfied. This is the same from the previous patch.
>
>  2. Collect the "common primary pathkeys", which is pathkeys that
> can perfectly sort all involved
> RTE's. (collect_common_primary_pathkeys())
>
>  3. Trim the query pathkeys using the common primary pathkeys.
> (trim_query_pathkeys())

I have spent some time on this patch and would like to share my
findings below.

1.
It seems to me that the new flag (IndexOptInfo.full_ordered) introduced in
this patch is not getting set properly.  Please refer to the test below:

create table t (a int, b int, c int, d text);
create unique index i_t_pkey on t(a, b);
insert into t (select a % 10, a / 10, a, 't' from generate_series(0,
10) a);
analyze;
explain (costs off, analyze on) select distinct * from t;

Unpatched -
postgres=# explain (costs off, analyze on) select distinct * from t;
QUERY PLAN
---
 Unique (actual time=331.624..597.278 rows=11 loops=1)
   ->  Sort (actual time=330.889..430.449 rows=11 loops=1)
 Sort Key: a, b, c, d
 Sort Method: external merge  Disk: 2344kB
 ->  Seq Scan on t (actual time=0.013..74.202 rows=11 loops=1)
 Execution time: 665.332 ms
(6 rows)

Patch (pathkey_and_uniqueindx_typ2_v1) -
postgres=# explain (costs off, analyze on) select distinct * from t;
QUERY PLAN
---
 Unique (actual time=336.033..601.144 rows=11 loops=1)
   ->  Sort (actual time=336.027..436.621 rows=11 loops=1)
 Sort Key: a, b, c, d
 Sort Method: external merge  Disk: 2344kB
 ->  Seq Scan on t (actual time=0.014..78.826 rows=11 loops=1)
 Execution time: 667.387 ms
(6 rows)

Shouldn't the above query use an index scan after the patch?

2.
typedef struct IndexOptInfo
  bool predOK; /* true if predicate matches query */
  bool unique; /* true if a unique index */
  bool immediate; /* is uniqueness enforced immediately? */
+ bool full_ordered; /* don't key columns duplicate? */

It seems you have forgotten to update the out function _outIndexOptInfo().

3.
+ root->distinct_pathkeys =
+ shorten_pathkeys_following(root, root->distinct_pathkeys,pk_guide,
+   true);
+
+ root->query_pathkeys =
+ shorten_pathkeys_following(root,root->query_pathkeys,pk_guide,
+   true);

Add a space after each parameter in function call.

4.
+PathKeysComparison
+compare_pathkeys_ignoring_strategy(List *keys1, List *keys2)

a. This function returns 4 different types of values (PATHKEYS_EQUAL,
   PATHKEYS_DIFFERENT, PATHKEYS_BETTER1, PATHKEYS_BETTER2);
   however, the caller just checks for PATHKEYS_EQUAL. Isn't it better to
   just return either PATHKEYS_EQUAL or PATHKEYS_DIFFERENT?
b. Another idea could be that instead of writing a separate function, you
   pass an additional parameter to compare_pathkeys() to distinguish
   the ignore-strategy case (a rough sketch of this follows the list).
c. In any case, I think it is good to add comments on top of the newly
   introduced function (compare_pathkeys_ignoring_strategy), or to extend
   the comments of compare_pathkeys() if you want to add a new parameter
   to it.
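
As a rough standalone sketch of suggestion (b), with simplified stand-in
types rather than the planner's actual List-based code (all names here
are illustrative only, not the patch's):

#include <stdbool.h>

typedef enum
{
	PATHKEYS_EQUAL, PATHKEYS_BETTER1, PATHKEYS_BETTER2, PATHKEYS_DIFFERENT
} PathKeysComparison;

typedef struct PathKeyStub
{
	int			eclass;		/* stand-in for pk_eclass */
	int			strategy;	/* stand-in for pk_strategy */
} PathKeyStub;

static PathKeysComparison
compare_pathkeys_stub(const PathKeyStub *keys1, int n1,
					  const PathKeyStub *keys2, int n2,
					  bool ignore_strategy)
{
	int			i;
	int			n = (n1 < n2) ? n1 : n2;

	for (i = 0; i < n; i++)
	{
		if (keys1[i].eclass != keys2[i].eclass)
			return PATHKEYS_DIFFERENT;
		if (!ignore_strategy && keys1[i].strategy != keys2[i].strategy)
			return PATHKEYS_DIFFERENT;
	}

	/* one list is a prefix of the other: the longer one sorts "better" */
	if (n1 == n2)
		return PATHKEYS_EQUAL;
	return (n1 > n2) ? PATHKEYS_BETTER1 : PATHKEYS_BETTER2;
}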

> Finally, the new patch has grown to twice the size. Although
> it seems a bit large, I think that side-effects are clearly
> avoided.

I am a bit worried about the extra cycles added by this patch compared
to your previous patch; for example:
During trimming of paths, it will build index pathkeys (build_index_pathkeys())
which it will anyway do again during creation of index paths:
create_index_paths()->get_index_paths()->build_index_paths()->build_index_pathkeys()

I am not sure if this has a direct impact; however, I suspect the trimming
logic can add quite a few instructions even in cases where it is not
beneficial.
At this moment I don't have a better idea either, so let's proceed with the
review of this patch unless there is a better way to achieve it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-13 Thread Stephen Frost
Tomas,

* Tomas Vondra (t...@fuzzy.cz) wrote:
> On 6.7.2014 17:57, Stephen Frost wrote:
> > * Tomas Vondra (t...@fuzzy.cz) wrote:
> >> I can't find the thread / test cases in the archives. I've found this
> >> thread in hackers:
> >>
> >> http://www.postgresql.org/message-id/caoezvif-r-ilf966weipk5by-khzvloqpwqurpak3p5fyw-...@mail.gmail.com
> >>
> >> Can you point me to the right one, please?
> > 
> > This:
> > 
> > http://www.postgresql.org/message-id/20130328235627.gv4...@tamriel.snowman.net
> 
> Sadly, the link to the SQL does not work :-(
> 
>   http://snowman.net/~sfrost/test_case.sql => 404
> 
> > and I think there may have been other test cases on that thread
> > somewhere...  I certainly recall having at least a couple of them that I
> > was playing with.
> > 
> > I can probably find them around here somewhere too, if necessary.
> 
> If you could look for them, that'd be great.

cc'ing me directly helps me notice these...

I've added back test_case.sql and test_case2.sql.  Please check them out
and let me know; they're real-world cases we've run into.

Thanks!

Stephen




Re: [JDBC] [HACKERS] Setting PG-version without recompiling

2014-07-13 Thread Craig Ringer
On 07/03/2014 10:16 PM, Tom Lane wrote:
> Andreas Joseph Krogh  writes:
>> Hi. I'm up for testing 9.4 but my JDBC-driver fails to connect due to
>> PG's minor-version string: "4beta1". Is it possible to set this somewhere
>> without recompiling PG?
> 
> No, and even if you could, that would be the wrong approach.  The right
> approach is to fix the JDBC driver to not complain about such version
> strings.  I'm a bit surprised they haven't done so long since, considering
> how long PG beta versions have been tagged like that.  For that matter,
> they really ought not complain about strings like "9.5devel" or
> "9.5alpha2" either.

Yeah, that's horrible. We should be using server_version_num.

It should be a quick enough fix, I'll take a look.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-07-13 Thread Andres Freund
On 2014-07-10 08:46:55 +0530, Amit Kapila wrote:
> As per my understanding of the general theory around barriers,
> read and write are defined to avoid reordering due to compiler and
> full memory barriers are defined to avoid reordering due to
> processors.

No, that's not the case. There are several processors where write/read
barriers are an actual thing at the hardware level.
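
To illustrate the distinction (GCC-style builtins for the sake of a
self-contained example, not PostgreSQL's actual barrier.h definitions,
which vary by platform):

/* compiler barrier: only stops compiler reordering, emits no instruction */
#define compiler_barrier()	__asm__ __volatile__("" ::: "memory")

/* full memory barrier: emits an actual fence instruction */
#define memory_barrier()	__sync_synchronize()

/* read/write barriers (roughly, acquire/release fences here): on
 * hardware with distinct fence instructions these can be cheaper than
 * a full fence */
#define read_barrier()		__atomic_thread_fence(__ATOMIC_ACQUIRE)
#define write_barrier()		__atomic_thread_fence(__ATOMIC_RELEASE)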

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] better atomics - v0.5

2014-07-13 Thread Andres Freund
Hi Amit,

On 2014-07-08 11:51:14 +0530, Amit Kapila wrote:
> On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila 
> wrote:
> > On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund 
> wrote:
> 
> Further review of patch:
> 1.
> /*
>  * pg_atomic_test_and_set_flag - TAS()
>  *
>  * Acquire/read barrier semantics.
>  */
> STATIC_IF_INLINE_DECLARE bool
> pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
> 
> a. I think Acquire and read barrier semantics aren't equal.

Right. It was meant as Acquire (thus including read barrier) semantics. I
guess I'd better update README.barrier to have definitions of all barriers.

> b. I think it adheres to Acquire semantics for platforms where
> that semantics
> are supported else it defaults to full memory ordering.
> Example :
> #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
> 
> Is it right to declare that generic version of function adheres to
> Acquire semantics.

Implementing stronger semantics than required should always be
fine. There's simply no sane way to work with the variety of platforms
we support in any other way.
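
For instance (a sketch using GCC's __atomic builtins, not the patch's
code), both of these satisfy a caller that needs acquire semantics; the
second simply provides stronger ordering than required:

#include <stdbool.h>

static inline bool
tas_acquire(volatile char *flag)
{
	/* acquire ordering, where the platform can provide it directly */
	return __atomic_test_and_set(flag, __ATOMIC_ACQUIRE);
}

static inline bool
tas_seq_cst(volatile char *flag)
{
	/* full (sequentially consistent) ordering, as e.g. a generic
	 * compare-and-exchange based fallback would give */
	return __atomic_test_and_set(flag, __ATOMIC_SEQ_CST);
}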

> 
> 2.
> bool
> pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
> {
> return TAS((slock_t *) &ptr->sema);
> #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

Where's that from? I can't recall adding a line of code like that?

> a. In above usage (ptr->sema), sema itself is not declared as volatile,
> so would it be right to use it in this way for API
> (InterlockedCompareExchange)
> expecting volatile.

Upgrading a pointer to volatile is defined, so I don't see the problem?
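
A standalone example of that rule: C allows implicitly adding qualifiers
such as volatile when passing a pointer to a function.

static int
read_through_volatile(volatile int *p)
{
	return *p;				/* the access goes through a volatile lvalue */
}

int
main(void)
{
	int			plain = 42;

	/* int * converts implicitly to volatile int *: well defined */
	return (read_through_volatile(&plain) == 42) ? 0 : 1;
}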

> 3.
> /*
>  * pg_atomic_unlocked_test_flag - TAS()
>  *
>  * No barrier semantics.
>  */
> STATIC_IF_INLINE_DECLARE bool
> pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
> 
> a. There is no setting in this, so using TAS in function comments
> seems to be wrong.

Good point.

> b. BTW, I am not able see this function in C11 specs.

So? It's needed for a sensible implementation of spinlocks on top of
atomics.

> 4.
> #if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) &&
> defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
> ..
> #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> static inline bool
> pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> {
> return pg_atomic_read_u32_impl(ptr) == 1;
> }
> 
> 
> #elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) &&
> defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
> ..
> #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> static inline bool
> pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> {
> return (bool) pg_atomic_read_u32_impl(ptr);
> }
> 
> Is there any reason to keep these two implementations different?

No, will change.

> 5.
> atomics-generic-gcc.h
> #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> static inline bool
> pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> {
> return ptr->value == 0;
> }
> #endif
> 
> Don't we need to compare it with 1 instead of 0?

Why? It returns true if the lock is free, which makes sense imo. Will add
comments to atomics.h.

> 6.
> atomics-fallback.h
> pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> {
> /*
>  * Can't do this in the semaphore based implementation - always return
>  * true. Do this in the header so compilers can optimize the test away.
>  */
> return true;
> }
> 
> Why we can't implement this in semaphore based implementation?

Because semaphores don't provide the API for it?

> 7.
> /*
>  * pg_atomic_clear_flag - release lock set by TAS()
>  *
>  * Release/write barrier semantics.
>  */
> STATIC_IF_INLINE_DECLARE void
> pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
> 
> a. Are Release and write barriers equivalent?

Same answer as above. Meant as "Release (thus including write barrier) 
semantics"

> b. IIUC then current code doesn't have release semantics for unlock/clear.

If you're referring to s_lock.h with 'current code', yes it should have:
 *  On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
 *  S_UNLOCK() macros must further include hardware-level memory fence
 *  instructions to prevent similar re-ordering at the hardware level.
 *  TAS() and TAS_SPIN() must guarantee that loads and stores issued after
 *  the macro are not executed until the lock has been obtained.  Conversely,
 *  S_UNLOCK() must guarantee that loads and stores issued before the macro
 *  have been executed before the lock is released.

> We are planning to introduce it in this patch; also, this doesn't follow the
> spec, which says that clear should have memory_order_seq_cst ordering
> semantics.

Making it guarantee full memory barrier semantics is a major performance
loss on x86-64, so I don't want to do that.

Also there is atomic_flag_test_and_set_explicit()...
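
For reference, the C11 _explicit variants allow exactly such weaker
orderings; a minimal spinlock sketch:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void
spin_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
		;						/* spin until the flag was previously clear */
}

static void
spin_unlock(void)
{
	/* release ordering, not memory_order_seq_cst */
	atomic_flag_clear_explicit(&lock, memory_order_release);
}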

> As per its current usage in patch (S_UNLOCK), it seems reasonable
> to have *release* semantics for this API, however I think if we use it for
> generic purpose then it might not be right to define it w

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-07-13 Thread Tomas Vondra
On 3.7.2014 19:36, Tomas Vondra wrote:
> On 1.7.2014 01:24, Tomas Vondra wrote:
>> On 30.6.2014 23:12, Tomas Vondra wrote:
>>> Hi,
>>
>> Hopefully I got it right this time. At least it seems to be working
>> for cases that failed before (no file leaks, proper rowcounts so
>> far).
> 
> Attached v7, fixing nbatch/ntuples in an assert.
> 
> regards
> Tomas

Attached is v8 of this patch, significantly reworked based on the
related discussion (mostly in 'tweaking NTUP_PER_BUCKET').

With the patch applied, the hash table works like this:

initial sizing (ExecChooseHashTableSize)

- size nbuckets for NTUP_PER_BUCKET=1
- account for memory allocated for buckets

building the hash table
-----------------------
- while adding tuples, keep track of optimal number of buckets (for the
  current number of tuples per batch)
- don't resize until all the tuples are added (by increasing nbatch the
  number of buckets may decrease)
- after adding all tuples, consider resizing (optimal nbuckets !=
  initial nbuckets)
- do the resize

This means that for well estimated plans (reasonably exact tuple count
for the inner relation), the hash table is properly sized and no resize
is needed.

For badly estimated plans, this works about the same as the previous
patches (we're accounting for memory needed for buckets etc.).
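
As a rough standalone sketch of the sizing rule (the constants and the
function name are assumptions of this sketch; the real logic lives in
ExecChooseHashTableSize and the resize path):

#include <stdint.h>

#define NTUP_PER_BUCKET 1

/* smallest power-of-two bucket count that keeps the expected number of
 * tuples per bucket at or below NTUP_PER_BUCKET, with an assumed
 * 1024-bucket floor and an upper cap */
static uint32_t
optimal_nbuckets(double tuples_per_batch)
{
	uint32_t	nbuckets = 1024;

	while (nbuckets < tuples_per_batch / NTUP_PER_BUCKET &&
		   nbuckets < (1U << 30))
		nbuckets <<= 1;

	return nbuckets;
}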


More batches or less buckets?
-----------------------------
In the related threads, I repeatedly mentioned that I don't know a good
answer to "Should I add more batches or decrease the number of buckets?"
while sizing the hash table. The current code (as in HEAD) does not face
this dilemma because (a) it does not count buckets into work_mem and (b)
does not allow changing nbuckets.

I still don't have a good answer to this question, but I think the patch
can stand even without it. If you want more/less batches, use
smaller/larger work_mem - same as with the current code. Also, with the
'dense allocation' patch [1], it's possible to use larger work_mem
values (and thus compensate for counting buckets into work_mem).


Changes since v7


  (a) remove NTUP_GROW_THRESHOLD (just use NTUP_PER_BUCKET directly)
  (b) set NTUP_PER_BUCKET=1
  (c) include buckets (optimal) when considering nbatch increase
  (d) track optimal number of buckets (for NTUP_PER_BUCKET)
  (e) after loading all the tuples, resize the hash table to get
  nbuckets_optimal (if needed)
  (f) renamed GUC to enable_hash_resize (makes a bit more sense)


Comments


  (a) NTUP_GROW_THRESHOLD was overcomplicated and mostly a result of
      misunderstanding how NTUP_PER_BUCKET works (it's an upper threshold,
      not an average) and is not really needed.

  (b) The value "1" gives the best performance, but also requires more
  memory for the buckets. This should be more than compensated by
  densely allocating the tuples (see hashjoin-alloc-v3.patch
  in the "tweaking NTUP_PER_BUCKET" thread [1]).

  (c,d) Originally, the memory needed for buckets was not included in
'spaceUsed', but because of the NTUP_PER_BUCKET=1 change this is
no longer acceptable (it needs much more memory than before).
So now it's included both in the initial sizing of the hash
table, and when adding more tuples (nbuckets_optimal).

Still, spaceUsed does not include palloc overhead (which may be
a significant amount of memory). This is tackled by [1].

  (e) No change here.

  (f) A bit better GUC name. Anyway, this is for easier development
  only, and won't be included in the final patch.


regards
Tomas

[1] http://www.postgresql.org/message-id/53c2decc.2010...@fuzzy.cz
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-			 hashtable->nbuckets, hashtable->nbatch,
-			 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+			 hashtable->nbuckets, hashtable->nbuckets_original,
+			 hashtable->nbatch, hashtable-

Re: [HACKERS] SSL information view

2014-07-13 Thread Magnus Hagander
On Sun, Jul 13, 2014 at 10:32 PM, Stefan Kaltenbrunner
 wrote:
> On 07/12/2014 03:08 PM, Magnus Hagander wrote:
>> As an administrator, I find that you fairly often want to know what
>> your current connections are actually using as SSL parameters, and
>> there is currently no other way than gdb to find that out - something
>> we definitely should fix.
>
> Yeah, that would be handy - however, I often wish to be able to figure
> that out based on the logfile as well; any chance of getting these into
> connection-logging/log_line_prefix as well?

We do already log some of it if you have enabled log_connections -
protocol and cipher. Anything else in particular you'd be looking for
- compression info?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] SSL information view

2014-07-13 Thread Stefan Kaltenbrunner
On 07/12/2014 03:08 PM, Magnus Hagander wrote:
> As an administrator, I find that you fairly often want to know what
> your current connections are actually using as SSL parameters, and
> there is currently no other way than gdb to find that out - something
> we definitely should fix.

Yeah, that would be handy - however, I often wish to be able to figure
that out based on the logfile as well; any chance of getting these into
connection-logging/log_line_prefix as well?



Stefan




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-13 Thread Tomas Vondra
On 11.7.2014 19:25, Tomas Vondra wrote:
> 2) walking through the tuples sequentially
> --
> 
> The other option is not to walk the tuples through buckets, but by
> walking throught the chunks - we know the tuples are stored as
> HashJoinTuple/MinimalTuple, so it shouldn't be that difficult. And by
> walking the tuples in the order they're stored, we may just rearrange
> the tuples in already allocated memory. And maybe it could be even
> faster than the current code, because of the sequential access to the
> memory (as opposed to the random access through buckets) and CPU caches.
> So I like this approach - it's simple, clean and shouldn't add any
> overhead (memory or CPU).

So, attached is a patch that implements this. It's not very complicated
and the resulting code is surprisingly readable (IMHO comparable to the
previous code).

Basics of the patch:

(1) adds HashChunkData structure - a buffer used for dense allocation
of tuples, 16kB by default

(2) adds 'chunks_batch' into HashJoinTableData, which is a linked list
of HashChunkData

(3) the allocation itself is done in chunk_alloc - in short, it either
allocates the tuple in the current chunk, or adds a new chunk if
needed (a standalone sketch of this scheme follows the list)

(4) ExecHashTableInsert calls chunk_alloc (instead of the regular
MemoryContextAlloc)

(5) the main change is in ExecHashIncreaseNumBatches, which now can't
do pfree (does not work with dense allocation), so it reads the
tuples from chunks directly and simply rebuilds the buckets
from scratch (thanks to this we only need one more chunk of memory,
not a complete copy, and we don't need to duplicate buckets at all)
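
A minimal, self-contained sketch of that allocation scheme (illustrative
names; the real chunk_alloc works inside the hash table's memory context
rather than calling malloc, and handles cases this sketch omits):

#include <stdlib.h>

#define CHUNK_SIZE (16 * 1024)

typedef struct Chunk
{
	struct Chunk *next;			/* linked list, as in chunks_batch */
	size_t		used;			/* bytes consumed in data[] */
	char		data[CHUNK_SIZE];
} Chunk;

static char *
chunk_alloc_sketch(Chunk **head, size_t size)
{
	Chunk	   *chunk = *head;

	if (size > CHUNK_SIZE)
		return NULL;			/* oversized requests need separate handling */

	/* start a new chunk if there's no room in the current one */
	if (chunk == NULL || chunk->used + size > CHUNK_SIZE)
	{
		chunk = malloc(sizeof(Chunk));
		if (chunk == NULL)
			return NULL;
		chunk->next = *head;
		chunk->used = 0;
		*head = chunk;
	}

	chunk->used += size;
	return chunk->data + (chunk->used - size);
}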

From the tests I've done so far this seems to work perfectly - it still
saves the memory overhead, and I've seen no difference in performance.

The current patch only implements this for tuples in the main hash
table, not for skew buckets. I plan to do that, but it will require
separate chunks for each skew bucket (so we can remove it without
messing with all of them). The chunks for skew buckets should be smaller
(I'm thinking about ~4kB), because there'll be more of them, but OTOH
those buckets are for frequent values so the chunks should not remain empty.

But let's see if the current patch misses something important first.


regards
Tomas
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..5f9f2df 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 		int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
 
 /* 
  *		ExecHash
@@ -130,6 +131,9 @@ MultiExecHash(HashState *node)
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
 
+	/* print some context stats */
+	MemoryContextStats(hashtable->batchCxt);
+	
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API.  Instead, our
@@ -223,7 +227,6 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
-
 /* 
  *		ExecHashTableCreate
  *
@@ -294,6 +297,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks_batch = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -556,10 +561,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	HashChunk	oldchunks = hashtable->chunks_batch;
 
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -612,51 +617,70 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
-	{
-		HashJoinTuple prevtuple;
-		HashJoinTuple tuple;
+	/* we're going to scan through the chunks directly, not buckets, so we can reset the
+	 * buckets and reinsert all the tuples there - just set them to NULL */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
 
-		prevtuple = NULL;
-		tuple = hashtable->buckets[i];
+	/* reset the chunks (we have a copy in oldchunks) - this way we'll allocate */
+	hashtable->chunks_batch = NULL;
 
-		while (tuple != NULL)
-		{
-			/* save link in case we delete */
-			HashJoinTuple nexttuple = tuple->next;
-			int			bucketno;
-			int			batchno;
+	/* so, let's scan through the old chunks, and all tuples in each chunk */
+	while (oldchunks != NU

Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-13 Thread Tomas Vondra
On 13.7.2014 12:27, Simon Riggs wrote:
> On 12 July 2014 12:43, Tomas Vondra  wrote:
> 
>>> So lets just get this change done and then do more later.
>>
>> There's no way back, sadly. The dense allocation turned into a 
>> challenge. I like challenges. I have to solve it or I won't be able
>> to sleep.
> 
> I admire your tenacity, but how about we solve the first challenge 
> first and the second one second?

Yeah, I know. However the two patches deal with the same part of the
code, so it feels natural to work on both at the same time. And I
already had a solution in my head, and "only" needed to code it.

Also, I think the question of allocated memory / memory pressure is
important, and I'd like to avoid "this will be solved by the other
patch" only to find out that the other patch does not work.

regards
Tomas




Re: [HACKERS] things I learned from working on memory allocation

2014-07-13 Thread Andres Freund
Hi Robert,

On 2014-07-07 15:57:00 -0400, Robert Haas wrote:
> 1. I tried to write a single allocator which would be better than
> aset.c in two ways: first, by allowing allocation from dynamic shared
> memory; and second, by being more memory-efficient than aset.c.
> Heikki told me at PGCon that he thought that was a bad idea, and I now
> think he was right.

I agree with that as well. I am wondering whether it's possible to use
most of the same code and compile it once as a per process allocator and
once as the allocator for shared memory.

> Allocating from dynamic shared memory requires
> the use of relative rather than absolute pointers (because the DSM
> could be mapped at different addresses in different processes) and, if
> concurrent allocating and freeing is to be supported, locking.
> Although neither of these things requires very many extra
> instructions, the common path for aset.c is extremely short, and we
> haven't got those instructions to spare.  Code like "if (lock != NULL)
> LWLockAcquire(lock, LW_EXCLUSIVE)" is quite expensive even if the
> branch is never taken, apparently because being prepared to call a
> function at that point requires storing more stuff in our stack frame,
> and extra instructions to save and restore registers are a painfully
> expensive on allocation-heavy workloads.  On PPC64, to match aset.c's
> performance, a palloc/pfree cycle must run in ~120 instructions, ~80
> cycles, which doesn't leave much room for fluff.

The actual if (lock != NULL) bit costs significant amounts of cycles?
I'd have assumed that branch prediction takes care of that. Or is it
actually the icache not keeping up? Did you measure icache vs. dcache
misses?
Have you played with unlikely()/likely() type of macros?
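
(For reference, such macros are conventionally defined on top of GCC's
__builtin_expect; a sketch with stand-in types, not actual tree code:)

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* illustrative use for the branch discussed above: hint that the lock
 * pointer is almost always NULL, so the call can be moved off the hot
 * path (the lock type and acquire callback are stand-ins) */
static void
maybe_acquire(void *lock, void (*acquire)(void *))
{
	if (unlikely(lock != NULL))
		acquire(lock);
}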

I don't think that any such fixes will make enough of a
difference to negate the point that these simply are two different
allocators with different requirements, but it's still interesting.

> 2. After ripping the relative-pointer and locking stuff out of my
> allocator, I came up with something which performs about like aset.c
> on allocation, but with significantly better memory efficiency on
> small allocations.  REINDEX pgbench_accounts_pkey saves about 22%,
> because the SortTuple array saves nothing but the individual tuples
> take only half as much space, by avoiding the StandardChunkHeader
> overhead.  This seems like a savings worth pursuing, if we can figure
> out how to get it.

Yes, that'd certainly be nice.

> Unfortunately, there's some incremental overhead
> in pfree, amounting to a single test of global variable, even when the
> new allocator is never used, which is measurable on a microbenchmark;
> I don't remember the exact numbers off-hand.

Hm. Why's that needed? Because you're searching for your allocator's
superblocks in pfree()? I guess you don't store information about the
size/context for every allocation like aset.c does?

> When the new allocator
> *is* used, pfree gets a lot slower - mostly because one of the data
> structures used by the new allocator suffer occasional cache misses
> during free operations.  I think there might be an argument that some
> hit on pfree would be worth taking in exchange for better
> space-efficiency, because we mostly free contexts by resetting them
> anyway; but I haven't yet managed to make that hit small enough to
> feel especially good about it.

The speed bump is hit when pfree()ing an allocation that's been allocated
with the new allocator or also when there's any concurrent activity for
that allocator?

> 3. The current MemoryContext abstraction layer is inadequate for
> almost everything one might want to do.  The idea of a context that
> allows only allocation, and ignores requests to free memory, has been
> discussed more than once.  Such an allocator could skip the
> power-of-two rounding that aset.c does, but it couldn't omit the chunk
> header, which means that in practice it would save no memory at all
> for cases like the one mentioned above (REINDEX
> pgbench_accounts_pkey).

I personally think that the performance benefit of not doing the
rounding, not accessing the freelist, and such is more interesting for
such an allocator than the memory savings. I'm pretty sure such an
'allocation only' allocator for e.g. the parser and parts of the
executor would be a rather noticeable performance benefit.

But even for the cases where the space savings are the important bit: To
me it sounds feasible to declare that some allocators don't allow
reallocations and freeing. Those then would be allowed to omit the chunk
header.  To make that debuggable we would need to add a chunk header
when assertions are enabled to error out when such an operation is
performed - but that's an acceptable price imo.

Btw, am I the only one who wondered if it wouldn't be rather
beneficial to make aset.c add the chunk header before rounding?
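
A toy calculation of that idea, assuming a 16-byte chunk header and
power-of-two rounding (the actual header size depends on the build):

#include <stdio.h>

#define HDR 16

static size_t
pow2_ceil(size_t n)
{
	size_t		p = 8;

	while (p < n)
		p <<= 1;
	return p;
}

int
main(void)
{
	size_t		req;

	/* compare total allocation: rounding before vs. after adding header */
	for (req = 40; req <= 72; req += 8)
		printf("req=%zu  round-then-add=%zu  add-then-round=%zu\n",
			   req, pow2_ceil(req) + HDR, pow2_ceil(req + HDR));
	return 0;
}

Whether it helps depends on whether request + header still fits in the
same power of two (e.g. 48 + 16 = 64 does; 56 + 16 = 72 does not).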

> While there might be some benefit in other
> cases, without getting rid of or at least

Re: [HACKERS] Allowing NOT IN to use ANTI joins

2014-07-13 Thread Tom Lane
David Rowley  writes:
> I had another look at this and it appears you were right the first time: we
> need to ensure there are no NULLs on either side of the join condition.

Ugh.  I'm back to being discouraged about the usefulness of the
optimization.

> The only other way I could imagine fixing this would be to have some other
> sort of join type that always met the join condition if the right side of
> the join had no tuples... Of course I'm not suggesting it gets implemented
> this way, I'm just otherwise out of ideas.

IIRC, we looked into implementing a true NOT IN join operator years ago.
Not only is it messy as can be, but there are patents in the area :-(.
So anything more than the most brain-dead-simple approach would be
risky.

I could see implementing a variant join operator in the hash join code,
since there you get to look at the entire inner relation before you have
to give any answers.  You could easily detect both empty-inner and
inner-contains-nulls and modify the results of matching appropriately.
However, it's not apparent how that could be made to work for either
mergejoin or nestloop-with-inner-index-scan, which greatly limits the
usefulness of the approach.  Worse yet, I think this'd be at best a
marginal improvement on the existing hashed-subplan code path.
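
To make that concrete, a toy sketch of the per-probe decision such a
variant operator would need (stand-in types, not an executor patch):

#include <stdbool.h>

typedef enum { NOTIN_TRUE, NOTIN_FALSE, NOTIN_NULL } NotInResult;

static NotInResult
not_in_probe(bool inner_empty, bool inner_has_null,
			 bool outer_key_is_null, bool key_found_in_hash)
{
	if (inner_empty)
		return NOTIN_TRUE;		/* x NOT IN (empty set) is true, even for NULL x */
	if (outer_key_is_null)
		return NOTIN_NULL;		/* NULL NOT IN (non-empty set) is NULL */
	if (key_found_in_hash)
		return NOTIN_FALSE;		/* x matched an inner value */
	return inner_has_null ? NOTIN_NULL : NOTIN_TRUE;
}

In a WHERE clause, NOTIN_FALSE and NOTIN_NULL both mean the outer row is
not emitted, so only NOTIN_TRUE rows survive the join.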

regards, tom lane




Re: [HACKERS] Allowing NOT IN to use ANTI joins

2014-07-13 Thread Andres Freund
On 2014-07-13 23:06:15 +1200, David Rowley wrote:
> I had another look at this and it appears you were right the first time: we
> need to ensure there are no NULLs on either side of the join condition.

The patch is currently marked as 'ready for committer' - that doesn't
seem to correspond to the discussion. Marked as 'Waiting for Author'. Ok?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Allowing NOT IN to use ANTI joins

2014-07-13 Thread David Rowley
On Fri, Jul 11, 2014 at 1:11 PM, Tom Lane  wrote:

> I wrote:
> > We could no doubt fix this by also insisting that the left-side vars
> > be provably not null, but that's going to make the patch even slower
> > and even less often applicable.  I'm feeling discouraged about whether
> > this is worth doing in this form.
>
> Hm ... actually, there might be a better answer: what about transforming
>
>WHERE (x,y) NOT IN (SELECT provably-not-null-values FROM ...)
>
> to
>
>WHERE  AND x IS NOT NULL AND y IS NOT NULL
>
> ?
>
>
I had another look at this and it appears you were right the first time: we
need to ensure there are no NULLs on either side of the join condition.

The reason for this is that there's a special case with "WHERE col NOT
IN (SELECT id FROM empty_relation)": this is effectively the same as "WHERE
true", so we should see *all* rows, even ones where col is null. Adding a
col IS NOT NULL qual cannot be done, as it would filter out the NULLs in
this special case.

The only other way I could imagine fixing this would be to have some other
sort of join type that always met the join condition if the right side of
the join had no tuples... Of course I'm not suggesting it gets implemented
this way, I'm just otherwise out of ideas.

 Regards

David Rowley


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-13 Thread Simon Riggs
On 12 July 2014 12:43, Tomas Vondra  wrote:

>> So lets just get this change done and then do more later.
>
> There's no way back, sadly. The dense allocation turned into a
> challenge. I like challenges. I have to solve it or I won't be able to
> sleep.

I admire your tenacity, but how about we solve the first challenge
first and the second one second?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

