Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-04 Thread Tomas Vondra

Hi,

After collecting a lot more results from multiple kernel versions, I can 
confirm that I see a significant improvement with 128 and 192 clients, 
by roughly 30%:


                      64     128    192
 -----------------------------------------
 master             62482  43181  50985
 granular-locking   61701  59611  47483
 no-content-lock    62650  59819  47895
 group-update       63702  64758  62596

But I only see this with Dilip's workload, and only with pre-4.3.0 
kernels (the results above are from kernel 3.19).


With 4.5.5, results for the same benchmark look like this:

                      64     128    192
 -----------------------------------------
 master             35693  39822  42151
 granular-locking   35370  39409  41353
 no-content-lock    36201  39848  42407
 group-update       35697  39893  42667

That seems like a fairly bad regression in the kernel, although I have not 
identified the feature/commit causing it (and it's also possible the 
issue lies somewhere else, of course).


With regular pgbench, I see no improvement on any kernel version. For 
example on 3.19 the results look like this:


                      64     128    192
 -----------------------------------------
 master             54661  61014  59484
 granular-locking   55904  62481  60711
 no-content-lock    56182  62442  61234
 group-update       55019  61587  60485

I haven't done much more testing (e.g. with -N to eliminate collisions 
on branches) yet, let's see if it changes anything.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Fix checkpoint skip logic on idle systems by tracking LSN progress

2016-10-04 Thread Michael Paquier
(Squashing replies)

On Fri, Sep 30, 2016 at 6:13 PM, Michael Paquier
 wrote:
> On Fri, Sep 30, 2016 at 2:51 PM, Michael Paquier
>  wrote:
>> On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
>>  wrote:
>>> At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>>>  wrote in 
>>> <20160930.140015.150178454.horiguchi.kyot...@lab.ntt.co.jp>
>>>> I don't see no problem in setting progressAt in XLogInsertRecord.
>>>> But I doubt GetProgressRecPtr is harmful, especially when
>>>
>>> But I suspect that GetProgressRecPtr could be harmful.
>>
>> Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
>> nproc and reducing checkpoint_timeout. That's what I did but..
>
> Note: I am moving this patch to next CF.

And I am back on it more seriously... And I am taking back what I said upthread.

I looked at the v12 that Horiguchi-san has written, and that seems
correct to me. So I have squashed everything into a single patch,
including the log entry that gets logged with log_checkpoints. Testing
with archive_timeout set to 10s and checkpoint_timeout set to 30s, sometimes
triggering manual activity with CREATE TABLE/whatever and manual
pg_switch_xlog(), I am able to see that checkpoints can be correctly
skipped or generated.

There was also a compilation error with ereport(). Not sure how I
missed that... likely too many patches handled these days.

I have also updated the description of archive_timeout, which said that
increasing checkpoint_timeout would reduce unnecessary checkpoints on an
idle system. With this patch, checkpoints are simply skipped on an idle
system, as they should be.

How does that look?
-- 
Michael


hs-checkpoints-v14.patch
Description: application/download



[HACKERS] strange case of kernel performance regression (4.3.x and newer)

2016-10-04 Thread Tomas Vondra

Hi,

Over the past couple of days I've been doing a bit of benchmarking for 
the "group clog" patch [1], and I've ran into what I suspect might be a 
fairly serious performance regression on newer kernels (essentially 
4.3.0 and newer). I got to a point where I need help with further 
investigation, testing on other systems etc.


The workload tested in the patch [1] is quite simple - a transaction 
with 3 x SELECT FOR UPDATE queries and 2 x SAVEPOINT on unlogged tables. 
The results (average tps from 5 x 5 minute runs, for 32 and 64 clients) 
on multiple kernels look like this:


 kernel      32      64
 ------------------------
 3.19.8    48524   59291
 4.1.33    47193   59574
 4.2.8     48901   59877
 4.3.0     32187   38970
 4.3.6     31889   38815
 4.4.0     31946   37702
 4.4.23    31498   37724
 4.5.5     31531   37351
 4.7.6     32859   38490

Notice the sudden drop from ~50k to ~30k tps between 4.2.8 and 4.3.0 
(for 32 clients) and from 60k to 40k (for 64 clients). See the attached 
kernel-regression-e5-4620.png.


Those results are from a 4-socket machine, with e5-4620 CPUs, so 32 
physical cores in total, 64 with HT. The CPU is the v1 model (Sandy Bridge 
EP, released in 2012 and discontinued in Q2 2015), so neither particularly 
new nor obsolete.


This is on scale 300, which easily fits into RAM on the machine. The 
results are very stable and IMHO quite consistent.


I've also done some tests with regular pgbench (both read-only and 
read-write), with WAL-logged tables, and the results are quite similar.


 type   kernel     32      64
 ------------------------------
 ro     3.19.8   55796   81563
        4.4.23   38188   50983

 rw     3.19.8   32282   46234
        4.4.23   23367   31311

I've tried to reproduce the issue on another machine, but without much 
success. That machine, however, only has 2 sockets and a much newer CPU 
(e5-2620 v4, so Broadwell, released in Q1 2016).


So it might be somewhat related to the older CPU, maybe a slightly 
different kernel config, or something else.


If you have access to similar machines (2 or 4 sockets), it'd be very 
helpful if you could repeat the benchmarks and report the results, to 
confirm (or invalidate) my results.


The test scripts (and results), and kernel configs are available here:

https://bitbucket.org/tvondra/kernel-perf-regression

It's nothing fancy, mostly trivial shell scripts (you'll need to modify 
some paths in those, I guess). Testing a single kernel version takes 
roughly 1h.


Those are vanilla kernels, BTW - no customized distribution kernels with 
extra patches, etc.



[1] 
https://www.postgresql.org/message-id/flat/CAA4eK1+8=X9mSNeVeHg_NqMsOR-XKsjuqrYzQf=icsdh3u4...@mail.gmail.com


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Hash tables in dynamic shared memory

2016-10-04 Thread Magnus Hagander
On Oct 5, 2016 1:23 AM, "Thomas Munro" 
wrote:
>
> On Wed, Oct 5, 2016 at 12:11 PM, Thomas Munro
>  wrote:
> > On Wed, Oct 5, 2016 at 11:22 AM, Andres Freund 
wrote:
> >>> Potential use cases for DHT include caches, in-memory database objects
> >>> and working state for parallel execution.
> >>
> >> Is there a more concrete example, i.e. a user we'd convert to this at
> >> the same time as introducing this hashtable?
> >
> > A colleague of mine will shortly post a concrete patch to teach an
> > existing executor node how to be parallel aware, using DHT.  I'll let
> > him explain.
> >
> > I haven't looked into whether it would make sense to convert any
> > existing shmem dynahash hash table to use DHT.  The reason for doing
> > so would be to move it out to DSM segments and enable dynamically
> > growing.  I suspect that the bounded size of things like the hash
> > tables involved in (for example) predicate locking is considered a
> > feature, not a bug, so any such cluster-lifetime core-infrastructure
> > hash table would not be a candidate.  More likely candidates would be
> > ephemeral data used by the executor, as in the above-mentioned patch,
> > and long lived caches of dynamic size owned by core code or
> > extensions.  Like a shared query plan cache, if anyone can figure out
> > the invalidation magic required.
>
> Another thought: it could be used to make things like
> pg_stat_statements not have to be in shared_preload_libraries.
>

That would indeed be a great improvement. And possibly also allow the
changing of the max number of statements it can track without a restart?

I was also wondering if it might be useful as a replacement for some of
the pgstats stuff, to get rid of the cost of spooling to file and then
rebuilding the hash tables on the receiving end. I've been waiting for this
patch to figure out if that's useful. I mean keeping the stats collector doing
what it does now over UDP, but presenting the results in shared hash tables
instead of files.

/Magnus


Re: [HACKERS] Proposal: ON UPDATE REMOVE foreign key action

2016-10-04 Thread Vitaly Burovoy
On 10/4/16, Kirill Berezin  wrote:
> Disclaimer: sorry, i dont understand, should i reply to each of you
> personally, or just answer to channel. Some feedbacks were sended in
> personal, and some include channel copy.

Usually discussions are in the list, therefore you should use "reply
to all" (see [1]).
Exception is when a sender notes "Off the list".

> Thanks for responses, you understand it correctly.
>
> When i said "anybody", i mean inclusive owner himself. For example cookie
> poisoning.
> There is no "another" session, technically. They similar to the server,
> they even can have same IP.
> Yes, we can't prevent it with CSRF cookies, but it is not the point of
> current conversation.
>
> I can make business logic outside table: make extra query.

Good decision. Your case needs exactly what you've just written.

> Im just dont like how it looks from perspective of encapsulation.
> Each table should describe itself, like object in OOP language.
> With SQL constructions or triggers/constraits.

SQL is not OOP. There is no "encapsulation".

> Second part of my use case is data cache.

Hmm. Using an RDBMS as a cache, with all the overhead of Isolation and
Durability (from ACID)? Really?
To me it seems like a bad idea in most cases.

> When user update
> version(generation), cache should be flushed. As easy example: imagine i am
> fetching currency value. And till end of the day, i am honor current
> course. (Or any other information, that has certain origin checkpoints).
> When i am updating origin state (current day, server version, ip address,
> neural network generation), i am have to invalidate all previous data.

It is a bad example. Companies working with currency exchange rates
always keep their values as historical data.

> Like i am calculating digits of the square root, of some number. The more i
> spend time, the closer my approx value to irrational result. But when
> original value has changed - all previous data does not make sense. I am
> flushing it and starting from digit 1.

Why do you "update" original value instead of deleting old one and
inserting new value?

> This is allegorical examples to my real-world cases. I may try imagine some
> hypothetical situations, when this functionality more welcomed. But, i am
> respect reasons why do not apply this proposal. If my update didn't shift
> the balance, its ok. on update trigger is not such painful.

All your cases (except the exchange rate one) can be done using two
queries: delete the original row (which deletes the other linked data via "ON
DELETE CASCADE") and insert a new one. You don't even have to use
transactions!
If your business logic is so "OOP", you can use stored procedures, but
introducing new grammar specifically for a concrete task is a bad idea.


Of course at first sight the sequence "ON UPDATE SET (NULL|DEFAULT)"
looks meaningless, but the purpose of SET NULL and SET DEFAULT for
both ON UPDATE and ON DELETE is to "unlink" data from the referenced
row. It is similar to "NO ACTION", but it explicitly changes the
referencing columns since they are no longer connected to the
referenced row (by the referencing column list).

Also, your proposal is not consistent: ON UPDATE REMOVE (DELETE?), but
then what for ON DELETE? Remove/delete again?


[1] https://wiki.postgresql.org/wiki/Mailing_Lists#Using_the_discussion_lists
-- 
Best regards,
Vitaly Burovoy




Re: [HACKERS] Aggregation push down, reorder of join and group by

2016-10-04 Thread Amit Langote

Hi,

On 2016/10/05 14:19, Chenxi Li wrote:
> Hello,
> 
> I'm reading some papers about aggregation push down like in "
> https://ub-madoc.bib.uni-mannheim.de/37228/1/main.pdf" and "
> http://www.vldb.org/conf/1995/P345.PDF". I think it is very useful but very
> complex to implement. In some complex queries, it can be a lot faster. Is
> there any plan to do this in the future?

Thanks for the links.  On a quick look, I think you are referring to the
following section in the document at the first link (and the title of the
paper at the second link):

4.2 Applying Eager Aggregation

I think someone is working on something like that.  Check out the
following discussion (currently inactive though):

* Partial Aggregation / GROUP BY before JOIN *
https://www.postgresql.org/message-id/flat/CAKJS1f9kw95K2pnCKAoPmNw%3D%3D7fgjSjC-82cy1RB%2B-x-Jz0QHA%40mail.gmail.com


There is also related work which is under active development:

* Aggregate Push Down - Performing aggregation on foreign server *
https://www.postgresql.org/message-id/flat/CAM2%2B6%3DW%3Dr_vh2gpccPJ4tc%3D%2BhFm4UGQsQdZgt4GiBMYVEk%2B5vg%40mail.gmail.com

In this case, the aggregation step is pushed all the way to the remote
server where the data resides.  It stands to avoid a good deal of
unnecessary data traffic across the network.

Thanks,
Amit






Re: [HACKERS] Is the last 9.1 release planned?

2016-10-04 Thread Magnus Hagander
On Oct 5, 2016 5:42 AM, "Tsunakawa, Takayuki" <
tsunakawa.ta...@jp.fujitsu.com> wrote:
>
> From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Jaime Casanova
> > Well, no. We normally don't give special treatment to any minor release
> > not even if it is going to die.
> > What normally happens is that all minor releases are released the same
day.
> >
> > Taken your example, that same day were released: 9.0.23, 9.1.19, 9.2.14,
> > 9.3.10 and 9.4.5
> >
>
> Thanks for clarification.  Then, I understood that the expression "stop
releases in September" in the release note and a pgsql-announce mail was
not correct.

It basically means we stop guaranteeing that we do. As of a couple of days
ago, bug fixes won't necessarily be backported to 9.1 if they are
difficult. But there will be one wrap-up release in November with any
patches that have already been applied but have not yet been in a release.
And after November, we will stop doing that as well.

/Magnus


[HACKERS] Aggregation push down, reorder of join and group by

2016-10-04 Thread Chenxi Li
Hello,

I'm reading some papers about aggregation push down like in "
https://ub-madoc.bib.uni-mannheim.de/37228/1/main.pdf" and "
http://www.vldb.org/conf/1995/P345.PDF". I think it is very useful but very
complex to implement. In some complex queries, it can be a lot faster. Is
there any plan to do this in the future?

Regards


Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2016-10-04 Thread Ashutosh Bapat
>
> I looked at your patch closely.  You added code to track dependencies on
> FDW-related objects to createplan.c, but I think it would be more
> appropriate to put that code in setrefs.c like the attached.  I think
> record_foreignscan_plan_dependencies in your patch would be a bit
> inefficient because that tracks such dependencies redundantly in the case
> where the given ForeignScan has an outer subplan, so I optimized that in the
> attached.  (Also some regression tests for that were added.)

Thanks for fixing the inefficiency.

+   /*
+* Record dependencies on FDW-related objects.  If an outer subplan
+* exists, that would be done in the processing of its baserels, so skip
+* that.
+*/
I think we need a bit more explanation here. How about rewording it
as "Record dependencies on FDW-related objects. If an outer subplan
exists, the dependencies would be recorded while processing the
foreign plans for the base foreign relations in the subplan. Hence
skip that here."

+ * Currently, we track exactly the dependencies of plans on relations,
+ * user-defined functions and FDW-related objects.  On relcache invalidation
+ * events or pg_proc syscache invalidation events, we invalidate just those
+ * plans that depend on the particular object being modified.  (Note: this
+ * scheme assumes that any table modification that requires replanning will
+ * generate a relcache inval event.)  We also watch for inval events on
+ * certain other system catalogs, such as pg_namespace; but for them, our
+ * response is just to invalidate all plans.  We expect updates on those
+ * catalogs to be infrequent enough that more-detailed tracking is not worth
+ * the effort.

Just like you have added FDW-related objects in the first sentence,
reference to those needs to be added in the second sentence as well.
Or you want to start the second sentence as "When the relevant caches
are invalidated, we invalidate ..." so that those two sentences will
remain in sync even after additions/deletions to the first sentence.
>
> I think it would be a bit inefficient to use PlanCacheFuncCallback as the
> inval callback function for those caches, because that checks the inval-item
> list for the rewritten query tree, but any updates on such catalogs wouldn't
> affect the query tree.

I am not sure about that. Right now it seems that only the plans are
affected, but can we say that for all FDWs?

> So, I added a new callback function for those caches
> that is much like PlanCacheFuncCallback but skips checking the list for the
> query tree.  I updated some comments also.

I am not sure that the inefficiency because of an extra loop on
plansource->invalItems is a good reason to duplicate most of the code
in PlanCacheFuncCallback(). IMO, maintaining that extra function and
the risk of bugs because of not keeping those two functions in sync
outweigh the small, not-so-frequent gain. The name of the function could be
changed to something more generic, instead of the current one referring to Func.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2016-10-04 Thread Michael Paquier
On Mon, Oct 3, 2016 at 12:48 PM, Craig Ringer  wrote:
> On 3 October 2016 at 10:10, Michael Paquier  wrote:
>> On Tue, Sep 6, 2016 at 8:01 PM, Craig Ringer  wrote:
>>> On 6 September 2016 at 16:10, Daniel Verite  wrote:
 Craig Ringer wrote:

> Updated patch attached.

 Please find attached a couple fixes for typos I've came across in
 the doc part.
>>>
>>> Thanks, will apply and post a rebased patch soon, or if someone picks
>>> this up in the mean time they can apply your diff on top of the patch.
>>
>> Could you send an updated patch then? At the same time I am noticing
>> that git --check is complaining... This patch has tests and a
>> well-documented feature, so I'll take a look at it soon at the top of
>> my list. Moved to next CF for now.
>
> Thanks.
>
> I'd really like to teach psql in non-interactive mode to use it, but
> (a) I'm concerned about possible subtle behaviour differences arising
> if we do that and (b) I won't have the time. I think it's mostly of
> interest to app authors and driver developers and that's what it's
> aimed at. pg_restore could benefit a lot too.

Looking at it now... The most interesting comments are first.

I wanted to congratulate you. I rarely see a patch with this level of
detail in its first versions. Anybody can review the patch
just by looking at the code, and especially the docs, without looking at
the thread. There are even tests to show what this does for the
client.

+PQisInBatchMode   172
+PQqueriesInBatch  173
+PQbeginBatchMode  174
+PQendBatchMode175
+PQsendEndBatch176
+PQgetNextQuery177
+PQbatchIsAborted  178
This set of routines is a bit inconsistent. Why not just prefix
them all with PQbatch? For example:
PQbatchStatus(): consists of disabled/inactive/none, active, error.
This covers both PQbatchIsAborted() and PQisInBatchMode().
PQbatchBegin()
PQbatchEnd()
PQbatchQueueSync() or PQbatchQueueFlush, same as PQsendEndBatch() to
add and process a sync message into the queue.
PQbatchQueueCount(): returns N>0 if there are N entries, 0 if empty,
-1 on failure
PQbatchQueueProcess(): returns 1 if process can begin, 0 if not, -1 on
failure (OOM)

+   
+Much like asynchronous query mode, there is no performance disadvantage to
+using batching and pipelining. It somewhat increased client application
+complexity and extra caution is required to prevent client/server network
+deadlocks, but can offer considerable performance improvements.
+   
I would reword that a bit "it increases client application complexity
and extra caution is required to prevent client/server deadlocks but
offers considerable performance improvements".

+("ping time") is high, and when many small operations are being
performed in
Nit: should use  here. Still not quoting it would be fine.

+ After entering batch mode the application dispatches requests
+ using normal asynchronous libpq functions like
+ PQsendQueryParams,
PQsendPrepare,
+ etc. The asynchronous requests are followed by a init;
+$node->start;
+
+my $port = $node->port;
+
+$node->stop('fast');
Er... This does nothing..

The changes in src/test/examples/ are not necessary anymore. You moved
all the tests to test_libpq (for the best actually).

+   while (queue != NULL)
+   {
+   PGcommandQueueEntry *prev = queue;
+   queue = queue->next;
+   free(prev);
+   }
This should free prev->query.

/* blah comment
* blah2 comment
*/
A lot of comment blocks are like that; those should be reformatted.

Running directly make check in src/test/modules/test_libpq/ does not work:
# Postmaster PID for node "main" is 10225
# Running: testlibpqbatch dbname=postgres 1 disallowed_in_batch
Command 'testlibpqbatch' not found in [...PATH list ...]
The problem is that testlibpqbatch is not getting compiled but I think
it should.

+ * testlibpqbatch.c
+ * Test of batch execution funtionality
[...]
+   fprintf(stderr, "%s is not a recognised test name\n", argv[3]);
s/funtionality/functionality/
s/recognised/recognized/

+   if (!PQsendQueryParams(conn, "SELECT $1", 1, dummy_param_oids,
+  dummy_params, NULL, NULL, 0))
+   {
+   fprintf(stderr, "dispatching first SELECT failed: %s\n",
PQerrorMessage(conn));
+   goto fail;
+   }
+
+   if (!PQsendEndBatch(conn))
+   {
+   fprintf(stderr, "Ending first batch failed: %s\n",
PQerrorMessage(conn));
+   goto fail;
+   }
+
+   if (!PQsendQueryParams(conn, "SELECT $1", 1, dummy_param_oids,
+  dummy_params, NULL, NULL, 0))
+   {
+   fprintf(stderr, "dispatching second SELECT failed: %s\n",
PQerrorMessage(conn));
+   goto fail;
+   }
It may be better to use a loop here and queue up a bunch of queries.

You could just remove the VERBOSE flag in the tests; having a more
talkative test is always better.

+
+ The batch API was introduced in PostgreSQL 9.6, b

Re: [HACKERS] Declarative partitioning - another take

2016-10-04 Thread Petr Jelinek
On 05/10/16 03:50, Robert Haas wrote:
> On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
>  wrote:
>> [ latest patch set ]
> 
> Reviewing 0003:
> 
> +  with matching types and no more. Also, it must have all the matching
> +  constraints as the target table.  That includes both NOT 
> NULL
> +  and CHECK constraints.  If some CHECK
> +  constraint of the table being attached is marked NO
> INHERIT,
> +  the command will fail; such constraints must either be dropped or
> +  recreated without the NO INHERIT mark.
> 
> Why all of these requirements?  We could instead perform a scan to
> validate that the constraints are met.  I think the way this should
> work is:

I think it's a survivable limitation in the initial version; it does not seem
to me like there is anything that prevents improving this in some follow-up
patch.

> 
> +  Currently UNIQUE, PRIMARY KEY, 
> and
> +  FOREIGN KEY constraints are not considered, but that
> +  might change in the future.
> 
> Really?  Changing that sounds impractical to me.
> 

Which part of that statement?

> 
> +  Note that if a partition being detached is itself a partitioned table,
> +  it continues to exist as such.
> 
> You don't really need to say this, I think.  All of the properties of
> the detached table are retained, not only its partitioning status.
> You wouldn't like it if I told you to document "note that if a
> partition being detached is unlogged, it will still be unlogged".
> 

I think this is a bit different though; it basically means you are
detaching a whole branch from the rest of the partitioning tree, not just
that one partition. To me that's worth mentioning to avoid potential
confusion.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-04 Thread Petr Jelinek
On 05/10/16 03:11, Tomas Vondra wrote:
> On 10/04/2016 09:44 PM, John Gorman wrote:
>>
>> Remind me again why we cannot realloc in place for sizes
>> smaller than chunkSize? GenSlab is happy to allocate smaller
>> sizes and put them into the fixed size chunks.
>>
>> Maybe SlabAlloc can be happy with sizes up to chunkSize.
>>
>> if (size <= set->chunkSize)
>> return MemoryContextAlloc(set->slab, size);
>> else
>> return MemoryContextAlloc(set->aset, size);
>>
> 
> That'd be certainly possible, but it seems a bit strange as the whole
> Slab is based on the idea that all chunks have the same size. Moreover,
> people usually realloc() to a larger chunk, so it does not really fix
> anything in practice.
> 
> In GenSlab, the situation is more complicated. But I don't like the
> coupling / moving chunks between contexts, etc.
> 
> If realloc() support is a hard requirement, it immediately rules out
> SlabContext() as an independent memory context. Which seems stupid, as
> independent Slab contexts are quite useful for reorderbuffer use case.
> 
> For GenSlab the situation is less clear, as there probably are ways to
> make it work, but I'd vote to keep it simple for now, and simply do
> elog(ERROR) in the realloc() methods - both for Slab and GenSlab. The
> current use case (reorderbuffer) does not need that, and it seems like a
> can of worms to me.
> 

Hmm, so in practice this means that the caller still has to know the
details of what chunks go where. I would prefer if the realloc just
always failed and "don't do realloc on GenSlab" were part of the spec of
that context; the randomness that you described originally is the main
problem IMHO. Maybe you could add a new "constructor" function for Aset
that would create an Aset which can't realloc, for use inside the GenSlab?

An alternative would of course be having the individual API calls behind
Aset and Slab exported and used by GenSlab directly, instead of using
child contexts. Then all the calls would go to GenSlab, which could
decide what to do (and move the chunks between the allocators).

But honestly, given the use cases for GenSlab, at the moment I would
prefer just to have a predictable error, as that can be done more cleanly
and nobody needs the functionality so far; it can be revisited once we
actually do need it.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-04 Thread Tomas Vondra

Hi,

attached is v3 of the patches, with a few minor fixes in Slab, and much 
larger fixes in GenSlab.


Slab (minor fixes)
--
- Removed the unnecessary memset() of new blocks in SlabAlloc(), 
although we still need to zero the free bitmap at the end of the block.
- Renamed minFreeCount to minFreeChunks, added a few comments explaining 
why/how firstFreeChunk and minFreeChunks are maintained.

- Fixed / improved a bunch of additional comments, based on feedback.

GenSlab
---
Fixed a bunch of bugs that made GenSlab utterly useless. Firstly, 
chunkSize was not stored in GenSlabContextCreate(), so this check in 
SlabAlloc()


if (size <= set->chunkSize)
return MemoryContextAlloc(set->slab, set->chunkSize);
else
return MemoryContextAlloc(set->aset, size);

always fell through to the set->aset case, not allocating stuff in the 
Slab at all.


Secondly, nallocations / nbytes counters were not updated at all, so the 
Slab was never recreated, so GenSlab was not really generational.


This only affected 1 of the 3 contexts in ReorderBuffer, but apparently 
one of the "important ones" for performance.


Both issues are fixed in the attached v3, which also introduces two 
additional improvements discussed in this thread:


- The chunk size is limited by ALLOCSET_SEPARATE_THRESHOLD, as for large 
chunks AllocSet works just fine (thanks to keeping them out of free list 
etc.)


- Instead of specifying blockSize and chunkSize, GenSlabCreate() now 
accepts three parameters - minBlockSize, minChunkCount and chunkSize, 
and computes the minimum block size (>= minBlockSize), sufficient to 
store minChunkCount chunks, each chunkSize bytes. This works much better 
in the auto-tuning scenario.
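For reference, the block-size computation described above could look
roughly like this; it is only a sketch, and SLAB_BLOCK_OVERHEAD /
SLAB_CHUNK_OVERHEAD are placeholder names for whatever header sizes the
patch actually uses:

static Size
choose_block_size(Size minBlockSize, int minChunkCount, Size chunkSize)
{
    Size    needed;

    /* block header plus minChunkCount fixed-size (aligned) chunks */
    needed = SLAB_BLOCK_OVERHEAD +
        (Size) minChunkCount * (SLAB_CHUNK_OVERHEAD + MAXALIGN(chunkSize));

    /* never go below the requested minimum block size */
    return Max(minBlockSize, needed);
}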


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


slab-allocators-v3.tgz
Description: application/compressed-tar



Re: [HACKERS] [PATCH] Generic type subscription

2016-10-04 Thread Dmitry Dolgov
On 5 October 2016 at 03:00, Oleg Bartunov  wrote:

>
> have you ever run 'make check' ?
>
> =
>  53 of 168 tests failed.
> =
>
>
Sure, how else could I write tests for this feature? But right now on my
machine
everything is ok (the same for `make installcheck`):

$ make check

===
 All 168 tests passed.
===


Re: [HACKERS] Is the last 9.1 release planned?

2016-10-04 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Jaime Casanova
> Well, no. We normally don't give special treatment to any minor release
> not even if it is going to die.
> What normally happens is that all minor releases are released the same day.
> 
> Taken your example, that same day were released: 9.0.23, 9.1.19, 9.2.14,
> 9.3.10 and 9.4.5
> 

Thanks for clarification.  Then, I understood that the expression "stop 
releases in September" in the release note and a pgsql-announce mail was not 
correct.

Regards
Takayuki Tsunakawa




Re: [HACKERS] Is the last 9.1 release planned?

2016-10-04 Thread Jaime Casanova
On 4 October 2016 at 22:00, Tsunakawa, Takayuki
 wrote:
>
> and 9.0.23 was released in October 8.  So I guessed 9.1.24 will be released 
> in a week or so.
>

Well, no. We normally don't give special treatment to any minor
release not even if it is going to die.
What normally happens is that all minor releases are released the same day.

Taken your example, that same day were released: 9.0.23, 9.1.19,
9.2.14, 9.3.10 and 9.4.5

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Is the last 9.1 release planned?

2016-10-04 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:michael.paqu...@gmail.com]
> 9.1.24 will be the last in the 9.1 series as far as I know. And it is still
> to come at the beginning of November:
> https://www.postgresql.org/developer/roadmap/

But the release note for 9.1.23 says:

"The PostgreSQL community will stop releasing updates for the 9.1.X release 
series in September 2016. Users are encouraged to update to a newer release 
branch soon."


OTOH, the 9.0.22 release note said:

"The PostgreSQL community will stop releasing updates for the 9.0.X release 
series in September 2015. Users are encouraged to update to a newer release 
branch soon."

and the 9.0.23, which is the last release for 9.0 said:

"This is expected to be the last PostgreSQL release in the 9.0.X series. Users 
are encouraged to update to a newer release branch soon."

and 9.0.23 was released in October 8.  So I guessed 9.1.24 will be released in 
a week or so.

Regards
Takayuki Tsunakawa

 



Re: [HACKERS] Declarative partitioning - another take

2016-10-04 Thread Robert Haas
On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
 wrote:
> [ latest patch set ]

Reviewing 0003:

+  This form attaches an existing table (partitioned or otherwise) as

(which might itself be partitioned)

+  partition of the target table.  Partition bound specification must
+  correspond with the partition method and the key of the target table.

The partition bound specification must correspond to the partitioning
method and partitioning key of the target table.

+  The table being attached must have all the columns of the target table
+  with matching types and no more. Also, it must have all the matching

The table to be attached must have all of the same columns as the
target table and no more; moreover, the column types must also match.

+  with matching types and no more. Also, it must have all the matching
+  constraints as the target table.  That includes both NOT NULL
+  and CHECK constraints.  If some CHECK
+  constraint of the table being attached is marked NO
INHERIT,
+  the command will fail; such constraints must either be dropped or
+  recreated without the NO INHERIT mark.

Why all of these requirements?  We could instead perform a scan to
validate that the constraints are met.  I think the way this should
work is:

1. ATTACH PARTITION works whether matching NOT NULL and CHECK
constraints are present or not.

2. If all of the constraints are present, and a validated constraint
matching the implicit partitioning constraint is also present, then
ATTACH PARTITION does not scan the table to validate constraints;
otherwise, it does.

3. NO VALIDATE is not an option.

+  Currently UNIQUE, PRIMARY KEY, and
+  FOREIGN KEY constraints are not considered, but that
+  might change in the future.

Really?  Changing that sounds impractical to me.

+  This form detaches specified partition of the target table.  The
+  detached partition continues to exist as a standalone table with no ties
+  remaining with the target table.

continues to exist as a standalone table, but no longer has any ties
to the table from which it was detached.

+  Note that if a partition being detached is itself a partitioned table,
+  it continues to exist as such.

You don't really need to say this, I think.  All of the properties of
the detached table are retained, not only its partitioning status.
You wouldn't like it if I told you to document "note that if a
partition being detached is unlogged, it will still be unlogged".

To add the table as a new child of a parent table, you must own the
-   parent table as well.
+   parent table as well.  That applies to both adding the table as a
+   inheritance child of a parent table and attaching a table as partition to
+   the table.

To add the table as a new child of a parent table, or as a new
partition of a partitioned table, you must own the parent table as
well.

+The name of the table to attach as a new partition to or
detach from this table.

s/to or/or to/

+NO VALIDATE option is spcified.

Typo, but see comments above about nuking this altogether.

 A recursive DROP COLUMN operation will remove a
 descendant table's column only if the descendant does not inherit
 that column from any other parents and never had an independent
-definition of the column.  A nonrecursive DROP
+definition of the column (which always holds if the descendant table
+is a partition).  A nonrecursive DROP
 COLUMN (i.e., ALTER TABLE ONLY ... DROP
 COLUMN) never removes any descendant columns, but
-instead marks them as independently defined rather than inherited.
+instead marks them as independently defined rather than inherited,
+unless the descendant table is a partition.

This is a hairy explanation.  I suggest that the documentation say
this (and that the code implement it): A nonrecursive DROP TABLE
command will fail for a partitioned table, because all partitions of a
table must have the same columns as the partitioning root.

-that are not marked NO INHERIT.
+that are not marked NO INHERIT which are unsupported if
+the table is a partitioned table.

I think you can omit this hunk.

+   If PARTITION OF clause is specified then the table is
+   created as a partition of parent_table with specified
+   bounds.  However, unlike regular tables, one cannot specify
+   PARTITION BY clause which means foreign tables can
+   only be created as leaf partitions.

I'd delete the sentence beginning with "However".

+   Create foreign table measurement_y2016m07, which will be
+   accessed through the server server_07, that is partition
+   of the range partitioned table measurement:

s/, that is/ as a/

+and partition_bound_spec is:
+
+FOR VALUES { list_spec |
range_spec }

I think you can inline the definitions of list_spec and range_spec
here instead of making them separate productions, and I think that
would be preferable.

FOR VALUES { IN ( expression [, ...] 

Re: [HACKERS] Move allocation size overflow handling to MemoryContextAllocExtended()?

2016-10-04 Thread Tom Lane
Andres Freund  writes:
> On 2016-10-05 09:38:15 +0900, Michael Paquier wrote:
>> The existing interface of MemoryContextAlloc does not care much about
>> anything except Size, so I'd just give the responsibility to the
>> caller to do checks like queue != (Size) queue when queue is a uint64
>> for example.

> Well, that duplicates the check and error message everywhere.

It seems like you're on the edge of reinventing calloc(), which is not an
API that anybody especially likes.  The problem with trying to centralize
overflow handling is that there are too many different cases.  calloc()
handles exactly one case, where the request is to be a product of exactly
two inputs that are individually not capable of overflowing.  As best I
can follow what you've got in mind here, it would be equally likely to
cover only a subset of cases.

regards, tom lane




Re: [HACKERS] Is the last 9.1 release planned?

2016-10-04 Thread Michael Paquier
On Wed, Oct 5, 2016 at 10:09 AM, Tsunakawa, Takayuki
 wrote:
> [Excerpt]
> PostgreSQL version 9.1 will be End-of-Life in September 2016.  The project 
> expects to only release one more update for that version.

9.1.24 will be the last in the 9.1 series as far as I know. And it is
still to come at the beginning of November:
https://www.postgresql.org/developer/roadmap/
-- 
Michael




Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-04 Thread Tomas Vondra

On 10/04/2016 09:44 PM, John Gorman wrote:
> SlabContext has a parent context. It can delegate requests that
> it cannot handle to the parent context which is GenSlab. Genslab
> can send them to the sister aset context.

But Slab may also be used separately, not just as part of GenSlab 
(actually, reorderbuffer has two such contexts). That complicates things 
quite a bit, and it also seems a bit awkward, because:

(a) It'd require a flag in SlabContext (or perhaps a pointer to the 
second context), which introduces coupling between the contexts.

(b) SlabContext was meant to be extremely simple (based on the "single 
chunk size" idea), and this contradicts that a bit.

(c) It'd move chunks between the memory contexts in unpredictable ways 
(although the user should treat it as a single context, and not reset 
the parts independently for example).

> This could handle all reallocs so there will be no surprises.

Yeah, but it's also

> Remind me again why we cannot realloc in place for sizes
> smaller than chunkSize? GenSlab is happy to allocate smaller
> sizes and put them into the fixed size chunks.
>
> Maybe SlabAlloc can be happy with sizes up to chunkSize.
>
> if (size <= set->chunkSize)
> return MemoryContextAlloc(set->slab, size);
> else
> return MemoryContextAlloc(set->aset, size);



That'd be certainly possible, but it seems a bit strange as the whole 
Slab is based on the idea that all chunks have the same size. Moreover, 
people usually realloc() to a larger chunk, so it does not really fix 
anything in practice.


In GenSlab, the situation is more complicated. But I don't like the 
coupling / moving chunks between contexts, etc.


If realloc() support is a hard requirement, it immediately rules out 
SlabContext() as an independent memory context. Which seems stupid, as 
independent Slab contexts are quite useful for reorderbuffer use case.


For GenSlab the situation is less clear, as there probably are ways to 
make it work, but I'd vote to keep it simple for now, and simply do 
elog(ERROR) in the realloc() methods - both for Slab and GenSlab. The 
current use case (reorderbuffer) does not need that, and it seems like a 
can of worms to me.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Is the last 9.1 release planned?

2016-10-04 Thread Tsunakawa, Takayuki
Hello,

(Please point me to the appropriate ML if this is not the right one.)

According to the following mail, I thought one more release for 9.1 (9.1.24) 
was scheduled in September.  Is there any release plan for the 9.1 last 
release?  If there's, I want to wait for it, and apply 9.1.23 otherwise.

https://www.postgresql.org/message-id/1470924187.12735.59.ca...@gunduz.org

[Excerpt]
PostgreSQL version 9.1 will be End-of-Life in September 2016.  The project 
expects to only release one more update for that version.


Regards
Takayuki Tsunakawa





[HACKERS] Feature suggestion: ON CONFLICT DO NOTHING RETURNING which returns existing row data

2016-10-04 Thread Tom Dunstan
Hi all

We recently moved to using 9.5 and were hoping to use the new upsert 
functionality, but unfortunately it doesn’t quite do what we need.

Our setup is something like this:

CREATE TABLE t1 (
  id BIGSERIAL NOT NULL PRIMARY KEY,
  bk1 INT,
  bk2 UUID
  -- other columns
);
CREATE UNIQUE INDEX t1_bk ON t1 (bk1, bk2);

CREATE TABLE t2 (
  t1_id BIGINT NOT NULL REFERENCES t1
 -- other stuff
);

Data comes in as inserts of one tuple each of t1 and t2. We expect inserts to 
t1 to be heavily duplicated. That is, for stuff coming in we expect a large 
number of rows to have duplicate (bk1, bk2), and we wish to discard those, but 
not discard the t2 tuple - those should always be inserted and reference the 
correct t1 record.

So we currently have an insertion function that does this:

BEGIN
  INSERT INTO t1 (bk1, bk2, other columns)
  VALUES (bk1val, bk2val, other values)
  RETURNING id
  INTO t1_id;
EXCEPTION WHEN unique_violation THEN
  SELECT id
  FROM t1
  WHERE bk1 = bk1val AND bk2 = bk2val
  INTO t1_id;
END;

INSERT INTO t2(t1_id, other columns) VALUES(t1_id, other values);

We were hoping that we’d be able to do something like this:

INSERT INTO t1 (bk1, bk2, other columns)
  VALUES (bk1val, bk2val, other values)
  ON CONFLICT (bk1val, bk2val) DO NOTHING
  RETURNING id
  INTO t1_id;
INSERT INTO t2(t1_id, other columns) VALUES(t1_id, other values);

But unfortunately it seems that the RETURNING clause returns null when there’s 
a conflict, rather than the existing row’s value.

I understand that there is ambiguity if there were multiple rows that were in 
conflict. I think this sort of functionality really only makes sense where the 
conflict target is a unique constraint, so IMO it would make sense to only 
support returning columns in that case.

I imagine that this would be possible to do more efficiently than the 
subsequent query that we are currently doing given that postgres has already 
found the rows in question, in the index at least. I have no idea how hard it 
would actually be to implement though. FWIW my use-case would be supported even 
if this only worked for indexes where the to-be-returned columns were stored in 
the index using Anastasia’s covering + unique index patch, when that lands.

Thoughts?

Tom





Re: [HACKERS] Move allocation size overflow handling to MemoryContextAllocExtended()?

2016-10-04 Thread Andres Freund
On 2016-10-05 09:38:15 +0900, Michael Paquier wrote:
> On Wed, Oct 5, 2016 at 8:56 AM, Andres Freund  wrote:
> > That made me wonder if it's not actually a mistake for
> > MemoryContextAllocExtended() size parameter to be declared
> > Size/size_t. That prevents it from detecting such overflows, forcing
> > code like the above on callsites.
> >
> > Comments?
> 
> The existing interface of MemoryContextAlloc does not care much about
> anything except Size, so I'd just give the responsibility to the
> caller to do checks like queue != (Size) queue when queue is a uint64
> for example.

Well, that duplicates the check and error message everywhere.

And queue != (Size) queue will cause errors like
/home/andres/src/postgresql/src/include/lib/simplehash.h:182:44:
warning: self-comparison always evaluates to false [-Wtautological-compare]

> And I can see that your patch is using uint32 for SH_TYPE->size.

But it multiplies the size by sizeof(element)...

Greetings,

Andres Freund




Re: [HACKERS] Move allocation size overflow handling to MemoryContextAllocExtended()?

2016-10-04 Thread Michael Paquier
On Wed, Oct 5, 2016 at 8:56 AM, Andres Freund  wrote:
> That made me wonder if it's not actually a mistake for
> MemoryContextAllocExtended() size parameter to be declared
> Size/size_t. That prevents it from detecting such overflows, forcing
> code like the above on callsites.
>
> Comments?

The existing interface of MemoryContextAlloc does not care much about
anything except Size, so I'd just give the responsibility to the
caller to do checks like queue != (Size) queue when queue is a uint64
for example. And I can see that your patch is using uint32 for
SH_TYPE->size.
 --
Michael




[HACKERS] Move allocation size overflow handling to MemoryContextAllocExtended()?

2016-10-04 Thread Andres Freund
Hi,

I was working on making sure that allocations for [1] don't overflow
size_t for large hash tables, when created on 32bit systems.  I started
to write code like

if ((sizeof(SH_CONTAINS) * (uint64) tb->size) !=
    (sizeof(SH_CONTAINS) * (size_t) tb->size))
{
    elog(ERROR, "hash table too large for a 32 bit system");
}

which could potentially, although somewhat unlikely, be triggered on a
32bit system.


That made me wonder if it's not actually a mistake for
MemoryContextAllocExtended() size parameter to be declared
Size/size_t. That prevents it from detecting such overflows, forcing
code like the above on callsites.
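For illustration, one way to centralize this on the caller side is a
tiny checked-multiply helper; checked_mul_size is a made-up name here,
not an existing function:

static size_t
checked_mul_size(uint64 nelems, size_t elemsize)
{
    uint64  bytes = nelems * (uint64) elemsize;

    /* did the 64-bit multiplication itself overflow? */
    if (nelems != 0 && bytes / nelems != (uint64) elemsize)
        elog(ERROR, "allocation request overflows");

    /* does the result still fit into size_t (only an issue on 32 bit)? */
    if (bytes != (uint64) (size_t) bytes)
        elog(ERROR, "allocation request too large for a 32 bit system");

    return (size_t) bytes;
}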

Comments?

- Andres

[1] 
http://archives.postgresql.org/message-id/20160727004333.r3e2k2y6fvk2ntup%40alap3.anarazel.de




Re: [HACKERS] Hash tables in dynamic shared memory

2016-10-04 Thread Thomas Munro
On Wed, Oct 5, 2016 at 12:11 PM, Thomas Munro
 wrote:
> On Wed, Oct 5, 2016 at 11:22 AM, Andres Freund  wrote:
>>> Potential use cases for DHT include caches, in-memory database objects
>>> and working state for parallel execution.
>>
>> Is there a more concrete example, i.e. a user we'd convert to this at
>> the same time as introducing this hashtable?
>
> A colleague of mine will shortly post a concrete patch to teach an
> existing executor node how to be parallel aware, using DHT.  I'll let
> him explain.
>
> I haven't looked into whether it would make sense to convert any
> existing shmem dynahash hash table to use DHT.  The reason for doing
> so would be to move it out to DSM segments and enable dynamically
> growing.  I suspect that the bounded size of things like the hash
> tables involved in (for example) predicate locking is considered a
> feature, not a bug, so any such cluster-lifetime core-infrastructure
> hash table would not be a candidate.  More likely candidates would be
> ephemeral data used by the executor, as in the above-mentioned patch,
> and long lived caches of dynamic size owned by core code or
> extensions.  Like a shared query plan cache, if anyone can figure out
> the invalidation magic required.

Another thought: it could be used to make things like
pg_stat_statements not have to be in shared_preload_libraries.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Hash tables in dynamic shared memory

2016-10-04 Thread Thomas Munro
On Wed, Oct 5, 2016 at 11:22 AM, Andres Freund  wrote:
>> Potential use cases for DHT include caches, in-memory database objects
>> and working state for parallel execution.
>
> Is there a more concrete example, i.e. a user we'd convert to this at
> the same time as introducing this hashtable?

A colleague of mine will shortly post a concrete patch to teach an
existing executor node how to be parallel aware, using DHT.  I'll let
him explain.

I haven't looked into whether it would make sense to convert any
existing shmem dynahash hash table to use DHT.  The reason for doing
so would be to move it out to DSM segments and enable dynamically
growing.  I suspect that the bounded size of things like the hash
tables involved in (for example) predicate locking is considered a
feature, not a bug, so any such cluster-lifetime core-infrastructure
hash table would not be a candidate.  More likely candidates would be
ephemeral data used by the executor, as in the above-mentioned patch,
and long lived caches of dynamic size owned by core code or
extensions.  Like a shared query plan cache, if anyone can figure out
the invalidation magic required.

-- 
Thomas Munro
http://www.enterprisedb.com




[HACKERS] Idea: Tell compiler palloc() et al return aligned values

2016-10-04 Thread Andres Freund
Hi,

Just saw a fair amount of memset/memcpy in a profile, with the symbol
name indicating that the compiler assumes the data is unaligned.  After
pondering it for a few moments, that's not too surprising: Even if the
memory is allocated in the same place, the compiler has no way of knowing
that palloc et al always return aligned memory.

Seems like a good idea to replace palloc et al with a macro that does
__builtin_assume_aligned(real_palloc(), 8) or such.
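A rough sketch of the idea, assuming GCC or clang; real_palloc here just
stands in for the existing allocator entry point and is not an actual
symbol:

#if defined(__GNUC__) || defined(__clang__)
#define palloc(sz)  __builtin_assume_aligned(real_palloc(sz), 8)
#else
#define palloc(sz)  real_palloc(sz)
#endif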

Greetings,

Andres Freund




Re: [HACKERS] Hash tables in dynamic shared memory

2016-10-04 Thread Andres Freund
Hi,

> It's impossible to write a general purpose hash table that will be
> suitable for every use case, considering all the combinations of
> design trade offs and subsets of functionality someone might want.

Very much agreed.


> There is other work being done in this space:  I'm aware of Andres's
> current thread[1] on Robin Hood-style closed hashing tables with
> macro-ised size, hash and comparator operations à la C++ container
> templates, and I'm following along with interest.

> He vigorously
> rejects chaining hash tables as the natural enemies of memory
> hardware.

I'd not go quite that far. There are good arguments for open addressing,
and there are good arguments for open chaining. The latter is a lot more
convincing if you want concurrent access with partition based locking,
for example...


> I guess Andres's simplehash should already work in DSA memory nicely
> anyway since it doesn't have any internal pointers IIUC (?).

The data access doesn't have pointers, but the "metadata" does have a
pointer to the data. But that's a very solvable problem.  But
incremental resizing and granular locking aren't going to happen for it,
so it'd really only be suitable for presenting a read-only (or possibly
read-mostly) view.


> Potential use cases for DHT include caches, in-memory database objects
> and working state for parallel execution.

Is there a more concrete example, i.e. a user we'd convert to this at
the same time as introducing this hashtable?


> There are a couple of things I'm not happy about in this version of
> DHT: the space overhead per item is too high (hash + wasted padding +
> next pointer), and the iterator system is overly complicated.  I have
> another version in development which gets rid of the hash and padding
> and sometimes gets rid of the next pointer to achieve zero overhead,
> and I'm working on a simpler iteration algorithm that keeps the
> current guarantees but doesn't need to reorder bucket chains.  More on
> that, with some notes on performance and a testing module, soon.

FWIW, my experimentation showed that getting rid of the hash itself is
quite often *NOT* a worthwhile tradeoff. Comparing keys and recomputing
hashes is often quite expensive (e.g. for GROUP BY it's where the
majority of time is spent after the simplehash patch).
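To illustrate the trade-off generically (this is not simplehash or DHT
code, just a sketch of the pattern, assuming the usual PostgreSQL
headers):

typedef struct cached_entry
{
    uint32      hash;       /* hash of key, computed once at insert time */
    const char *key;
    void       *value;
    struct cached_entry *next;
} cached_entry;

/*
 * Cheap integer comparison first; the expensive key comparison only runs
 * when the cached hashes already match.
 */
static bool
entry_matches(const cached_entry *e, uint32 hash, const char *key)
{
    return e->hash == hash && strcmp(e->key, key) == 0;
}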

Greetings,

Andres Freund




Re: [HACKERS] proposal: psql \setfileref

2016-10-04 Thread Gilles Darold
On 04/10/2016 17:29, Pavel Stehule wrote:
>
>
> 2016-10-04 9:18 GMT+02:00 Gilles Darold  >:
>
> On 03/10/2016 23:23, Gilles Darold wrote:
> > On 03/10/2016 23:03, Robert Haas wrote:
> >> On Mon, Oct 3, 2016 at 3:54 PM, Gilles Darold
> mailto:gil...@darold.net>> wrote:
> >>> 4) An other problem is that like this this patch will allow
> anyone to upload into a
> >>> column the content of any system file that can be read by
> postgres system user
> >>> and then allow non system user to read its content.
> >> I thought this was a client-side feature, so that it would let a
> >> client upload any file that the client can read, but not things
> that
> >> can only be read by the postgres system user.
> >>
> > Yes that's right, sorry for the noise, forget this fourth report.
> >
>
> After some more thought there is still a security issue here. For a
> PostgreSQL user who also has login access to the server, it is
> possible
> to read any file that the postgres system user can read, especially a
> .pgpass or a recovery.conf containing passwords.
>
>
> This patch doesn't introduce any new server side functionality, so if
> there is some vulnerability, then it exists now too.
>

It doesn't exist; that was my system user, which has extended
privileges. You can definitely forget the fourth point.

-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org



[HACKERS] Hash tables in dynamic shared memory

2016-10-04 Thread Thomas Munro
Hi hackers,

Here is an experimental hash table implementation that uses DSA memory
so that hash tables can be shared by multiple backends and yet grow
dynamically.  Development name:  "DHT".

It's impossible to write a general purpose hash table that will be
suitable for every use case, considering all the combinations of
design trade offs and subsets of functionality someone might want.
What I'm trying to do here is produce something that is generally
useful for many cases and easy to use, along the lines of dynahash,
but use dynamic shared memory.  Here is the bullet point summary:

 * DSA memory-backed
 * open hashing (separate chaining)
 * incremental resizing
 * built-in partition-based locking
 * concurrent iteration
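
To make the intended usage concrete, here is a rough sketch of the kind
of code a caller would write (type and function names are simplified and
may not match the patch exactly; see the attached patch for the real
interface):

    /* Sketch only: assumes a dsa_area from dsa-v2.patch. */
    typedef struct MyEntry
    {
        uint32      key;        /* hashed/compared per dht_parameters */
        int64       count;
    } MyEntry;

    void
    bump_counter(dsa_area *area)
    {
        dht_parameters params = {0};
        dht_hash_table *ht;
        MyEntry    *entry;
        uint32      key = 42;
        bool        found;

        params.key_size = sizeof(uint32);
        params.entry_size = sizeof(MyEntry);

        ht = dht_create(area, &params);     /* shared, resizable table */

        entry = dht_find_or_insert(ht, &key, &found);  /* locks partition */
        if (!found)
            entry->count = 0;               /* initialize a fresh entry */
        entry->count++;
        dht_release(ht, entry);             /* releases the partition lock */
    }

Another backend attached to the same table sees the same entries and can
run the same code concurrently; locking is per partition, so lookups of
different keys mostly don't contend.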

There is other work being done in this space:  I'm aware of Andres's
current thread[1] on Robin Hood-style closed hashing tables with
macro-ised size, hash and comparator operations à la C++ container
templates, and I'm following along with interest.  He vigorously
rejects chaining hash tables as the natural enemies of memory
hardware.  Obviously there are pros and cons: this node-based chaining
design allows resizing, iteration with stronger guarantees, and users
can point directly to (or into) the entries from other data
structures, but yeah... it has to follow pointers into far away cache
lines to do that.  I guess Andres's simplehash should already work in
DSA memory nicely anyway since it doesn't have any internal pointers
IIUC (?).  As for concurrency, Robert Haas also did some interesting
experiments[2] with (nearly) lock-free hash tables and hazard
pointers.  I'm not sure what the optimum number of different
implementations or levels of configurability is, and I'm certainly not
wedded to this particular set of design tradeoffs.

Potential use cases for DHT include caches, in-memory database objects
and working state for parallel execution.  (Not a use case: the shared
hash table for my as-yet-unposted DSA-based shared parallel hash join
table work, which follows the existing open-coded dense_alloc-based
approach for reasons I'll explain later.)

There are a couple of things I'm not happy about in this version of
DHT: the space overhead per item is too high (hash + wasted padding +
next pointer), and the iterator system is overly complicated.  I have
another version in development which gets rid of the hash and padding
and sometimes gets rid of the next pointer to achieve zero overhead,
and I'm working on a simpler iteration algorithm that keeps the
current guarantees but doesn't need to reorder bucket chains.  More on
that, with some notes on performance and a testing module, soon.

Please find attached dht-v1.patch, which depends on dsa-v2.patch[3].
I'm posting this version of DHT today because it is a dependency of
another patch due on this list shortly.

See also simple-cache.patch which contains a simple smoke test
extension so you can type SELECT simple_cache_put('42', 'Hello
world'), SELECT simple_cache_get('42') etc.

Thanks to my colleagues Amit Khandekar, Amul Sul, Dilip Kumar and John
Gorman for bug fixes and suggestions, and Robert Haas for feedback.

Please let me know your thoughts, and thanks for reading.

[1] 
https://www.postgresql.org/message-id/flat/20161003041215.tfrifbeext3i7nkj%40alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/CA%2BTgmoYE4t-Pt%2Bv08kMO5u_XN-HNKBWtfMgcUXEGBrQiVgdV9Q%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/flat/CAEepm%3D1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


dht-v1.patch
Description: Binary data


simple-cache.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Creating a DSA area to provide work space for parallel execution

2016-10-04 Thread Thomas Munro
Hi hackers,

A couple of months ago I proposed dynamic shared areas[1].  DSA areas
are dynamically sized shared memory heaps that backends can use to
share data, building on top of the existing DSM infrastructure.

One target use case for DSA areas is to provide work space for
parallel query execution.  To that end, here is a patch to create a
DSA area for use by executor code.  The area is automatically attached
to the leader and all worker processes for the duration of the
parallel query, and is available as estate->es_query_area.

Backends already have access to shared memory through a single DSM
segment managed with a table-of-contents.  The TOC provides a way to
carve out some shared storage space for individual executor nodes and
look it up later by plan node ID.  That works for things like
ParallelHeapScanDescData whose size is known up front, but not so well
if you need something more like a heap in which to build shared data
structures.  Through estate->es_query_area, a parallel-aware executor
node can use and recycle arbitrary amounts of shared memory with an
allocate/free interface.
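
As a rough sketch of the intended usage (the per-node struct, the callback
names and the sizes below are invented for illustration; shm_toc_* is the
existing DSM machinery, and dsa_allocate()/dsa_get_address() come from
dsa-v2.patch):

    typedef struct SharedFooState
    {
        dsa_pointer table;      /* relative pointer into the DSA area */
    } SharedFooState;

    /* Leader, in an ExecXXXInitializeDSM()-style callback. */
    static void
    foo_initialize_dsm(PlanState *ps, ParallelContext *pcxt)
    {
        SharedFooState *shared;

        shared = shm_toc_allocate(pcxt->toc, sizeof(SharedFooState));
        shared->table = dsa_allocate(ps->state->es_query_area, 1024 * 1024);
        shm_toc_insert(pcxt->toc, ps->plan->plan_node_id, shared);
    }

    /* Worker, in an ExecXXXInitializeWorker()-style callback. */
    static void
    foo_initialize_worker(PlanState *ps, shm_toc *toc)
    {
        SharedFooState *shared = shm_toc_lookup(toc, ps->plan->plan_node_id);
        void       *table = dsa_get_address(ps->state->es_query_area,
                                            shared->table);

        /* ... build or attach to the shared data structure at 'table' ... */
    }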

Motivating use cases include shared bitmaps and shared hash tables
(patches to follow).

Currently, this doesn't mean you don't also need the existing DSM
segment.  In order share data structures in the DSA area, you need a
way to exchange pointers to find them, and the existing segment + TOC
mechanism is ideal for that.

One obvious problem is that this patch results in at least *two* DSM
segments being created for every parallel query execution: the main
segment used for parallel execution, and then the initial segment
managed by the DSA area.  One thought is that DSA areas are the more
general mechanism, so perhaps we should figure out how to store
contents of the existing segment in it.  The TOC interface would need
a few tweaks to be able to live in memory allocated with dsa_allocate,
and then we'd need to share that address with other backends so that
they could find it (cf the current approach of finding the TOC at the
start of the segment).  I haven't prototyped that yet.  That'd involve
changing the wording "InitializeDSM" that appears in various places
including the FDW API, which has been putting me off...

This patch depends on dsa-v2.patch[1].

[1] 
https://www.postgresql.org/message-id/flat/CAEepm%3D1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


dsa-area-for-executor-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Stopping logical replication protocol

2016-10-04 Thread Andres Freund
Hi,

On 2016-09-23 13:01:27 +0800, Craig Ringer wrote:
> From f98f2388c57d938ebbe07ccd2dbe02138312858f Mon Sep 17 00:00:00 2001
> From: Vladimir Gordiychuk 
> Date: Wed, 7 Sep 2016 00:39:18 +0300
> Subject: [PATCH 2/4] Client-initiated CopyDone during transaction decoding in
>  walsender
> 
> The prior patch caused the walsender to react to a client-initiated
> CopyDone while it's in the WalSndLoop. That's not enough to let clients
> end logical decoding mid-transaction because we loop in ReorderBufferCommit
> during decoding of a transaction without ever returning to WalSndLoop.
> 
> Allow breaking out of ReorderBufferCommit by detecting that client
> input has requested an end to COPY BOTH mode, so clients can abort
> decoding mid-transaction.

Hm, I'm *VERY* doubtful about doing this via yet another callback. Why
don't we just error out in the prepare write callback?

I don't think it's a good idea to return a non-error state if a command
is interrupted in the middle. We might already have sent a partial
transaction or such.  Also compare this to interrupting a query - we
don't just stop returning rows, we error out.


Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tracking wait event for latches

2016-10-04 Thread Michael Paquier
On Wed, Oct 5, 2016 at 4:28 AM, Thomas Munro
 wrote:
> On Wed, Oct 5, 2016 at 4:59 AM, Robert Haas  wrote:
>> On Mon, Oct 3, 2016 at 8:43 PM, Michael Paquier
>>  wrote:
>>> The rest looks good to me. Thanks for the feedback and the time!
>>
>> Thanks for the fixes.  I committed this...

Yeh!

>> ... with an additional compile
>> fix, but the buildfarm turned up a few more problems that my 'make
>> check-world' didn't find.  Hopefully those are fixed now, but we'll
>> see.

I saw that after waking up... As usual the buildfarm is catching
many of the things I missed...

> Nitpicking: the includes in bgworker.c weren't sorted properly, and
> then this patch added "pgstat.h" in the wrong position.  See attached
> suggestion.

Yes, that should be fixed.

And for the rest, sorry for the delay. Timezones...

More seriously, the Windows animals have been complaining about
pg_sleep() crashing the system:
  SELECT pg_sleep(0.1);
! server closed the connection unexpectedly
! This probably means the server terminated abnormally
! before or while processing the request.
! connection to server was lost

And I think that the cause of this crash is WaitForSingleObject(), for
which the Windows headers already define the macro WAIT_TIMEOUT, so there
is an overlap with the new declarations in pgstat.h:
https://msdn.microsoft.com/en-us/library/aa450988.aspx
This is also generating a bunch of build warnings now that I compile
HEAD on Windows. Regression tests are not crashing here, but I am
getting a failure in stats.sql and pg_sleep is broken. I swear I
tested that at some point and did not see a crash or those warnings...
But well what's done is done.

It seems to me that a correct answer would be to rename this class ID.
But instead I'd suggest appending the prefix PG_* to all the class
events, as in the attached, which passes make check, contrib-check and
modules-check and builds without warnings on Windows. A simpler fix
would be just to rename WAIT_TIMEOUT to something else, but appending
PG_ looks better in the long term.
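
To spell out the collision (simplified; the exact Windows declaration and
the names in the attached patch differ in detail):

    /*
     * The Windows SDK (winbase.h, pulled in via <windows.h>) has roughly:
     *
     *     #define WAIT_TIMEOUT ((DWORD) 0x00000102L)
     *
     * so any PostgreSQL enum member or macro literally named WAIT_TIMEOUT
     * gets macro-substituted on Windows builds.  Prefixing our names, as
     * the attached patch does, sidesteps that:
     */
    typedef enum PgWaitClass
    {
        PG_WAIT_LWLOCK_NAMED = 1,
        PG_WAIT_TIMEOUT,
        PG_WAIT_ACTIVITY
    } PgWaitClass;
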
-- 
Michael


we-msvc-fixes.patch
Description: application/download

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Stopping logical replication protocol

2016-10-04 Thread Vladimir Gordiychuk
> Vladimir, can I have your opinion on the latest patch or if you want to
make changes, an updated patch?

I am satisfied with all our changes and I think it is enough to complete
this PR.

2016-10-03 6:51 GMT+03:00 Craig Ringer :

> On 3 Oct. 2016 10:15, "Michael Paquier"  wrote:
> >
> > On Tue, Sep 27, 2016 at 11:05 AM, Craig Ringer 
> wrote:
> > > On 26 September 2016 at 21:52, Vladimir Gordiychuk 
> wrote:
> > >>>You should rely on time I tests as little as possible. Some of the
> test
> > >>> hosts are VERY slow. If possible something deterministic should be
> used.
> > >>
> > >> That's why this changes difficult to verify. Maybe in that case we
> should
> > >> write benchmark but not integration test?
> > >> In that case we can say, before this changes stopping logical
> replication
> > >> gets N ms but after apply changes it gets NN ms where NN ms less than
> N ms.
> > >> Is it available add this kind of benchmark to postgresql? I will be
> grateful
> > >> for links.
> > >
> > > It's for that reason that I added a message printed only in verbose
> > > mode that pg_recvlogical emits when it's exiting after a
> > > client-initiated copydone.
> > >
> > > You can use the TAP tests, print diag messages, etc. But we generally
> > > want them to run fairly quickly, so big benchmark runs aren't
> > > desirable. You'll notice that I left diag messages in to report the
> > > timing for the results in your tests, I just changed the tests so they
> > > didn't depend on very tight timing for success/failure anymore.
> > >
> > > We don't currently have any automated benchmarking infrastructure.
> >
> > Which seems like this patch is not complete yet.
>
> Personally I think it is. I'm just explaining why I adjusted Vladimir's
> tests to be less timing sensitive and rely on a qualitative property
> instead.
>
> That said, it's had recent changes, and it isn't a big intrusive change
> that really needs attention this CF.
>
> >  I am marking it as
> > returned with feedback, but it may be a better idea to move it to next
> > CF once a new version with updated tests shows up.
>
> I'll move it now. I think the tests are fine, if Vladimir agrees, so IMO
> it's ready to go. More eyes wouldn't hurt though.
>
> If Vladimir wants benchmarking based tests that's a whole separate project
> IMO. Something that will work robustly on the weird slow machines we have
> isn't simple. Probably a new buildfarm option etc since we won't want to
> clutter the really slow old niche boxes with it.
>
> Vladimir, can I have your opinion on the latest patch or if you want to
> make changes, an updated patch?
>
>
>


Re: [HACKERS] Misidentification of Python shared library

2016-10-04 Thread Tom Lane
Aidan Van Dyk  writes:
> https://packages.debian.org/jessie/amd64/libpython3.4/filelist
> The ".so." is in /usr/lib/, but ".so" symlink is in the
> config directory...

Hah.  I guess that explains why they are setting LIBDIR to /usr/lib.
Thanks for the info!

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Generic type subscription

2016-10-04 Thread Oleg Bartunov
On Sat, Oct 1, 2016 at 12:52 PM, Dmitry Dolgov <9erthali...@gmail.com>
wrote:

> > I've tried to compile this patch with current state of master (commit
> > 51c3e9fade76c12)  and found out that, when configured with
> --enable-cassert,
> > it doesn't pass make check.
>
> Thanks for the feedback. Yes, unexpectedly for me, `ExecEvalExpr` can
> return
> expanded `jbvArray` and `jbvObject` instead of `jbvBinary` in both cases. It's
> interesting that this doesn't break anything, but obviously violates
> the `pushJsonbValueScalar` semantics. I don't think `ExecEvalExpr` should
> be
> changed for jsonb, we can handle this situation in `pushJsonbValue`
> instead. I've
> attached a new version of patch with this modification.
>
>
have you ever run 'make check' ?

=
 53 of 168 tests failed.
=



> On 27 September 2016 at 19:08, Victor Wagner  wrote:
>
>> On Fri, 9 Sep 2016 18:29:23 +0700
>> Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > Regarding to the previous conversations [1], [2], here is a patch
>> > (with some improvements from Nikita Glukhov) for the generic type
>> > subscription. It allows
>> > to define type-specific subscription logic for any data type and has
>> > implementations for the array and jsonb types. There are following
>> > changes in this
>> > patch:
>>
>> I've tried to compile this patch with current state of master (commit
>> 51c3e9fade76c12)  and found out that, when configured with
>> --enable-cassert, it doesn't pass make check.
>>
>> LOG:  server process (PID 4643) was terminated by signal 6: Aborted
>> DETAIL:  Failed process was running:
>> update test_jsonb_subscript set test_json['a'] = '{"b":
>> 1}'::jsonb;
>>
>>
>>
>> --
>>  Victor
>>
>>
>> --
>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers
>>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>


Re: [HACKERS] Misidentification of Python shared library

2016-10-04 Thread Tom Lane
Robert Haas  writes:
> On Tue, Oct 4, 2016 at 3:38 PM, Tom Lane  wrote:
>> So I pushed that, and most of the Debian-based buildfarm critters
>> don't like it.  Where does Debian keep the python shlib, pray tell?

> ...must...not...say...I...told...you...so...

Meh.  How about you don't say it, until we get some evidence that that
actually has anything to do with it?

My immediate guess is that Debian decided that sticking the shlib in
${python_configdir} would be a cute idea after all, because it's hard to
think of what else they might've done that would have succeeded with the
old code and not the new.  But I'd rather not waste time on blind fix
attempts when there are people around here who can look at their
installations and tell me.

Also, I notice that on the animals that are now failing, the last
successful runs reported

checking how to link an embedded Python application... -L/usr/lib -lpython2.7 
-lpthread -ldl  -lutil -lm
 

which would if anything suggest that the shlib *is* in /usr/lib on
these machines.  So maybe I fat-fingered some other aspect entirely,
but I can't yet guess what.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misidentification of Python shared library

2016-10-04 Thread Robert Haas
On Tue, Oct 4, 2016 at 3:38 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Oct 4, 2016 at 12:57 PM, Tom Lane  wrote:
>>> Sure, but it will only matter given a more-or-less-broken Python
>>> installation.  Anyway, if you have a better idea, let's hear it.
>
>> Sorry, no clue
>
> So I pushed that, and most of the Debian-based buildfarm critters
> don't like it.  Where does Debian keep the python shlib, pray tell?

...must...not...say...I...told...you...so...

Even if you eventually get this working on every flavor of Linux
that's represented in the buildfarm, it's likely to be unreliable on
other versions of Linux and/or other UNIX-like operating systems
because operating system packagers *love* to find new places to store
things, and system administrators expect to be able to compile stuff
with --prefix and have things work after that, at least if they then
pass the right configure flags to the next thing they want to install.
I bet you a nickel that if you include a hard-coded list of paths in
any form, at least one operating system packager is going to have to
patch that hard-coded list in order to get things to work -- and a
quarter that at least one user will have to patch it.

Like I already said, I don't have a better solution, but all of my
experience says that this one is not going to be reliable no matter
how hard you beat it.  I will be happy if you can prove me wrong,
though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-04 Thread John Gorman
SlabContext has a parent context. It can delegate requests that
it cannot handle to the parent context, which is GenSlab. GenSlab
can send them to the sister aset context.

This could handle all reallocs so there will be no surprises.

Remind me again why we cannot realloc in place for sizes
smaller than chunkSize? GenSlab is happy to allocate smaller
sizes and put them into the fixed size chunks.

Maybe SlabAlloc can be happy with sizes up to chunkSize.

    if (size <= set->chunkSize)
        return MemoryContextAlloc(set->slab, size);
    else
        return MemoryContextAlloc(set->aset, size);

On Sat, Oct 1, 2016 at 10:15 PM, Tomas Vondra 
wrote:

> On 10/02/2016 12:23 AM, John Gorman wrote:
>
>> I reproduced the quadratic pfree performance problem and verified that
>> these patches solved it.
>>
>> The slab.c data structures and functions contain no quadratic components.
>>
>> I noticed the sizing loop in SlabContextCreate() and came up with
>> a similar formula to determine chunksPerBlock that you arrived at.
>>
>> Firstly, I've realized there's an issue when chunkSize gets too
>>> large - once it exceeds blockSize, the SlabContextCreate() fails
>>> as it's impossible to place a single chunk into the block. In
>>> reorderbuffer, this may happen when the tuples (allocated in
>>> tup_context) get larger than 8MB, as the context uses
>>> SLAB_LARGE_BLOCK_SIZE (which is 8MB).
>>>
>>> But maybe there's a simpler solution - we may simply cap the
>>> chunkSize (in GenSlab) to ALLOC_CHUNK_LIMIT. That's fine, because
>>> AllocSet handles those requests in a special way - for example
>>> instead of tracking them in freelist, those chunks got freed
>>> immediately.
>>>
>>
>> I like this approach because it fixes the performance problems
>> with smaller allocations and doesn't change how larger
>> allocations are handled.
>>
>>
> One more comment about GenSlab, particularly about unpredictability of the
> repalloc() behavior. It works for "large" chunks allocated in the AllocSet
> part, and mostly does not work for "small" chunks allocated in the
> SlabContext. Moreover, the chunkSize changes over time, so for two chunks
> of the same size, repalloc() may work on one of them and fail on the other,
> depending on time of allocation.
>
> With SlabContext it's perfectly predictable - repalloc() call fails unless
> the chunk size is exactly the same as before (which is perhaps a bit
> pointless, but if we decide to fail even in this case it'll work 100% time).
>
> But with GenSlabContext it's unclear whether the call fails or not.
>
> I don't like this unpredictability - I'd much rather have consistent
> failures (making sure people don't do repalloc() with GenSlab). But I
> don't see a nice way to achieve that - the repalloc() call does not go
> through GenSlabRealloc() at all, but directly to SlabRealloc() or
> AllocSetRealloc().
>
> The best solution I can think of is adding an alternate version of
> AllocSetMethods, pointing to a different AllocSetReset implementation.
>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: [HACKERS] Misidentification of Python shared library

2016-10-04 Thread Tom Lane
Robert Haas  writes:
> On Tue, Oct 4, 2016 at 12:57 PM, Tom Lane  wrote:
>> Sure, but it will only matter given a more-or-less-broken Python
>> installation.  Anyway, if you have a better idea, let's hear it.

> Sorry, no clue

So I pushed that, and most of the Debian-based buildfarm critters
don't like it.  Where does Debian keep the python shlib, pray tell?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tracking wait event for latches

2016-10-04 Thread Thomas Munro
On Wed, Oct 5, 2016 at 4:59 AM, Robert Haas  wrote:
> On Mon, Oct 3, 2016 at 8:43 PM, Michael Paquier
>  wrote:
>> The rest looks good to me. Thanks for the feedback and the time!
>
> Thanks for the fixes.  I committed this with an additional compile
> fix, but the buildfarm turned up a few more problems that my 'make
> check-world' didn't find.  Hopefully those are fixed now, but we'll
> see.

Nitpicking: the includes in bgworker.c weren't sorted properly, and
then this patch added "pgstat.h" in the wrong position.  See attached
suggestion.

-- 
Thomas Munro
http://www.enterprisedb.com


nitpicking.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] condition variables

2016-10-04 Thread Thomas Munro
On Mon, Aug 15, 2016 at 5:58 PM, Thomas Munro
 wrote:
> Please find attached a -v2 of your patch which includes suggestions
> 1-3 above.

Here's a rebased patch.  ConditionVariableSleep now takes
wait_event_info.  Anyone using this in patches for core probably needs
to add enumerators to the WaitEventXXX enums in pgstat.h to describe
their wait points.

-- 
Thomas Munro
http://www.enterprisedb.com


condition-variable-v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] longfin

2016-10-04 Thread Tom Lane
Robert Haas  writes:
> So, buildfarm member longfin is failing with:
> test python3/hstore_plpython  ... FAILED (test process exited with exit code 
> 2)

Yeah, this is a side effect of what I'm working on, don't sweat it.

(The only reason it's passed at all today is that I put a copy of
libpython.dylib into where the existing configure script is expecting
to find it.  But I undid that after verifying that the build passed
with it.  I'm about to commit a proper fix.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2016-10-04 Thread Jeff Janes
On Fri, Jul 22, 2016 at 3:02 AM, Mithun Cy 
wrote:

> I have created a patch to cache the meta page of Hash index in
> backend-private memory. This is to save reading the meta page buffer every
> time when we want to find the bucket page. In “_hash_first” call, we try
> to read meta page buffer twice just to make sure bucket is not split after
> we found bucket page. With this patch meta page buffer read is not done, if
> the bucket is not split after caching the meta page.
>
> Idea is to cache the Meta page data in rd_amcache and store maxbucket
> number in hasho_prevblkno of bucket primary page (which will always be NULL
> other wise, so reusing it here for this cause!!!). So when we try to do
> hash lookup for bucket page if locally cached maxbucket number is greater
> than or equal to bucket page's maxbucket number then we can say given
> bucket is not split after we have cached the meta page. Hence avoid reading
> meta page buffer.
>
> I have attached the benchmark results and perf stats (refer
> hash_index_perf_stat_and_benchmarking.odc [sheet 1: perf stats; sheet 2:
> Benchmark results). There we can see improvements at higher clients, as
> lwlock contentions due to buffer read are more at higher clients. If I
> apply the same patch on Amit's concurrent hash index patch [1] we can see
> improvements at lower clients also. Amit's patch has removed a heavy weight
> page lock which was the bottle neck at lower clients.
>
> [1] Concurrent Hash Indexes 
>

Hi Mithun,

Can you describe your benchmarking machine?  Your benchmarking data went up
to 128 clients.  But how many cores does the machine have?  Are you testing
how well it can use the resources it has, or how well it can deal with
oversubscription of the resources?

Also, was the file supposed to be named .ods?  I didn't find it to be
openable as an .odc file.

Cheers,

Jeff


Re: [HACKERS] pgbench more operators & functions

2016-10-04 Thread Fabien COELHO


Hello Robert,


I think it's pretty clear that this patch is not Ready for Committer,


As a reviewer, I do not know when something should be marked "ready".
My opinion is that it is up to the reviewer to decide. Whether some
consensus is actually reached, or whether a committer is going to disagree
later on, cannot be helped.



because there's no consensus that we want this, and like Tom and
Stephen, I'd argue that there are large parts of it we don't want.



The documentation in the latest patch version mentions XOR and IF
which we definitely don't want because there is no similar thing in
SQL,


I have removed these and put CASE WHEN THEN ELSE END instead in v6.


but in addition to that, I don't think much of an argument has
been made that any of this is actually useful.


In the TPC-B benchmark, some conditional is needed because, with a certain
probability, an account must be chosen in the *same* branch as the teller,
and otherwise in one of the *other* branches.
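
For instance, with the v6 syntax the account choice can be written roughly
like this (a sketch only; the 85% threshold and the scaling arithmetic are
illustrative, not lifted from the TPC-B specification text):

  \set bid random(1, :scale)
  \set aid CASE WHEN random(1, 100) <= 85 THEN random((:bid - 1) * 100000 + 1, :bid * 100000) ELSE random(1, 100000 * :scale) END
  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;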


I'm skeptical about the notion that giving pgbench a vast repertoire of 
mathematical functions is a good idea.  What does that actually let us 
do that is useful and not possible today?


I do not see a vast "repertoire" of functions. There are "usual" int 
operators, logical operators, and a few functions.


About the one added in this patch:

bitwise operations: I have seen them used to create a non-uniform random
value from a uniform one. Once one operator is put in, there is no reason
not to put in the others...


exp & ln: could be used to tweak distributions.

conditional: see above.

I have not put trigonometric functions because I could not think of a 
use in a benchmarking context.



I'm happy to see pgbench made better in a variety of ways, but I don't
really see why that particular thing is useful.  Perhaps I'm just
missing something.


I'm trying to add features that are IMO useful for benchmarking.

When doing so, someone says "hay, you put a double expression, you must 
put double variables". Although I can see the point of double expressions 
for passing ints into some transformations, I can't see a double variable 
really useful in any benchmark, but there it is, it is a side effect of 
the process, and it is somehow to have orthogonal features.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Question / requests.

2016-10-04 Thread Robert Haas
On Mon, Oct 3, 2016 at 5:44 PM, Alvaro Herrera  wrote:
> Robert Haas wrote:
>> On Fri, Sep 30, 2016 at 11:20 AM, Francisco Olarte
>>  wrote:
>> > After some messages due to vacuumdb auto-deadlocking itself on the
>> > system tables when doing parallel vacuum of a full database I
>> > suggested adding some flags to make vacuumdb process schemas. I was
>> > asked whether I could write a patch for that and I am thinking of doing
>> > it.
>>
>> What messages are you seeing, exactly? "auto-deadlocking" isn't a thing.
>
> https://www.postgresql.org/message-id/57EBC9AE.2060302%40163.com
>
> I wonder if the real answer isn't just to disallow -f with parallel
> vacuuming.

Seems like we should figure out which catalog tables are needed in
order to perform a VACUUM, and force those to be done last and one at
a time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] longfin

2016-10-04 Thread Robert Haas
So, buildfarm member longfin is failing with:

test python3/hstore_plpython  ... FAILED (test process exited with exit code 2)

It claims that it's been failing since

6f3bd98 Tue Oct 4 15:01:42 2016 UTC  Extend framework from commit
53be0b1ad to report latch waits.

which would tend to pointer the finger at my commit that broke
everything else today.  However, I'm suspicious that this is actually
the fault of d51924be886c2a05e691fa05b16cb6b30ab8370f and/or
490ed1ebb9b26fb342a1e3240107092e9c5aec02 because I don't see how
changes to the latch stuff would break plpython.

However, I obviously missed a few things in 6f3bd98, so feel free to
tell me this is another case of me being dumb.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-04 Thread Robert Haas
On Tue, Oct 4, 2016 at 1:48 AM, Michael Paquier
 wrote:
> Yes, pg_wal is fine as new name in replacement of pg_xlog. Now for the
> pg_clog...

Of course the irony here is that "WAL" stands for "Write Ahead Log".
So we're renaming a directory that has "log" in the name (pg_xlog) to a
directory that has an "l" in the name which stands for "log" (pg_wal).
Whee!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning - another take

2016-10-04 Thread Robert Haas
On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
 wrote:
>> Even if we leave the empty relfilenode around for now -- in the long
>> run I think it should die -- I think we should prohibit the creation
>> of subsidiary object on the parent which is only sensible if it has
>> rows - e.g. indexes.  It makes no sense to disallow non-inheritable
>> constraints while allowing indexes, and it could box us into a corner
>> later.
>
> I agree.  So we must prevent from the get-go the creation of following
> objects on parent tables (aka RELKIND_PARTITIONED_TABLE relations):
>
> * Indexes
> * Row triggers (?)

Hmm, do we ever fire triggers on the parent for operations on a child
table?  Note this thread, which seems possibly relevant:

https://www.postgresql.org/message-id/flat/cd282adde5b70b20c57f53bb9ab75e27%40biglumber.com

We certainly won't fire the parent's per-row triggers ever, so those
should be prohibited.  But I'm not sure about per-statement triggers.

> In addition to preventing creation of these objects, we must also teach
> commands that directly invoke heapam.c to skip such relations.

I don't think that's required if we're not actually getting rid of
the relfilenode.

> In this case, relkind refers to the command viz. DROP 
> .  It can never be RELKIND_PARTITIONED_TABLE, so the above will
> be equivalent to leaving things unchanged.  Anyway I agree the suggested
> style is better, but I assume you meant:
>
> if (classform->relkind == RELKIND_PARTITIONED_TABLE)
> expected_relkind = RELKIND_RELATION;
> else
> expected_relkind = classform->relkind;
>
> if (relkind != expected_relkind)
> DropErrorMsgWrongType(rel->relname, classform->relkind, relkind);

Uh, yeah, probably that's what I meant. :-)

> Thanks for the explanation. I added a new field to the catalog called
> partcollation.
>
> That means a couple of things - when comparing input key values (consider
> a new tuple being routed) to partition bounds using the btree compare
> function, we use the corresponding key column's partcollation.  The same
> is also used as inputcollid of the OpExpr (or ScalarArrayOpExpr) in the
> implicitly generated CHECK constraints.

Check.

> However, varcollid of any Var nodes in the above expressions is still the
> corresponding attributes' attcollation.

I think that's OK.

> Needless to say, a query-specified qualification clause must now include
> the same collate clause as used for individual partition columns (or
> expressions) for the constraint exclusion to work as intended.

Hopefully this is only required if the default choice of collation
wouldn't be correct anyway.

>> +  partexprbin
>>
>> Is there any good reason not to do partexprbin -> partexpr throughout the 
>> patch?
>
> Agreed. Though I used partexprs (like indexprs).

Oh, good idea.

>>  case EXPR_KIND_TRIGGER_WHEN:
>>  return "WHEN";
>> +case EXPR_KIND_PARTITION_KEY:
>> +return "partition key expression";
>>
>> I think you should say "PARTITION BY" here.  See the function header
>> comment for ParseExprKindName.
>
> Used "PARTITION BY".  I was trying to mimic EXPR_KIND_INDEX_EXPRESSION,
> but this seems to work better.  Also, I renamed EXPR_KIND_PARTITION_KEY to
> EXPR_KIND_PARTITION_EXPRESSION.
>
> By the way, a bunch of error messages say "partition key expression".  I'm
> assuming it's better to leave them alone, that is, not rewrite them as
> "PARTITION BY expressions".

I think that is fine.  ParseExprKindName is something of a special case.

Reviewing 0002:

+prettyFlags = PRETTYFLAG_INDENT;
+PG_RETURN_TEXT_P(string_to_text(pg_get_partkeydef_worker(relid,
+prettyFlags)));

Why bother with the variable?

+/* Must get partclass, and partexprs the hard way */
+datum = SysCacheGetAttr(PARTEDRELID, tuple,
+Anum_pg_partitioned_table_partclass, &isnull);
+Assert(!isnull);
+partclass = (oidvector *) DatumGetPointer(datum);

Comment mentions getting two things, but code only gets one.

+partexprs = (List *) stringToNode(exprsString);

if (!IsA(partexprs, List)) elog(ERROR, ...); so as to guard against
corrupt catalog contents.

+switch (form->partstrat)
+{
+case 'l':
+appendStringInfo(&buf, "LIST");
+break;
+case 'r':
+appendStringInfo(&buf, "RANGE");
+break;
+}

default: elog(ERROR, "unexpected partition strategy: %d", (int)
form->partstrat);

Also, why not case PARTITION_STRATEGY_LIST: instead of case 'l': and
similarly for 'r'?
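
i.e. something along these lines (just a sketch, assuming the patch
defines PARTITION_STRATEGY_LIST/RANGE constants):

    switch (form->partstrat)
    {
        case PARTITION_STRATEGY_LIST:
            appendStringInfoString(&buf, "LIST");
            break;
        case PARTITION_STRATEGY_RANGE:
            appendStringInfoString(&buf, "RANGE");
            break;
        default:
            elog(ERROR, "unexpected partition strategy: %d",
                 (int) form->partstrat);
    }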

@@ -5296,7 +5301,8 @@ getTables(Archive *fout, int *numTables)
   "OR %s IS NOT NULL "
   "OR %s IS NOT NULL"
   "))"
-  "AS changed_acl "
+  "AS changed_acl, "
+  "CASE WHEN c.relkind = 'P' THEN
pg_catalog.pg_get_partkeydef(c.oid) EL

Re: [HACKERS] Misidentification of Python shared library

2016-10-04 Thread Robert Haas
On Tue, Oct 4, 2016 at 12:57 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Oct 4, 2016 at 11:45 AM, Tom Lane  wrote:
>>> "$python_libdir /usr/lib64 /usr/lib" (too bad there's no easy
>>> platform-independent way to find out the linker's default search path).
>
>> "too bad" sounds like a grave understatement of the perils involved
>> using a fixed list of paths which may or may not bear any resemblance
>> at all to what the platform actually uses.
>
> Sure, but it will only matter given a more-or-less-broken Python
> installation.  Anyway, if you have a better idea, let's hear it.

Sorry, no clue

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misidentification of Python shared library

2016-10-04 Thread Tom Lane
Robert Haas  writes:
> On Tue, Oct 4, 2016 at 11:45 AM, Tom Lane  wrote:
>> "$python_libdir /usr/lib64 /usr/lib" (too bad there's no easy
>> platform-independent way to find out the linker's default search path).

> "too bad" sounds like a grave understatement of the perils involved
> using a fixed list of paths which may or may not bear any resemblance
> at all to what the platform actually uses.

Sure, but it will only matter given a more-or-less-broken Python
installation.  Anyway, if you have a better idea, let's hear it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench more operators & functions

2016-10-04 Thread Fabien COELHO


Hello Stephen,


I'm pretty sure we should hold off on adding 'xor' [...]


So I removed "xor" in the attached version.


[...] I certainly understand how the if() function could be useful


Indeed, some kind of "if" is needed, for instance to implement
"tpc-b" correctly.


That's an interesting point..  Have you thought about ripping out the 
built-in TPC-B-like functionality of pgbench and making that into a 
script instead? Doing so would likely point out places where we need to 
add functionality or cases which aren't handled very well. [...]


My point is not that pgbench should do tpc-b-real by default, but that it
should be possible to do it, and currently this is not possible. People 
rely on what pgbench is doing so it should not be (significantly) changed, 
at least in its default behavior.



I'm not quite sure that I follow why you feel that CASE would be too
difficult to implement here..?


Obviously, everything is possible, it just takes more time. I've added the 
generic case expression syntax in the attached version.



[...] If we're going to replace the built-in TPC-B functionality


As stated above, I do *NOT* want to replace the existing functionality, I 
just want to make pgbench able to do more in a sensible & useful way for 
benchmarking.


I'm not currently planning to add user functions and control structures
outside of expressions in pgbench without a very strong use case.


[...] Perhaps it'd be difficult to rework pgbench to support it, but I 
do think it's something we should at least be considering and making 
sure we don't wall ourselves off from implementing in the future. Most 
scripting languages do support functions of one form or another.


Sure.


Lastly, we should really add some regression tests to pgbench.  We
already have the start of a TAP test script which it looks like it
wouldn't be too hard to add on to.


I agree that pgbench should be tested systematically. I think that
this is not limited to these functions and operators but is a more
general and desirable feature, and thus really material for another
patch.


I don't agree with this- at a minimum, this patch should add regression
tests for the features that it's adding.


This has not been the case with the previous dozens of patches about 
pgbench, but maybe it should start somewhere:-).


There was already one TAP script for pgbench (for testing concurrent oid
insertion, not really a pgbench functionality); I added another one which
focuses on expressions.


Changes in v6:
 - remove xor operator
 - remove if function
 - make keywords case insensitive (more like SQL)
 - add generic case expression syntax
   (note that the typing is the same strange one as pg, i.e.
CASE WHEN (RANDOM() > 0.5) THEN 1 ELSE 1.0 END;
can be either an int or a double, depending; this is not statically typed...)
 - add TAP test about pgbench expressions

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 285608d..75575ad 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -821,11 +821,11 @@ pgbench  options  dbname
   The expression may contain integer constants such as 5432,
   double constants such as 3.14159,
   references to variables :variablename,
-  unary operators (+, -) and binary operators
-  (+, -, *, /,
-  %) with their usual precedence and associativity,
-  function calls, and
-  parentheses.
+  operators
+  with their usual precedence and associativity,
+  function calls,
+  SQL CASE generic conditional expressions
+  and parentheses.
  
 
  
@@ -909,6 +909,165 @@ pgbench  options  dbname
   
  
 
+ 
+  Built-In Operators
+
+  
+   The arithmetic, bitwise, comparison and logical operators listed in
+are built into pgbench
+   and may be used in expressions appearing in
+   \set.
+  
+
+  
+   pgbench Operators by increasing precedence
+   
+
+ 
+  Operator
+  Description
+  Example
+  Result
+ 
+
+
+ 
+  or
+  logical or
+  5 or 0
+  1
+ 
+ 
+  and
+  logical and
+  3 and 0
+  0
+ 
+ 
+  |
+  integer bitwise OR
+  1 | 2
+  3
+ 
+ 
+  #
+  integer bitwise XOR
+  1 # 3
+  2
+ 
+ 
+  &
+  integer bitwise AND
+  1 & 3
+  1
+ 
+ 
+  =
+  is equal
+  5 = 4
+  0
+ 
+ 
+  <>
+  is not equal
+  5 <> 4
+  1
+ 
+ 
+  !=
+  is not equal
+  5 != 5
+  0
+ 
+ 
+  <
+  lower than
+  5 < 4
+  0
+ 
+ 
+  <=
+  lower or equal
+  5 <= 4
+  0
+ 
+ 
+  >
+  greater than
+  5 > 4
+  1
+ 
+ 
+  >=
+  greater or equal
+  5 >= 4
+  1
+ 
+ 
+  <<
+  left shift
+  1 << 2
+  4
+ 
+ 
+  >>
+  right shift
+  8 >> 2
+  2
+ 
+ 
+  +
+

Re: [HACKERS] pgbench more operators & functions

2016-10-04 Thread Robert Haas
On Sun, Oct 2, 2016 at 9:41 AM, Michael Paquier
 wrote:
> On Sat, Oct 1, 2016 at 11:35 PM, Fabien COELHO  wrote:
>> Attached version changes:
>>  - removes C operators not present in psql
>>  - document operators one per line
>
> Moved to next CF with same status: "Ready for committer".

I think it's pretty clear that this patch is not Ready for Committer,
because there's no consensus that we want this, and like Tom and
Stephen, I'd argue that there are large parts of it we don't want.
The documentation in the latest patch version mentions XOR and IF
which we definitely don't want because there is no similar thing in
SQL, but in addition to that, I don't think much of an argument has
been made that any of this is actually useful.  I'm skeptical about
the notion that giving pgbench a vast repertoire of mathematical
functions is a good idea.  What does that actually let us do that is
useful and not possible today?

I'm happy to see pgbench made better in a variety of ways, but I don't
really see why that particular thing is useful.  Perhaps I'm just
missing something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Heikki Linnakangas

On 10/04/2016 07:06 PM, Claudio Freire wrote:

On Tue, Oct 4, 2016 at 7:47 AM, Heikki Linnakangas  wrote:

1. The first patch changes the way we store the logical tapes on disk.
Instead of using indirect blocks, HFS style, the blocks form a linked list.
Each block has a trailer, with the block numbers of the previous and next
block of the tape. That eliminates the indirect blocks, which simplifies the
code quite a bit, and makes it simpler to implement the pause/resume
functionality in the second patch. It also immediately reduces the memory
needed for the buffers, from 3 to 1 block per tape, as we don't need to hold
the indirect blocks in memory.


Doesn't that make prefetching more than a buffer ahead harder?


Yes, indeed. That's a potential downside, if we wanted to do that. We 
rely wholly on OS readahead for that currently, so it doesn't matter. I 
don't think we want to start issuing explicit pre-fetching calls here 
any time soon. But if we did, we could presume that the pages are in
sequential order for prefetching purposes, and that would work pretty 
well in practice (that's what the OS readahead is doing for us now). Or 
we could try to enforce that by allocating blocks in larger chunks.
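
(For concreteness, the per-block trailer amounts to something like the
struct below; the field names and types are illustrative, not necessarily
what the patch uses.)

    typedef struct TapeBlockTrailer
    {
        long        prev;       /* previous block of this tape, -1 if first */
        long        next;       /* next block of this tape, -1 if last */
    } TapeBlockTrailer;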


In short, I'm OK with that.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Extend framework from commit 53be0b1ad to report latch waits.

2016-10-04 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Oct 4, 2016 at 11:29 AM, Tom Lane  wrote:
> > Robert Haas  writes:
> >> Apparently, 'make world' does not build worker_spi.  I thought 'make
> >> world' was supposed to build everything?
> >
> > You'd have thunk, yeah.  It looks like the issue is that src/Makefile
> > is selective about recursing into certain subdirectories of test/,
> > but mostly not test/ itself.  src/test/Makefile naively believes it's
> > in charge, though.  Probably that logic ought to get shoved down one
> > level, and then adjusted so that src/test/modules gets built by "all".
> > Or else teach top-level "make world" to do "make all" in src/test/,
> > but that seems like it's just doubling down on confusing interconnections.
> 
> I think this confusion probably resulted from the move of certain
> things from contrib to src/test/modules, although that might be wrong.

I think you're right -- I tried to avoid recursing into src/test/modules
as much as possible when doing the move, so that it'd not be a bother
when someone just wants to build/package the server.  But obviously for
the purposes of testing the overall build that is the wrong thing to do,
and I agree that we should revisit those decisions.

> At any rate, just as contrib isn't built by 'make all', one could
> argue that src/test/modules shouldn't be built, either.  Then again,
> if 'make check' depends on it, maybe it needs to be.

Hmm, does it?  AFAICS (toplevel GNUmakefile) only src/test/regress is
recursed onto for "make check".

> Either way, 'make world' should certainly build it, I think.

I can buy that.

> Interesting, 'make check-world' tried to compile test_shm_mq, so I
> caught the WaitLatch call there before committing.  I guess it only
> did that because src/test/modules/test_shm_mq has a regression test,
> though.  worker_spi does not, so even though they're both under
> src/test/modules, one eventually got built and the other did not.
> That's not so great.

This sounds broken to me, actually.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Claudio Freire
On Tue, Oct 4, 2016 at 7:47 AM, Heikki Linnakangas  wrote:
> 1. The first patch changes the way we store the logical tapes on disk.
> Instead of using indirect blocks, HFS style, the blocks form a linked list.
> Each block has a trailer, with the block numbers of the previous and next
> block of the tape. That eliminates the indirect blocks, which simplifies the
> code quite a bit, and makes it simpler to implement the pause/resume
> functionality in the second patch. It also immediately reduces the memory
> needed for the buffers, from 3 to 1 block per tape, as we don't need to hold
> the indirect blocks in memory.

Doesn't that make prefetching more than a buffer ahead harder?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misidentification of Python shared library

2016-10-04 Thread Robert Haas
On Tue, Oct 4, 2016 at 11:45 AM, Tom Lane  wrote:
> "$python_libdir /usr/lib64 /usr/lib" (too bad there's no easy
> platform-independent way to find out the linker's default search path).

"too bad" sounds like a grave understatement of the perils involved
using a fixed list of paths which may or may not bear any resemblance
at all to what the platform actually uses.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-10-04 Thread Robert Haas
On Tue, Oct 4, 2016 at 5:05 AM, Tsunakawa, Takayuki
 wrote:
> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> FWIW, I'm pretty much -1 on messing with the timing of the socket close
>> actions.  I broke that once within recent memory, so maybe I'm gun-shy,
>> but I think that the odds of unpleasant side effects greatly outweigh any
>> likely benefit there.
>
> I couldn't find any relevant mails in pgsql-hackers.  I found no problem with 
> the attached patch.  Do you think this is OK?

I have no opinion on this patch, because I haven't reviewed it, but
note recent commit 3b90e38c5d592ea8ec8236287dd5c749fc041728, which
appears to be semi-related.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Heikki Linnakangas

On 10/04/2016 05:58 PM, Simon Riggs wrote:

On 4 October 2016 at 12:47, Heikki Linnakangas  wrote:


Why not just make each new run start at a block boundary?
That way we waste on average BLCKSZ/2 disk space per run, which is
negligible but we avoid any need to have code to read back in the last
block.


Hmm. You'd still have to read back the last block, so that you can update
its next-pointer.


If each run is in its own file, then you can skip that bit.


Then you need to remember the names of the files, in memory. That's 
closer to my idea of having only one run on each tape, but you're taking
it further by also saying that each tape is a separate file.


That might be a good idea for building the initial runs. Each file would 
then be written and read sequentially, so we could perhaps get rid of 
the whole pre-reading code, and rely completely on OS read-ahead.


However, we can't really do that for multi-pass merges, because you want to
reuse the space of the input tapes for the output tape as you go.



And we do want the sort to disk to use multiple files so we can
parallelize I/O as well as CPU.


Huh? Why can't you parallelize I/O on a single file?

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tracking wait event for latches

2016-10-04 Thread Robert Haas
On Mon, Oct 3, 2016 at 8:43 PM, Michael Paquier
 wrote:
> The rest looks good to me. Thanks for the feedback and the time!

Thanks for the fixes.  I committed this with an additional compile
fix, but the buildfarm turned up a few more problems that my 'make
check-world' didn't find.  Hopefully those are fixed now, but we'll
see.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Extend framework from commit 53be0b1ad to report latch waits.

2016-10-04 Thread Robert Haas
On Tue, Oct 4, 2016 at 11:29 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> Apparently, 'make world' does not build worker_spi.  I thought 'make
>> world' was supposed to build everything?
>
> You'd have thunk, yeah.  It looks like the issue is that src/Makefile
> is selective about recursing into certain subdirectories of test/,
> but mostly not test/ itself.  src/test/Makefile naively believes it's
> in charge, though.  Probably that logic ought to get shoved down one
> level, and then adjusted so that src/test/modules gets built by "all".
> Or else teach top-level "make world" to do "make all" in src/test/,
> but that seems like it's just doubling down on confusing interconnections.

I think this confusion probably resulted from the move of certain
things from contrib to src/test/modules, although that might be wrong.
At any rate, just as contrib isn't built by 'make all', one could
argue that src/test/modules shouldn't be built, either.  Then again,
if 'make check' depends on it, maybe it needs to be.  Either way,
'make world' should certainly build it, I think.

Interesting, 'make check-world' tried to compile test_shm_mq, so I
caught the WaitLatch call there before committing.  I guess it only
did that because src/test/modules/test_shm_mq has a regression test,
though.  worker_spi does not, so even though they're both under
src/test/modules, one eventually got built and the other did not.
That's not so great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misidentification of Python shared library

2016-10-04 Thread Tom Lane
Andrew Dunstan  writes:
> On 10/04/2016 10:43 AM, Tom Lane wrote:
>> In short: I propose replacing all of this logic with "if there's something
>> in $python_libdir that has the right name to be a python shared library,
>> use that, else try the same in $python_configdir, else fail".  Thoughts?

> Seems reasonable.

Actually, given that we no longer support Python < 2.3, I am not sure
we need to bother with looking in $python_configdir at all.

But looking at the installations on my older Apple boxes, it seems that
*neither* $python_libdir nor $python_configdir contain any library at all;
rather libpython.dylib sits in good old /usr/lib and is found there
despite the -L switch pointing to noplace useful.  So I'm now thinking
that we should do the above dance in a list of directories like
"$python_libdir /usr/lib64 /usr/lib" (too bad there's no easy
platform-independent way to find out the linker's default search path).
We can add $python_configdir to that if it proves necessary, but I'll try
it without first.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: ON UPDATE REMOVE foreign key action

2016-10-04 Thread Kirill Berezin
Disclaimer: sorry, I am not sure whether I should reply to each of you
personally or just answer on the list. Some feedback was sent to me
privately, and some included a copy to the list.

Thanks for the responses; you understood it correctly.

When I said "anybody", I meant including the owner himself, for example via
cookie poisoning.
There is no "other" session, technically. They look the same to the server;
they can even have the same IP.
Yes, we can't prevent it with CSRF cookies, but that is not the point of the
current conversation.

I can keep the business logic outside the table by making an extra query. I
just don't like how that looks from the perspective of encapsulation.
Each table should describe itself, like an object in an OOP language, using
SQL constructs or triggers/constraints.

The second part of my use case is a data cache. When a user updates the
version (generation), the cache should be flushed. As a simple example:
imagine I am fetching a currency rate, and until the end of the day I honor
the current rate. (Or any other information that has well-defined origin
checkpoints.) When I update the origin state (current day, server version,
IP address, neural network generation), I have to invalidate all previous
data.

It is like computing the digits of the square root of some number: the more
time I spend, the closer my approximation gets to the irrational result. But
when the original value changes, all previous data stops making sense; I
flush it and start again from digit 1.

These are allegorical examples of my real-world cases. I could try to
imagine some hypothetical situations where this functionality would be more
welcome, but I respect the reasons for not accepting this proposal. If my
proposal didn't shift the balance, that's OK; an ON UPDATE trigger is not
that painful.


Re: [HACKERS] proposal: psql \setfileref

2016-10-04 Thread Pavel Stehule
2016-10-04 9:18 GMT+02:00 Gilles Darold :

> Le 03/10/2016 à 23:23, Gilles Darold a écrit :
> > Le 03/10/2016 à 23:03, Robert Haas a écrit :
> >> On Mon, Oct 3, 2016 at 3:54 PM, Gilles Darold 
> wrote:
> >>> 4) An other problem is that like this this patch will allow anyone to
> upload into a
> >>> column the content of any system file that can be read by postgres
> system user
> >>> and then allow non system user to read its content.
> >> I thought this was a client-side feature, so that it would let a
> >> client upload any file that the client can read, but not things that
> >> can only be read by the postgres system user.
> >>
> > Yes that's right, sorry for the noise, forget this fourth report.
> >
>
> After some more thought, there is still a security issue here. For a
> PostgreSQL user who also has login access to the server, it is possible
> to read any file that the postgres system user can read, especially a
> .pgpass or a recovery.conf containing passwords.
>

This patch doesn't introduce any new server-side functionality, so if there
is some vulnerability, then it already exists today as well.

Regards

Pavel


>
>
> --
> Gilles Darold
> Consultant PostgreSQL
> http://dalibo.com - http://dalibo.org
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] [COMMITTERS] pgsql: Extend framework from commit 53be0b1ad to report latch waits.

2016-10-04 Thread Tom Lane
Robert Haas  writes:
> Apparently, 'make world' does not build worker_spi.  I thought 'make
> world' was supposed to build everything?

You'd have thunk, yeah.  It looks like the issue is that src/Makefile
is selective about recursing into certain subdirectories of test/,
but mostly not test/ itself.  src/test/Makefile naively believes it's
in charge, though.  Probably that logic ought to get shoved down one
level, and then adjusted so that src/test/modules gets built by "all".
Or else teach top-level "make world" to do "make all" in src/test/,
but that seems like it's just doubling down on confusing interconnections.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misidentification of Python shared library

2016-10-04 Thread Andrew Dunstan



On 10/04/2016 10:43 AM, Tom Lane wrote:

While chasing down last night's failure on buildfarm member longfin,
I came across an interesting factoid.  If you just do

./configure --enable-shared
make
make install

in unmodified Python sources, what you will get is an install tree in
which libpython.so (or local equivalent such as .dylib) is installed
in /usr/local/lib, while libpython.a is installed in a directory named
like /usr/local/lib/python3.5/config-3.5m.  I've verified this behavior
on both Linux and macOS.



Wow, these modules have uncovered a number of cans of worms.





In short: I propose replacing all of this logic with "if there's something
in $python_libdir that has the right name to be a python shared library,
use that, else try the same in $python_configdir, else fail".  Thoughts?





Seems reasonable.

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commit fest 2016-09 is now closed

2016-10-04 Thread Fabrízio de Royes Mello
On Mon, Oct 3, 2016 at 2:05 AM, Michael Paquier 
wrote:
>
> On Mon, Oct 3, 2016 at 2:01 PM, Michael Paquier
>  wrote:
> > There was something like 60~70 patches still listed as active in the
> > CF app on Friday my time, so doing the vacuum cleanup has taken some
> > time by marking patches as returned with feedback, rejected (few) and
> > moved to next CF. A couple of things I noted when going through the
> > remaining entries:
> > - a lot of folks have registered to review a patch and did not post a
> > review. Please avoid putting your name on a patch if you can't afford
> > the time to look at a patch.
> > - The rule one patch sent = one patch review of equal difficulty still
works.
> > - Be careful of the status of the patch, and update it correctly. I
> > have noticed a lot of improvements in this area actually!
>
> Forgot something: if you feel that your patch did not receive the
> correct treatment, of course feel free to update its status in the CF
> app, feel free as well to complain at me and/or on this thread if
> necessary. I am sure we will be able to sort things out :)

Hi Michael,

Thank you so much for your help. I was travelling to take a rest after a
long and busy month.

I'll update my statistics and send another email reporting the percentage of
the work that was done.

Thanks again,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Simon Riggs
On 4 October 2016 at 12:47, Heikki Linnakangas  wrote:

>> Why not just make each new run start at a block boundary?
>> That way we waste on average BLCKSZ/2 disk space per run, which is
>> negligible but we avoid any need to have code to read back in the last
>> block.
>
>
> Hmm. You'd still have to read back the last block, so that you can update
> its next-pointer.

If each run is in its own file, then you can skip that bit.

And we do want the sort to disk to use multiple files so we can
parallelize I/O as well as CPU.

So since we know we'll want multiple files, we should be thinking
about how to split things up between files.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Covering + unique indexes.

2016-10-04 Thread Anastasia Lubennikova

03.10.2016 15:29, Anastasia Lubennikova:

03.10.2016 05:22, Michael Paquier:

On Tue, Sep 27, 2016 at 12:17 AM, Anastasia Lubennikova
 wrote:

Ok, I'll write it in a few days.

Marked as returned with feedback per last emails exchanged.


The only complaint about this patch was the lack of a README,
which is fixed now (see the attachment). So, I added it to the new CF,
marked as ready for committer.


One more fix for pg_upgrade.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index d4f9090..99735ce 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -100,7 +100,7 @@ static remoteConn *getConnectionByName(const char *name);
 static HTAB *createConnHash(void);
 static void createNewConnection(const char *name, remoteConn *rconn);
 static void deleteConnection(const char *name);
-static char **get_pkey_attnames(Relation rel, int16 *numatts);
+static char **get_pkey_attnames(Relation rel, int16 *indnkeyatts);
 static char **get_text_array_contents(ArrayType *array, int *numitems);
 static char *get_sql_insert(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals, char **tgt_pkattvals);
 static char *get_sql_delete(Relation rel, int *pkattnums, int pknumatts, char **tgt_pkattvals);
@@ -1483,7 +1483,7 @@ PG_FUNCTION_INFO_V1(dblink_get_pkey);
 Datum
 dblink_get_pkey(PG_FUNCTION_ARGS)
 {
-	int16		numatts;
+	int16		indnkeyatts;
 	char	  **results;
 	FuncCallContext *funcctx;
 	int32		call_cntr;
@@ -1509,7 +1509,7 @@ dblink_get_pkey(PG_FUNCTION_ARGS)
 		rel = get_rel_from_relname(PG_GETARG_TEXT_P(0), AccessShareLock, ACL_SELECT);
 
 		/* get the array of attnums */
-		results = get_pkey_attnames(rel, &numatts);
+		results = get_pkey_attnames(rel, &indnkeyatts);
 
 		relation_close(rel, AccessShareLock);
 
@@ -1529,9 +1529,9 @@ dblink_get_pkey(PG_FUNCTION_ARGS)
 		attinmeta = TupleDescGetAttInMetadata(tupdesc);
 		funcctx->attinmeta = attinmeta;
 
-		if ((results != NULL) && (numatts > 0))
+		if ((results != NULL) && (indnkeyatts > 0))
 		{
-			funcctx->max_calls = numatts;
+			funcctx->max_calls = indnkeyatts;
 
 			/* got results, keep track of them */
 			funcctx->user_fctx = results;
@@ -2021,10 +2021,10 @@ dblink_fdw_validator(PG_FUNCTION_ARGS)
  * get_pkey_attnames
  *
  * Get the primary key attnames for the given relation.
- * Return NULL, and set numatts = 0, if no primary key exists.
+ * Return NULL, and set indnkeyatts = 0, if no primary key exists.
  */
 static char **
-get_pkey_attnames(Relation rel, int16 *numatts)
+get_pkey_attnames(Relation rel, int16 *indnkeyatts)
 {
 	Relation	indexRelation;
 	ScanKeyData skey;
@@ -2034,8 +2034,8 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 	char	  **result = NULL;
 	TupleDesc	tupdesc;
 
-	/* initialize numatts to 0 in case no primary key exists */
-	*numatts = 0;
+	/* initialize indnkeyatts to 0 in case no primary key exists */
+	*indnkeyatts = 0;
 
 	tupdesc = rel->rd_att;
 
@@ -2056,12 +2056,12 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 		/* we're only interested if it is the primary key */
 		if (index->indisprimary)
 		{
-			*numatts = index->indnatts;
-			if (*numatts > 0)
+			*indnkeyatts = index->indnkeyatts;
+			if (*indnkeyatts > 0)
 			{
-result = (char **) palloc(*numatts * sizeof(char *));
+result = (char **) palloc(*indnkeyatts * sizeof(char *));
 
-for (i = 0; i < *numatts; i++)
+for (i = 0; i < *indnkeyatts; i++)
 	result[i] = SPI_fname(tupdesc, index->indkey.values[i]);
 			}
 			break;
diff --git a/contrib/tcn/tcn.c b/contrib/tcn/tcn.c
index 7352b29..c024cf3 100644
--- a/contrib/tcn/tcn.c
+++ b/contrib/tcn/tcn.c
@@ -138,9 +138,9 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 		/* we're only interested if it is the primary key and valid */
 		if (index->indisprimary && IndexIsValid(index))
 		{
-			int			numatts = index->indnatts;
+			int indnkeyatts = index->indnkeyatts;
 
-			if (numatts > 0)
+			if (indnkeyatts > 0)
 			{
 int			i;
 
@@ -150,7 +150,7 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 appendStringInfoCharMacro(payload, ',');
 appendStringInfoCharMacro(payload, operation);
 
-for (i = 0; i < numatts; i++)
+for (i = 0; i < indnkeyatts; i++)
 {
 	int			colno = index->indkey.values[i];
 
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 29738b0..9c7f9e5 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -3564,6 +3564,14 @@
   pg_class.relnatts)
  
 
+  
+  indnkeyatts
+  int2
+  
+  The number of key columns in the index. "Key columns" are ordinary
+  index columns in contrast with "included" columns.
+ 
+
  
   indisunique
   bool
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 40f201b..92bef4a 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@ typedef struc

[HACKERS] Misidentification of Python shared library

2016-10-04 Thread Tom Lane
While chasing down last night's failure on buildfarm member longfin,
I came across an interesting factoid.  If you just do

./configure --enable-shared
make
make install

in unmodified Python sources, what you will get is an install tree in
which libpython.so (or local equivalent such as .dylib) is installed
in /usr/local/lib, while libpython.a is installed in a directory named
like /usr/local/lib/python3.5/config-3.5m.  I've verified this behavior
on both Linux and macOS.

The trouble is that we are picking up /usr/local/lib/python3.5/config-3.5m
as the python_libdir.  This means that we think we are using a python
shared library (because we see that Python was configured to build one)
but what we are actually linking with is a static library.  This happens
to work, because the code in the static library was built with -fpic,
but it results in an undesirably bloated plpython.so.

Or at least it did work until commit d51924be8.  With that, plpython.so
and hstore_plpython.so are each bloated with their very own copy of the
Python code.  And they fail to interoperate at all, which is why longfin
crashed.  I didn't investigate very hard, but it looks to me like Python
assumes that symbols like Py_None will have unique addresses, which they
don't in this scenario.

There's a comment in config/python.m4 that correctly describes what
Python is exposing:

# Note: selecting libpython from python_configdir works in all Python
# releases, but it generally finds a non-shared library, which means
# that we are binding the python interpreter right into libplpython.so.
# In Python 2.3 and up there should be a shared library available in
# the main library location.

but the code below this is not coming to the right conclusions, at least
not in Python 3.5.  What I'm seeing on macOS is that we get
"libpython3.5m.dylib" for python_ldlibrary, which seems right, but
python_so comes out as ".cpython-35m-darwin.so", whereas the code looks
like it's expecting that to be ".dylib".  The lack of match causes
the if-test just below that to make the wrong choice.  So some sort of
fix is needed here.

Also, I propose that we'd be better off to not rely on Py_ENABLE_SHARED
at all, but just to run the code in configure.in that insists on
finding something in $python_libdir that is named like a shared library,
because installation trees like this mean that Py_ENABLE_SHARED is
totally untrustworthy as a guide to what is actually in any particular
library directory.  Moreover, this stanza:

if test "$PORTNAME" = darwin; then
  # macOS does supply a .dylib even though Py_ENABLE_SHARED does
  # not get set.  The file detection logic below doesn't succeed
  # on older macOS versions, so make it explicit.
  python_enable_shared=1

is completely misguided because it presumes we are using the
Apple-supplied python library and not a hand-installed one.  I think
the similar exception for Windows is probably bogus for the same reason.

In short: I propose replacing all of this logic with "if there's something
in $python_libdir that has the right name to be a python shared library,
use that, else try the same in $python_configdir, else fail".  Thoughts?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Covering + unique indexes.

2016-10-04 Thread Robert Haas
On Tue, Oct 4, 2016 at 9:20 AM, Amit Kapila  wrote:
>>> I'd say that the reason for not using included columns in any
>>> operations which require comparisons, is that they don't have opclass.
>>> Let's go back to the example of points.
>>> This data type don't have any opclass for B-tree, because of fundamental
>>> reasons.
>>> And we can not apply _bt_compare() and others to this attribute, so
>>> we don't include it to scan key.
>>>
>>> create table t (i int, i2 int, p point);
>>> create index idx1 on t (i) including (i2);
>>> create index idx2 on t (i) including (p);
>>> create index idx3 on t (i) including (i2, p);
>>> create index idx4 on t (i) including (p, i2);
>>>
>>> You can keep tuples ordered in idx1, but not for idx2, partially ordered for
>>> idx3, but not for idx4.
>>
>> Yeah, I think we shouldn't go there.  I mean, once you start ordering
>> by INCLUDING columns, then you're going to need to include them in
>> leaf pages because otherwise you can't actually guarantee that they
>> are in the right order.
>
> I am not sure what you mean by above, because patch already stores
> INCLUDING columns in leaf pages.

Sorry, I meant non-leaf pages.

>>  And then you have to wonder why an INCLUDING
>> column is any different from a non-INCLUDING column.  It seems best to
>> make a firm rule that INCLUDING columns are there only for the values,
>> not for ordering.  That rule is simple and clear, which is a good
>> thing.
>
> Okay, we can make that firm rule, but I think the reasoning behind it
> should be clear.  As far as I understand from reading some of the mails in
> this thread, it is because some of the other databases don't seem to
> support ordering for included columns, or because supporting that could
> complicate the code.  One point we should keep in mind is that the
> suggestion from other databases to put many extra columns in the INCLUDING
> clause to get index-only scans might not hold equally well for
> PostgreSQL, because it can turn many otherwise-HOT updates into non-HOT
> updates.

Right.  Looking back, the originally articulated rationale for this
patch was that you might want a single index that is UNIQUE ON (a) but
also INCLUDING (b) rather than two indexes, a unique index on (a) and
a non-unique index on (a, b).  In that case, the patch is a
straight-up win: you get the same number of HOT updates either way,
but you don't use as much disk space, or spend as much CPU time and
WAL updating your indexes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rename max_parallel_degree?

2016-10-04 Thread Julien Rouhaud
On 03/10/2016 21:27, Robert Haas wrote:
> On Fri, Sep 30, 2016 at 12:23 PM, Julien Rouhaud
>  wrote:
>> I've already fixed every other issues mentioned upthread, but I'm facing
>> a problem for this one.  Assuming that the bgworker classes are supposed
>> to be mutually exclusive, I don't see a simple and clean way to add such
>> a check in SanityCheckBackgroundWorker().  Am I missing something
>> obvious, or can someone give me some advice for this?
> 
> My advice is "don't worry about it".   If somebody thinks of something
> that can be usefully added there, it will take very little time to add
> it and test that it works.  Don't get hung up on that for now.
> 

Ok, thanks!

Please find attached v9 of the patch, adding the parallel worker class
and changing max_worker_processes default to 16 and max_parallel_workers
to 8.  I also added Amit's explanation for the need of a write barrier
in ForgetBackgroundWorker().

I'll add this patch to the next commitfest.

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e826c19..61c5a7c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1984,7 +1984,7 @@ include_dir 'conf.d'
 
  Sets the maximum number of background processes that the system
  can support.  This parameter can only be set at server start.  The
- default is 8.
+ default is 16.
 
 
 
@@ -2006,8 +2006,9 @@ include_dir 'conf.d'
  Sets the maximum number of workers that can be started by a single
  Gather node.  Parallel workers are taken from the
  pool of processes established by
- .  Note that the requested
- number of workers may not actually be available at run time.  If this
+ , limited by
+ .  Note that the requested
+ number of workers may not actually be available at runtime.  If this
  occurs, the plan will run with fewer workers than expected, which may
  be inefficient.  The default value is 2.  Setting this value to 0
  disables parallel query execution.
@@ -2036,6 +2037,21 @@ include_dir 'conf.d'

   
 
+  
+   max_parallel_workers (integer)
+   
+max_parallel_workers configuration parameter
+   
+   
+   
+
+ Sets the maximum number of workers that the system can support for
+ parallel queries.  The default value is 8.  Setting this value to 0
+ disables parallel query execution.
+
+   
+  
+
   
backend_flush_after (integer)

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index cde0ed3..0177401 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -453,7 +453,8 @@ LaunchParallelWorkers(ParallelContext *pcxt)
 	snprintf(worker.bgw_name, BGW_MAXLEN, "parallel worker for PID %d",
 			 MyProcPid);
 	worker.bgw_flags =
-		BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+		BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION
+		| BGWORKER_CLASS_PARALLEL;
 	worker.bgw_start_time = BgWorkerStart_ConsistentState;
 	worker.bgw_restart_time = BGW_NEVER_RESTART;
 	worker.bgw_main = ParallelWorkerMain;
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index e42ef98..6ad8fd0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -718,9 +718,11 @@ create_plain_partial_paths(PlannerInfo *root, RelOptInfo *rel)
 	}
 
 	/*
-	 * In no case use more than max_parallel_workers_per_gather workers.
+	 * In no case use more than max_parallel_workers or
+	 * max_parallel_workers_per_gather workers.
 	 */
-	parallel_workers = Min(parallel_workers, max_parallel_workers_per_gather);
+	parallel_workers = Min(max_parallel_workers, Min(parallel_workers,
+max_parallel_workers_per_gather));
 
 	/* If any limit was set to zero, the user doesn't want a parallel scan. */
 	if (parallel_workers <= 0)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 2a49639..09dc077 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -113,6 +113,7 @@ int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 
 Cost		disable_cost = 1.0e10;
 
+int			max_parallel_workers = 8;
 int			max_parallel_workers_per_gather = 2;
 
 bool		enable_seqscan = true;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f657ffc..aa8670b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -249,6 +249,7 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
 		parse->utilityStmt == NULL &&
 		!parse->hasModifyingCTE &&
 		max_parallel_workers_per_gather > 0 &&
+		max_parallel_workers > 0 &&
 		!IsParallelWorker() &&
 		!I

Re: [HACKERS] pnstrdup considered armed and dangerous

2016-10-04 Thread Geoff Winkless
On 4 October 2016 at 14:12, Geoff Winkless  wrote:
> Well I wouldn't say it's wrong, exactly. It might produce a segfault
> if relationName[NAMEDATALEN] is outside readable memory for the
> process, but otherwise it will behave as defined.

Finger slippage. Of course I meant

... if relationName[NAMEDATALEN-1] is outside...

Geoff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Covering + unique indexes.

2016-10-04 Thread Amit Kapila
On Tue, Sep 27, 2016 at 7:51 PM, Robert Haas  wrote:
> On Mon, Sep 26, 2016 at 11:17 AM, Anastasia Lubennikova
>  wrote:
>>> Is there any fundamental problem in storing them in ordered way?  I
>>> mean to say, you need to anyway store all the column values on leaf
>>> page, so why can't we find the exact location for the complete key.
>>> Basically use truncated key to reach to leaf level and then use the
>>> complete key to find the exact location to store the key.  I might be
>>> missing some thing here, but if we can store them in ordered fashion,
>>> we can use them even for queries containing ORDER BY (where ORDER BY
>>> contains included columns).
>>
>> I'd say that the reason for not using included columns in any
>> operations which require comparisons, is that they don't have opclass.
>> Let's go back to the example of points.
>> This data type don't have any opclass for B-tree, because of fundamental
>> reasons.
>> And we can not apply _bt_compare() and others to this attribute, so
>> we don't include it to scan key.
>>
>> create table t (i int, i2 int, p point);
>> create index idx1 on t (i) including (i2);
>> create index idx2 on t (i) including (p);
>> create index idx3 on t (i) including (i2, p);
>> create index idx4 on t (i) including (p, i2);
>>
>> You can keep tuples ordered in idx1, but not for idx2, partially ordered for
>> idx3, but not for idx4.
>
> Yeah, I think we shouldn't go there.  I mean, once you start ordering
> by INCLUDING columns, then you're going to need to include them in
> leaf pages because otherwise you can't actually guarantee that they
> are in the right order.
>

I am not sure what you mean by above, because patch already stores
INCLUDING columns in leaf pages.

>  And then you have to wonder why an INCLUDING
> column is any different from a non-INCLUDING column.  It seems best to
> make a firm rule that INCLUDING columns are there only for the values,
> not for ordering.  That rule is simple and clear, which is a good
> thing.
>

Okay, we can make that firm rule, but I think the reasoning behind it
should be clear.  As far as I understand from reading some of the mails in
this thread, it is because some of the other databases don't seem to
support ordering for included columns, or because supporting that could
complicate the code.  One point we should keep in mind is that the
suggestion from other databases to put many extra columns in the INCLUDING
clause to get index-only scans might not hold equally well for
PostgreSQL, because it can turn many otherwise-HOT updates into non-HOT
updates.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pnstrdup considered armed and dangerous

2016-10-04 Thread Geoff Winkless
On 3 October 2016 at 22:55, Andres Freund  wrote:
> A colleage of me just wrote innocent looking code like
> char *shardRelationName = pnstrdup(relationName, NAMEDATALEN);
> which is at the moment wrong if relationName isn't preallocated to
> NAMEDATALEN size.
[snip]
> isn't that a somewhat weird behaviour / implementation? Not really like
> strndup(), which one might believe to be analoguous...

Well I wouldn't say it's wrong, exactly. It might produce a segfault
if relationName[NAMEDATALEN] is outside readable memory for the
process, but otherwise it will behave as defined.

Geoff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-04 Thread Masahiko Sawada
On Tue, Oct 4, 2016 at 8:29 PM, Ashutosh Bapat
 wrote:
> On Tue, Oct 4, 2016 at 1:11 PM, Amit Langote
>  wrote:
>> On 2016/10/04 16:10, Ashutosh Bapat wrote:
> Heuristics can not become the default behavior. A user should be given
> an option to choose a heuristic, and he should be aware of the
> pitfalls when using this heuristic. I guess, first, we need to get a
> solution which ensures that the transaction gets committed on all the
> servers or is rolled back on all the foreign servers involved. AFAIR,
> my patch did that. Once we have that kind of solution, we can think
> about heuristics.

 I meant that we could determine it heuristically only when a remote server
 crashed in the 2nd phase of 2PC.
 For example, what does the local server return to the client when no
 remote server returns OK to the local server in the 2nd phase of 2PC for
 more than statement_timeout seconds? OK or an error?

>>>
>>> The local server doesn't wait for the completion of the second phase
>>> to finish the currently running statement. Once all the foreign
>>> servers have responded to PREPARE request in the first phase, the
>>> local server responds to the client. Am I missing something?
>>
>> PREPARE sent to foreign servers involved in a given transaction is
>> *transparent* to the user who started the transaction, no?  That is, user
>> just says COMMIT and if it is found that there are multiple servers
>> involved in the transaction, it must be handled using two-phase commit
>> protocol *behind the scenes*.  So the aforementioned COMMIT should not
>> return to the client until after the above two-phase commit processing has
>> finished.
>
> No, the COMMIT returns after the first phase. It can not wait for all
> the foreign servers to complete their second phase

Hm, it sounds like that is the same as a normal commit (not 2PC).
What's the difference?

My understanding is that basically the local server cannot return
COMMIT to the client until the 2nd phase is completed.
Otherwise the next transaction could see data that is not yet committed
on the remote server.

> , which can take
> quite a long time (or never finish) if one of the servers has crashed in between.
>
>>
>> Or are you and Sawada-san talking about the case where the user issued
>> PREPARE and not COMMIT?
>
> I guess, Sawada-san is still talking about the user issued PREPARE.
> But my comment is applicable otherwise as well.
>

Yes, I'm considering the case where the local server tries to COMMIT
but a remote server crashed after the local server completed the 1st
phase (PREPARE) on all the remote servers.
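
To make the sequencing under discussion concrete, here is a rough sketch in
C of a coordinator-side commit over multiple foreign servers. It is purely
illustrative: remote_prepare(), remote_commit_prepared() and
commit_local_transaction() are made-up names, not functions from any posted
patch.

#include "postgres.h"
#include "foreign/foreign.h"
#include "nodes/pg_list.h"

/* Hypothetical helpers, named for illustration only. */
extern void remote_prepare(ForeignServer *server);
extern void remote_commit_prepared(ForeignServer *server);
extern void commit_local_transaction(void);

static void
commit_distributed_xact_sketch(List *foreign_servers)
{
	ListCell   *lc;

	/* Phase 1: PREPARE TRANSACTION on every involved foreign server. */
	foreach(lc, foreign_servers)
		remote_prepare((ForeignServer *) lfirst(lc));

	/* Commit locally once all remote PREPAREs have succeeded. */
	commit_local_transaction();

	/*
	 * Phase 2: COMMIT PREPARED on each server.  The open question above is
	 * whether the client's COMMIT should be acknowledged before or after
	 * this loop, given that a crashed server can block it indefinitely.
	 */
	foreach(lc, foreign_servers)
		remote_commit_prepared((ForeignServer *) lfirst(lc));
}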

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Renaming of pg_xlog and pg_clog

2016-10-04 Thread David Steele
On 10/4/16 1:48 AM, Michael Paquier wrote:

> So this is still open for votes. Here are the candidates and who
> voiced for what:
> - pg_transaction: Michael P, Thomas M. => Current 0002 is doing that.
> - pg_xact: David S, Robert
> - pg_trans: Tom
> - pg_transaction_status: Peter E.

Christoph also prefers pg_xact [1].

-- 
-David
da...@pgmasters.net

[1]
https://www.postgresql.org/message-id/20161003141959.qhvd6roesj4kp...@msg.df7cb.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pnstrdup considered armed and dangerous

2016-10-04 Thread Robert Haas
On Mon, Oct 3, 2016 at 5:55 PM, Andres Freund  wrote:
> /*
>  * pnstrdup
>  *  Like pstrdup(), but append null byte to a
>  *  not-necessarily-null-terminated input string.
>  */
> char *
> pnstrdup(const char *in, Size len)
> {
> char   *out = palloc(len + 1);
>
> memcpy(out, in, len);
> out[len] = '\0';
> return out;
> }
>
> isn't that a somewhat weird behaviour / implementation? Not really like
> strndup(), which one might believe to be analoguous...

Yikes!
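
For comparison, a variant with strndup()-like semantics would bound the copy
by the first NUL as well as by len, so it never reads past the end of a
shorter input. A minimal sketch, illustrative only and not the current
implementation; Size and palloc() are assumed to be the usual backend
definitions:

#include <string.h>

char *
pnstrdup_sketch(const char *in, Size len)
{
	Size		maxlen = strnlen(in, len);	/* stop at NUL or len, whichever comes first */
	char	   *out = palloc(maxlen + 1);

	memcpy(out, in, maxlen);
	out[maxlen] = '\0';
	return out;
}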

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Cardinality estimation for group by

2016-10-04 Thread Chenxi Li
Friends,

How is cardinality estimation for "group by" done, and where is the code
that does it?

Best Regards,
Chenxi Li


Re: [HACKERS] asynchronous execution

2016-10-04 Thread Amit Khandekar
On 4 October 2016 at 02:30, Robert Haas  wrote:
> On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar  
> wrote:
>> On 24 September 2016 at 06:39, Robert Haas  wrote:
>>> Since Kyotaro Horiguchi found that my previous design had a
>>> system-wide performance impact due to the ExecProcNode changes, I
>>> decided to take a different approach here: I created an async
>>> infrastructure where both the requestor and the requestee have to be
>>> specifically modified to support parallelism, and then modified Append
>>> and ForeignScan to cooperate using the new interface.  Hopefully that
>>> means that anything other than those two nodes will suffer no
>>> performance impact.  Of course, it might have other problems
>>
>> I see that the reason why you re-designed the asynchronous execution
>> implementation is because the earlier implementation showed
>> performance degradation in local sequential and local parallel scans.
>> But I checked that the ExecProcNode() changes were not that
>> significant as to cause the degradation.
>
> I think we need some testing to prove that one way or the other.  If
> you can do some - say on a plan with multiple nested loop joins with
> inner index-scans, which will call ExecProcNode() a lot - that would
> be great.  I don't think we can just rely on "it doesn't seem like it
> should be slower"
Agreed. I will come up with some tests.

> , though - ExecProcNode() is too important a function
> for us to guess at what the performance will be.

Also, parent pointers are not required in the new design. Thinking about
parent pointers, it now seems the event won't get bubbled up the tree
with the new design. But still, I think it's possible to switch over
to another asynchronous tree when some node in the current subtree
is waiting. I am not sure, though; I will think more about that.

>
> The thing I'm really worried about with either implementation is what
> happens when we start to add asynchronous capability to multiple
> nodes.  For example, if you imagine a plan like this:
>
> Append
> -> Hash Join
>   -> Foreign Scan
>   -> Hash
> -> Seq Scan
> -> Hash Join
>   -> Foreign Scan
>   -> Hash
> -> Seq Scan
>
> In order for this to run asynchronously, you need not only Append and
> Foreign Scan to be async-capable, but also Hash Join.  That's true in
> either approach.  Things are slightly better with the original
> approach, but the basic problem is there in both cases.  So it seems
> we need an approach that will make adding async capability to a node
> really cheap, which seems like it might be a problem.

Yes, we might have to deal with this.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Heikki Linnakangas

On 10/04/2016 02:09 PM, Simon Riggs wrote:

On 4 October 2016 at 11:47, Heikki Linnakangas  wrote:


When we finish writing an initial run to a tape, we keep the tape buffers
around. That's the reason we need to reserve that memory. But there's no
fundamental reason we have to do that. As I suggested in [2], we could flush
the buffers to disk, and only keep in memory the location of the last block
we wrote. If we need to write another run to the tape, we can reload the
last incomplete block from the disk, and continue writing.

Reloading the last block, requires an extra I/O. That's OK. It's quite rare
to have a multi-pass sort these days, so it's a good bet that you don't need
to do it. And if you have a very small work_mem, so that you need to do a
multi-pass sort, having some more memory available for building the initial
runs is probably worth it, and there's also a good chance that the block is
still in the OS cache.


Sounds like a good idea.

Why not just make each new run start at a block boundary?
That way we waste on average BLCKSZ/2 disk space per run, which is
negligible but we avoid any need to have code to read back in the last
block.


Hmm. You'd still have to read back the last block, so that you can 
update its next-pointer.


What you could do, though, is to always store only one run on each tape. 
If we can make the in-memory representation of a tape small enough, that 
might be acceptable. You only need 4 bytes to store the starting block 
number. Or we could dump the starting block numbers of the tapes, to yet 
another tape. The number of tapes you merge in one pass would still be 
limited by the amount of memory you have, but the number of tapes would 
be unlimited.


I don't want to do that right now, it gets more invasive. Something to 
keep in mind for the future, though. This also related to your other 
suggestion below:



That also makes it easier to put each run in its own file, which will
surely help with parallelism and compression since we can spread
multiple run files across multiple temp tablespaces.

Can we also please read in the value from the last tuple in a run, so
we have min and max values in memory for each run? That can be used
during the merge step to avoid merging runs unless the value ranges
overlap.


This gets more advanced... Yeah, stuff like that would make sense. You 
could also split the tapes into smaller parts, and store the min/max for 
each part, to make the merging even more efficient and easier to 
parallelize. Or to take that to the extreme, you could make each tape a 
little B-tree.


But that's for another day and another patch :-). What I have now is a 
drop-in replacement for the current logical tape implementation. These 
more advanced schemes would build on top of that.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgpassfile connection option

2016-10-04 Thread Julian Markwort

On 09/26/2016 07:51 PM, Robert Haas wrote:

However, they don't have
to accept the possibility that arbitrary local files readable by the
user ID will be used for authentication and/or disclosed; this patch
would force them to accept that risk.
I do agree with you; however, we might have to take a look at the sslkey
parameter's implementation here as well - there are no checks in place to
stop you from using rogue sslkey parameters.
I'd like to suggest having both of these parameters behave in a similar
fashion. In order to achieve safe behaviour, we could introduce environment
variables that prohibit the use of user-located pgpassfiles and sslkeys.

How about PGSECRETSLOCATIONLOCK ?


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2016-10-04 Thread Etsuro Fujita

On 2016/09/30 1:09, Ashutosh Bapat wrote:

You wrote:

The attached patch adds the
dependencies from create_foreignscan_plan() which is called for any
foreign access. I have also added testcases to test the functionality.
Let me know your comments on the patch.


I wrote:

Hmm.  I'm not sure that that's a good idea.

I was thinking the changes to setrefs.c proposed by Amit to collect that
dependencies would be probably OK, but I wasn't sure that it's a good idea
that he used PlanCacheFuncCallback as the syscache inval callback function
for the foreign object caches because it invalidates not only generic plans
but query trees, as you mentioned downthread.  So, I was thinking to modify
his patch so that we add a new syscache inval callback function for the
caches that is much like PlanCacheFuncCallback but only invalidates generic
plans.



PlanCacheFuncCallback() invalidates the query tree only when
invalItems are added to the plan source. The patch adds the
dependencies in root->glob->invalItems, which standard_planner()
copies into PlannedStmt::invalItems. This is then copied into the
gplan->stmt_list. Thus PlanCacheFuncCallback never invalidates the
query tree. I have verified this under the debugger. Am I missing
something?


I think you are right.  Maybe my explanation was not enough, sorry for 
that.  I just wanted to mention that Amit's patch would invalidate the 
generic plan as well as the rewritten query tree.


I looked at your patch closely.  You added code to track dependencies on 
FDW-related objects to createplan.c, but I think it would be more 
appropriate to put that code in setrefs.c like the attached.  I think 
record_foreignscan_plan_dependencies in your patch would be a bit 
inefficient because that tracks such dependencies redundantly in the 
case where the given ForeignScan has an outer subplan, so I optimized 
that in the attached.  (Also some regression tests for that were added.)


I think it would be a bit inefficient to use PlanCacheFuncCallback as 
the inval callback function for those caches, because that checks the 
inval-item list for the rewritten query tree, but any updates on such 
catalogs wouldn't affect the query tree.  So, I added a new callback 
function for those caches that is much like PlanCacheFuncCallback but 
skips checking the list for the query tree.  I updated some comments also.


Best regards,
Etsuro Fujita


foreign_plan_cache_inval_v2.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-04 Thread Ashutosh Bapat
On Tue, Oct 4, 2016 at 1:11 PM, Amit Langote
 wrote:
> On 2016/10/04 16:10, Ashutosh Bapat wrote:
 Heuristics can not become the default behavior. A user should be given
 an option to choose a heuristic, and he should be aware of the
 pitfalls when using this heuristic. I guess, first, we need to get a
 solution which ensures that the transaction gets committed on all the
 servers or is rolled back on all the foreign servers involved. AFAIR,
 my patch did that. Once we have that kind of solution, we can think
 about heuristics.
>>>
>>> I meant that we could determine it heuristically only when remote server
>>> crashed in 2nd phase of 2PC.
>>> For example, what does the local server returns to client when no one remote
>>> server returns OK to local server in 2nd phase of 2PC for more than
>>> statement_timeout seconds? Ok or error?
>>>
>>
>> The local server doesn't wait for the completion of the second phase
>> to finish the currently running statement. Once all the foreign
>> servers have responded to PREPARE request in the first phase, the
>> local server responds to the client. Am I missing something?
>
> PREPARE sent to foreign servers involved in a given transaction is
> *transparent* to the user who started the transaction, no?  That is, user
> just says COMMIT and if it is found that there are multiple servers
> involved in the transaction, it must be handled using two-phase commit
> protocol *behind the scenes*.  So the aforementioned COMMIT should not
> return to the client until after the above two-phase commit processing has
> finished.

No, the COMMIT returns after the first phase. It cannot wait for all
the foreign servers to complete their second phase, which can take
quite a long time (or never finish) if one of the servers has crashed in
between.

>
> Or are you and Sawada-san talking about the case where the user issued
> PREPARE and not COMMIT?

I guess, Sawada-san is still talking about the user issued PREPARE.
But my comment is applicable otherwise as well.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Incorrect comment/doc for poll_query_until

2016-10-04 Thread Daniel Gustafsson
Commit 2a0f89cd717ce6d raised the timeout in poll_query_until() to 180 seconds
but left the documentation and comments at the previous value of 90 sec.  Patch
attached to update documentation and comment to match reality.

cheers ./daniel



poll_query_until_timeout_doc.diff
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Simon Riggs
On 4 October 2016 at 11:47, Heikki Linnakangas  wrote:

> When we finish writing an initial run to a tape, we keep the tape buffers
> around. That's the reason we need to reserve that memory. But there's no
> fundamental reason we have to do that. As I suggested in [2], we could flush
> the buffers to disk, and only keep in memory the location of the last block
> we wrote. If we need to write another run to the tape, we can reload the
> last incomplete block from the disk, and continue writing.
>
> Reloading the last block, requires an extra I/O. That's OK. It's quite rare
> to have a multi-pass sort these days, so it's a good bet that you don't need
> to do it. And if you have a very small work_mem, so that you need to do a
> multi-pass sort, having some more memory available for building the initial
> runs is probably worth it, and there's also a good chance that the block is
> still in the OS cache.

Sounds like a good idea.

Why not just make each new run start at a block boundary?
That way we waste on average BLCKSZ/2 disk space per run, which is
negligible but we avoid any need to have code to read back in the last
block.

That also makes it easier to put each run in its own file, which will
surely help with parallelism and compression since we can spread
multiple run files across multiple temp tablespaces.

Can we also please read in the value from the last tuple in a run, so
we have min and max values in memory for each run? That can be used
during the merge step to avoid merging runs unless the value ranges
overlap.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Logical tape pause/resume

2016-10-04 Thread Heikki Linnakangas
One of the patches in Peter Geoghegan's Parallel tuplesort patch set [1] 
is to put a cap on the number of tapes to use for the sort. The reason 
is that each tape consumes 3 * BLCKSZ worth of memory, for the buffers. 
We decide the max number of tapes upfront, so if we decide that e.g. the 
max number of tapes is 1000, we reserve 1000 * 3 * BLCKSZ bytes of 
memory for the buffers, and that memory cannot then be used for the 
quicksorts to build the initial runs, even if it turns out later that we 
needed only a few tapes.


That amounts to about 8% of the available memory. That's quite wasteful. 
Peter's approach of putting an arbitrary cap on the max number of tapes 
works, but I find it a bit hackish. And you still waste that 8% with 
smaller work_mem settings.


When we finish writing an initial run to a tape, we keep the tape 
buffers around. That's the reason we need to reserve that memory. But 
there's no fundamental reason we have to do that. As I suggested in [2], 
we could flush the buffers to disk, and only keep in memory the location 
of the last block we wrote. If we need to write another run to the tape, 
we can reload the last incomplete block from the disk, and continue writing.


Reloading the last block, requires an extra I/O. That's OK. It's quite 
rare to have a multi-pass sort these days, so it's a good bet that you 
don't need to do it. And if you have a very small work_mem, so that you 
need to do a multi-pass sort, having some more memory available for 
building the initial runs is probably worth it, and there's also a good 
chance that the block is still in the OS cache.


So, here are two patches to that end:

1. The first patch changes the way we store the logical tapes on disk. 
Instead of using indirect blocks, HFS style, the blocks form a linked 
list. Each block has a trailer, with the block numbers of the previous 
and next block of the tape. That eliminates the indirect blocks, which 
simplifies the code quite a bit, and makes it simpler to implement the 
pause/resume functionality in the second patch. It also immediately 
reduces the memory needed for the buffers, from 3 to 1 block per tape, 
as we don't need to hold the indirect blocks in memory.
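
To make the new on-disk layout concrete, here is a minimal sketch of such a
per-block trailer. The struct and field names are illustrative assumptions,
not necessarily what the patch itself uses:

/*
 * Illustrative sketch: each BLCKSZ-sized block on a logical tape ends with
 * a small trailer linking it to its neighbours, replacing the old
 * HFS-style indirect blocks.
 */
typedef struct TapeBlockTrailerSketch
{
	long		prev;			/* block number of the previous block on this
								 * tape, or -1L for the tape's first block */
	long		next;			/* block number of the next block; the last
								 * block could instead record how many payload
								 * bytes it holds, e.g. as a negative value */
} TapeBlockTrailerSketch;

/* Payload bytes available per block under this layout. */
#define TapeBlockPayloadSizeSketch	(BLCKSZ - sizeof(TapeBlockTrailerSketch))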


2. The second patch adds a new function LogicalTapePause(). A paused 
tape can be be written to, but it doesn't have an in-memory buffer. The 
last, incomplete block is written to disk, but if you call 
LogicalTapeWrite(), a new buffer is allocated and the last block is read 
back into memory. If you don't write to the tape anymore, and call 
LogicalTapeRewind() after LogicalTapePause(), nothing needs to be done, 
as the last block is already on disk.
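
As an illustration of the intended call pattern, a hedged sketch follows;
the exact signature of LogicalTapePause() is assumed here, and
finish_run_sketch() is a made-up caller, not code from the patch:

static void
finish_run_sketch(LogicalTapeSet *lts, int tapenum)
{
	/* ... the sorted run has just been written with LogicalTapeWrite() ... */

	/*
	 * Instead of keeping this tape's buffer around until the merge phase,
	 * flush the last (possibly incomplete) block to disk and release the
	 * buffer.  A later LogicalTapeWrite() on this tape would reload that
	 * block transparently; a LogicalTapeRewind() needs no extra work.
	 */
	LogicalTapePause(lts, tapenum);
}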


In addition to saving a little bit of memory, I'd like to do this 
refactoring because it simplifies the code. It's code that has stayed 
largely unchanged for the past 15 years, so I'm not too eager to touch 
it, but looking at the changes coming with Peter's parallel tuplesort 
patch set, I think this makes sense. The parallel tuplesort patch set 
adds code for "serializing" and "deserializing" a tape, which means 
writing the top indirect block to disk, so that it can be read back in 
in the leader process. That's awfully similar to the "pause/resume" 
functionality here, so by adding it now, we won't have to do it in the 
parallel tuplesort patch. (Admittedly, it's not a whole lot of code, but 
still.)


[1] CAM3SWZQKM=Pzc=cahzrixkjp2eo5q0jg1sofqqexfq647ji...@mail.gmail.com
[2] e8f44b63-4745-b855-7772-e8201906a...@iki.fi

- Heikki
>From 5fd57472432ed0641aaa5552f0486e459d03021e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 21 Sep 2016 17:20:57 +0300
Subject: [PATCH 1/2] Simplify tape block format.

No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory to
hold the indirect block (or blocks). To reflect that, TAPE_BUFFER_OVERHEAD
is reduced from 3 to 1 buffer, which allows using more memory for building
the initial runs.
---
 src/backend/utils/sort/logtape.c   | 552 +
 src/backend/utils/sort/tuplesort.c |  17 +-
 src/include/utils/logtape.h|   2 +-
 3 files changed, 139 insertions(+), 432 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index caa6960..cc58408 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -31,15 +31,8 @@
  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
  * of which blocks in the underlying file belong to which logical tape,
  * plus any blocks that are free (recycled and not yet reused).
- * The blocks in each logical tape are remembered using a method borrowed
- * from the Unix HFS filesystem: we store data block numbers in an
- * "indirect block".  If an indirect block fills up, we write it out to
- * the underlying file and remember its location in a second-level indirect
- * block.  In the same way second-l

Re: [HACKERS] Question / requests.

2016-10-04 Thread Francisco Olarte
On Mon, Oct 3, 2016 at 11:44 PM, Alvaro Herrera
 wrote:
> Robert Haas wrote:
>> On Fri, Sep 30, 2016 at 11:20 AM, Francisco Olarte
>> What messages are you seeing, exactly? "auto-deadlocking" isn't a thing.
> https://www.postgresql.org/message-id/57EBC9AE.2060302%40163.com

Besides that, even the docs for -j state: "Note that using this mode
together with the -f (FULL) option might cause deadlock failures if
certain system catalogs are processed in parallel."

So the only "safe" way to do -j -f seems to be using a table list. My
proposed patch just makes it easy to build that by doing schema
filtering.

> I wonder if the real answer isn't just to disallow -f with parallel
> vacuuming.

It may be. I do not feel it is necessary; the docs are clear, and this may
be like disallowing knives because you can cut yourself. IMO vacuumdb
-f and -j are for people who know what they are doing, though a warning may
be nice anyway.

Anyway, even if the combo is disallowed, I feel schema filtering has
its uses. As an example, in some of my systems I have CDR tables
partitioned by timestamp, either monthly or by some other period. As most of
the data does not change, I routinely coalesce many partitions and move
them to a historic schema (e.g. I make a yearly partition and zap the
monthly ones, still inheriting from it). This lets me dump the historic
schema only when I change it and dump without the historic schema daily,
greatly reducing dump times, as the historic schema typically contains
>90% of the data and only changes when I have to apply a back-fix to
the data, which is very rare. Being able to do the same thing with
vacuumdb could be useful.

So, I'll just carry on executing my plan, but I'm prepared to abort
it anytime if people feel it does not pull its weight. Most of the
work is going to be learning how to submit a patch so that it is reusable
for me.

Francisco Olarte.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v19)

2016-10-04 Thread Dean Rasheed
On 4 October 2016 at 09:15, Heikki Linnakangas  wrote:
> However, for tables and views, each object you store in those views is a
> "table" or "view", but with this thing, the object you store is
> "statistics". Would you have a catalog table called "pg_scissor"?
>

No, probably not (unless it was storing individual scissor blades).

However, in this case, we have related pre-existing catalog tables, so...

> We call the current system table "pg_statistic", though. I agree we should
> call it pg_mv_statistic, in singular, to follow the example of pg_statistic.
>
> Of course, the user-friendly system view on top of that is called
> "pg_stats", just to confuse things more :-).
>

I agree. Given where we are, with a pg_statistic table and a pg_stats
view, I think the least worst solution is to have a pg_statistic_ext
table, and then maybe a pg_stats_ext view.


>> It doesn't seem like the end of the world that it doesn't
>> match the user-facing syntax. A bigger concern is the use of "mv" in
>> the name, because as has already been pointed out, this table may also
>> in the future be used to store univariate expression and partial
>> statistics, so I think we should drop the "mv" and go with something
>> like pg_statistic_ext, or some other more general name.
>
>
> Also, "mv" makes me think of materialized views, which is completely
> unrelated to this.
>

Yeah, I hadn't thought of that.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] Should we fix postmaster to avoid slow shutdown?

2016-10-04 Thread Tsunakawa, Takayuki
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> FWIW, I'm pretty much -1 on messing with the timing of the socket close
> actions.  I broke that once within recent memory, so maybe I'm gun-shy,
> but I think that the odds of unpleasant side effects greatly outweigh any
> likely benefit there.

I couldn't find any relevant mails in pgsql-hackers.  I found no problem with 
the attached patch.  Do you think this is OK?

Regards
Takayuki Tsunakawa




02_close_listen_ports_early.patch
Description: 02_close_listen_ports_early.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v19)

2016-10-04 Thread Gavin Flower

On 04/10/16 20:37, Dean Rasheed wrote:

On 4 October 2016 at 04:25, Michael Paquier  wrote:

OK. A second thing was related to the use of schemas in the new system
catalogs. As mentioned in [1], those could be removed.
[1]: 
https://www.postgresql.org/message-id/cab7npqtu40q5_nsghvomjfbyh1hdtqmbfdj+kwfjspam35b...@mail.gmail.com.


That doesn't work, because if the intention is to be able to one day
support statistics across multiple tables, you can't assume that the
statistics are in the same schema as the table.

In fact, if multi-table statistics are to be allowed in the future, I
think you want to move away from thinking of statistics as depending
on and referring to a single table, and handle them more like views --
i.e., store a pg_node_tree representing the from_clause and add
multiple dependencies at statistics creation time. That was what I was
getting at upthread when I suggested the alternate syntax, and also
answers Tomas' question about how JOIN might one day be supported.

Of course, if we don't think that we will ever support multi-table
statistics, that all goes away, and you may as well make the
statistics name local to the table, but I think that's a bit limiting.
One way or the other, I think this is a question that needs to be
answered now. My vote is to leave expansion room to support
multi-table statistics in the future.

Regards,
Dean


I can see multi-table statistics being useful if one is trying to 
optimise indexes for multiple joins.


I'm assuming that the statistics can be accessed by the user as well as 
by the planner? (I've only lightly followed this thread, so I might have 
missed significant relevant details!)



Cheers,
Gavin



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2016-10-04 Thread Gavin Flower

On 04/10/16 20:15, Michael Paquier wrote:

On Mon, Oct 3, 2016 at 11:52 PM, Daniel Verite  wrote:

Wouldn't pgbench benefit from it?
It was mentioned some time ago [1], in relationship to the
\into construct, how client-server latency was important enough to
justify the use of a "\;" separator between statements, to send them
as a group.

But with the libpq batch API, maybe this could be modernized
with meta-commands like this:
   \startbatch
   ...
   \endbatch

Or just \batch [on|off], which sounds like a damn good idea to me for
some users willing to test some workloads before integrating it in an
application.


+1

'\batch' is a bit easier to find and to remember than '\startbatch'.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2016-10-04 Thread Craig Ringer
On 4 Oct. 2016 15:15, "Michael Paquier"  wrote:
>
> On Mon, Oct 3, 2016 at 11:52 PM, Daniel Verite 
wrote:
> > Wouldn't pgbench benefit from it?
> > It was mentioned some time ago [1], in relationship to the
> > \into construct, how client-server latency was important enough to
> > justify the use of a "\;" separator between statements, to send them
> > as a group.
> >
> > But with the libpq batch API, maybe this could be modernized
> > with meta-commands like this:
> >   \startbatch
> >   ...
> >   \endbatch
>
> Or just \batch [on|off], which sounds like a damn good idea to me for
> some users willing to test some workloads before integrating it in an
> application.

A batch isn't necessarily terminated by a commit, so I'm more keen on
start/end batch. It's more in line with begin/commit. Batching is not only
a mode; you also have to delineate batches.
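
For illustration, a pgbench script using the proposed meta-commands
(hypothetical syntax, obviously) could delineate a batch inside a larger
transaction like this:

    \set aid random(1, 100000 * :scale)
    BEGIN;
    \startbatch
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
    SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
    \endbatch
    COMMIT;

Here \startbatch/\endbatch only delimit what is sent as a group; the
transaction is still delimited by BEGIN/COMMIT, so the two kinds of
boundaries stay independent.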


Re: [HACKERS] [PATCH] Generic type subscription

2016-10-04 Thread Victor Wagner
On Sat, 1 Oct 2016 16:52:34 +0700
Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > I've tried to compile this patch with current state of master
> > (commit 51c3e9fade76c12) and found out that, when configured with
> > --enable-cassert, it doesn't pass make check.
> 
> Thanks for the feedback. Yes, unexpectedly for me, `ExecEvalExpr` can
> return expanded `jbvArray` and `jbvObject` instead of `jbvBinary` in
> both cases. It's interesting that this doesn't break anything, but it
> obviously violates the `pushJsonbValueScalar` semantics. I don't
> think `ExecEvalExpr` should be changed for jsonb; we can handle this
> situation in `pushJsonbValue` instead. I've attached a new version of
> the patch with this modification.

Thanks for your quick response. I've not yet thoroughly investigated the
new patch, but I have already noticed some minor problems:

Git complains about whitespace in this version of patch:

$ git apply ../generic_type_subscription_v2.patch
../generic_type_subscription_v2.patch:218: tab in indent.
        int first;
../generic_type_subscription_v2.patch:219: tab in indent.
        int second;
../generic_type_subscription_v2.patch:225: tab in indent.
        SubscriptionExecData *sbsdata = (SubscriptionExecData *) PG_GETARG_POINTER(1);
../generic_type_subscription_v2.patch:226: tab in indent.
        Custom  *result = (Custom *) sbsdata->containerSource;
../generic_type_subscription_v2.patch:234: tab in indent.
        SubscriptionRef *sbsref = (SubscriptionRef *) PG_GETARG_POINTER(0);
warning: squelched 29 whitespace errors
warning: 34 lines add whitespace errors.

It doesn't prevent me from further testing of the patch, but it's worth noting.
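
(In case it is useful: all of the offending lines can be listed in one
go, without applying anything, with

    git apply --check --whitespace=error-all ../generic_type_subscription_v2.patch

which should make them easy to clean up before the next version.)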

-- 

Victor



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v19)

2016-10-04 Thread Heikki Linnakangas

On 10/04/2016 10:49 AM, Dean Rasheed wrote:

On 30 September 2016 at 12:10, Heikki Linnakangas  wrote:

I fear that using "statistics" as the name of the new object might get a bit
awkward. "statistics" is a plural, but we use it as the name of a single
object, like "pants" or "scissors". Not sure I have any better ideas though.
"estimator"? "statistics collection"? Or perhaps it should be singular,
"statistic". I note that you actually called the system table
"pg_mv_statistic", in singular.


I think it's OK. The functional dependency is a single statistic, but
MCV lists and histograms are multiple statistics (multiple facts about
the data sampled), so in general when you create one of these new
objects, you are creating multiple statistics about the data.


Ok. I don't really have any better ideas, was just hoping that someone 
else would.



Also I find "CREATE STATISTIC" just sounds a bit clumsy compared to
"CREATE STATISTICS".


Agreed.


The convention for naming system catalogs seems to be to use the
singular for tables and plural for views, so I guess we should stick
with that.


However, for tables and views, each object you store in those views is a 
"table" or "view", but with this thing, the object you store is 
"statistics". Would you have a catalog table called "pg_scissor"?


We call the current system table "pg_statistic", though. I agree we 
should call it pg_mv_statistic, in singular, to follow the example of 
pg_statistic.


Of course, the user-friendly system view on top of that is called 
"pg_stats", just to confuse things more :-).



It doesn't seem like the end of the world that it doesn't
match the user-facing syntax. A bigger concern is the use of "mv" in
the name, because as has already been pointed out, this table may also
in the future be used to store univariate expression and partial
statistics, so I think we should drop the "mv" and go with something
like pg_statistic_ext, or some other more general name.


Also, "mv" makes me think of materialized views, which is completely 
unrelated to this.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v19)

2016-10-04 Thread Dean Rasheed
On 30 September 2016 at 12:10, Heikki Linnakangas  wrote:
> I fear that using "statistics" as the name of the new object might get a bit
> awkward. "statistics" is a plural, but we use it as the name of a single
> object, like "pants" or "scissors". Not sure I have any better ideas though.
> "estimator"? "statistics collection"? Or perhaps it should be singular,
> "statistic". I note that you actually called the system table
> "pg_mv_statistic", in singular.
>

I think it's OK. The functional dependency is a single statistic, but
MCV lists and histograms are multiple statistics (multiple facts about
the data sampled), so in general when you create one of these new
objects, you are creating multiple statistics about the data. Also I
find "CREATE STATISTIC" just sounds a bit clumsy compared to "CREATE
STATISTICS".

The convention for naming system catalogs seems to be to use the
singular for tables and plural for views, so I guess we should stick
with that. It doesn't seem like the end of the world that it doesn't
match the user-facing syntax. A bigger concern is the use of "mv" in
the name, because as has already been pointed out, this table may also
in the future be used to store univariate expression and partial
statistics, so I think we should drop the "mv" and go with something
like pg_statistic_ext, or some other more general name.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2016-10-04 Thread Amit Langote
On 2016/10/04 16:10, Ashutosh Bapat wrote:
>>> Heuristics can not become the default behavior. A user should be given
>>> an option to choose a heuristic, and he should be aware of the
>>> pitfalls when using this heuristic. I guess, first, we need to get a
>>> solution which ensures that the transaction gets committed on all the
>>> servers or is rolled back on all the foreign servers involved. AFAIR,
>>> my patch did that. Once we have that kind of solution, we can think
>>> about heuristics.
>>
>> I meant that we could determine it heuristically only when a remote
>> server crashes in the 2nd phase of 2PC.
>> For example, what does the local server return to the client when no
>> remote server returns OK to the local server in the 2nd phase of 2PC
>> for more than statement_timeout seconds? OK or error?
>>
> 
> The local server doesn't wait for the completion of the second phase
> to finish the currently running statement. Once all the foreign
> servers have responded to the PREPARE request in the first phase, the
> local server responds to the client. Am I missing something?

PREPARE sent to the foreign servers involved in a given transaction is
*transparent* to the user who started the transaction, no?  That is, the
user just says COMMIT, and if it is found that there are multiple servers
involved in the transaction, it must be handled using the two-phase commit
protocol *behind the scenes*.  So the aforementioned COMMIT should not
return to the client until after that two-phase commit processing has
finished.
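
Concretely, for a transaction that has written through foreign tables on
two servers, the single user-level COMMIT would drive something like the
following (the foreign table names and GIDs below are invented):

    -- what the user runs on the local server:
    BEGIN;
    UPDATE ft_on_server_a SET val = val + 1 WHERE id = 1;
    UPDATE ft_on_server_b SET val = val + 1 WHERE id = 1;
    COMMIT;   -- should not return until phase 2 below has completed

    -- what the local server issues behind the scenes:
    PREPARE TRANSACTION 'fdw_xact_1234_a';   -- phase 1, on server A
    PREPARE TRANSACTION 'fdw_xact_1234_b';   -- phase 1, on server B
    COMMIT PREPARED 'fdw_xact_1234_a';       -- phase 2, on server A
    COMMIT PREPARED 'fdw_xact_1234_b';       -- phase 2, on server B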

Or are you and Sawada-san talking about the case where the user issued
PREPARE and not COMMIT?

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multivariate statistics (v19)

2016-10-04 Thread Dean Rasheed
On 4 October 2016 at 04:25, Michael Paquier  wrote:
> OK. A second thing was related to the use of schemas in the new system
> catalogs. As mentioned in [1], those could be removed.
> [1]: 
> https://www.postgresql.org/message-id/cab7npqtu40q5_nsghvomjfbyh1hdtqmbfdj+kwfjspam35b...@mail.gmail.com.
>

That doesn't work, because if the intention is to be able to one day
support statistics across multiple tables, you can't assume that the
statistics are in the same schema as the table.

In fact, if multi-table statistics are to be allowed in the future, I
think you want to move away from thinking of statistics as depending
on and referring to a single table, and handle them more like views --
i.e., store a pg_node_tree representing the from_clause and add
multiple dependencies at statistics creation time. That was what I was
getting at upthread when I suggested the alternate syntax, and also
answers Tomas' question about how JOIN might one day be supported.
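
As a purely illustrative sketch (neither statement below is meant to
reflect the patch's actual grammar), the difference is roughly:

    -- single-table statistics, tied to one table:
    CREATE STATISTICS s1 ON (a, b) FROM t1;

    -- hypothetical multi-table statistics, defined much like a view,
    -- with dependencies on every table in the FROM clause:
    CREATE STATISTICS s2 ON (t1.a, t2.b)
        FROM t1 JOIN t2 ON t1.id = t2.id;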

Of course, if we don't think that we will ever support multi-table
statistics, that all goes away, and you may as well make the
statistics name local to the table, but I think that's a bit limiting.
One way or the other, I think this is a question that needs to be
answered now. My vote is to leave expansion room to support
multi-table statistics in the future.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >