Re: [HACKERS] extend pgbench expressions with functions

2015-12-17 Thread Fabien COELHO


Hello Michael,

Thanks for your remarks.


+  double constants such as 3.14156,
You meant perhaps 3.14159 :)


Indeed!


+   max(i,
...)
+   integer
Such function declarations are better with optional arguments listed
within brackets.


Why not. I did it that way because this is the standard C syntax for 
varargs.



+  
+   double(i)
+   double
+   cast to double
+   double(5432)
+   5432.0
+  
+  
+   int(x)
+   integer
+   cast to int
+   int(5.4 + 3.8)
+   9
+  
If there are cast functions, why doesn't this patch introduce the
concept of a data type for a variable defined in a script?


Because this would be a pain, and it is really unnecessary as far as pgbench 
scripts are concerned: they really just need integers.


Both concepts are linked, and for now, as far as I can see, the value assigned 
by a \set is actually always an integer.


Yes.


As a consequence, sqrt's behavior is a bit surprising; for example this script:
\set aid sqrt(2.0)
SELECT :aid;
returns this:
SELECT 1;
Shouldn't a user expect 1.414 instead? Fabien, am I missing a trick?


There is no trick. There are only integer variables in pgbench, so 1.4 is 
rounded to 1 by the assignment.
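Below is a minimal illustrative sketch of that coercion, not the patch's actual
code; the PgBenchValue struct and the coerceToInt() name are assumptions made
for the example, and rounding via rint() merely matches the behavior described
above.

#include <math.h>
#include <stdint.h>

/* hypothetical tagged value, loosely mirroring an int/double expression result */
typedef struct PgBenchValue
{
    enum { PGBT_INT, PGBT_DOUBLE } type;
    int64_t ival;
    double  dval;
} PgBenchValue;

/* the \set target can only hold an integer, so a double result is coerced */
static int64_t
coerceToInt(const PgBenchValue *v)
{
    if (v->type == PGBT_INT)
        return v->ival;
    return (int64_t) rint(v->dval);    /* sqrt(2.0) = 1.414... becomes 1 */
}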



It seems to me that this patch would gain in clarity by focusing a bit
more on the infrastructure first and removing a couple of things that
are not mandatory for now...


This is more or less what I did in the beginning, and then someone said 
"please show a complete infra with various examples to show that the infra 
is really extensible". Now you say the reverse. This is *tiring* and is 
not a good use of the little time I can give.



So the following things are not necessary as of now:



- cast functions from/to int/double, because a result variable of a
\set does not include the idea that a result variable can be something
other than an integer. At least no option is given to the user to
make direct use of a double value.
- functions that return a double number: sqrt, pi
- min and max have value because they allow a user to specify the
expression once as a variable instead of writing it perhaps multiple
times in SQL; still, is that enough to justify having them as a set of
functions available by default? I am not really sure either.


AFAICR this is because I was *ASKED* to show an infra which would deal 
with various cases: varargs, double/int functions/operators, overloading, 
so I added some examples in each category, hence the current list of 
functions in the patch.



Really, I like this patch, and I think that it is great to be able to
use a set of functions and methods within a pgbench script because
this clearly can bring more clarity for a developer. Still, it seems
that this patch is trying to do too many things at the same time:
1) Add infrastructure to support function calls and refactor the
existing functions to use it. This is the FUNCTION stuff in the
expression scanner.
2) Add the concept of double return type, this could be an extension
of \set with a data type, or a new command like \setdouble. This
should include as well the casting routines I guess. This is the
DOUBLE stuff in the expression scanner.
3) Introduce new functions, as needed.


Yep. I started with (1), and was *ASKED* to do the others.

Adding double variables is pretty useless in my opinion, and would potentially 
lead to issues in various places, so I'm not in a hurry to do that.



1) and 2) seem worthwhile to me. 3) may depend on the use cases we'd
like to have. sqrt, for example, has value if a variable can be set
as a double type.


Sure. This is just an example of a double function so that if someone 
wants to add another one they can just update where it makes sense. Maybe 
log/exp would make more sense for pgbench.



In conclusion, for this CF the patch doing the documentation fixes is
"Ready for committer"; the second patch, introducing the function
infrastructure, should focus more on its core structure and should be
marked as "Returned with feedback". Opinions are welcome.


My opinion is that doing and undoing work because of random thoughts on the 
list is tiring.


What I can do is:

(1) fix the doc and bugs if any, obviously.

(2a) remove double stuff, just keep integer functions.
 I would rather keep min/max, though.

(2b) keep the patch with both int & double as is.

What I will *NOT* do is to add double variables without a convincing use 
case.


--
Fabien.




[HACKERS] A question regarding LWLock in ProcSleep

2015-12-17 Thread Kenan Yao
Hi there,

In function ProcSleep, after the process has been woken up, either with
the lock granted or a deadlock detected, it re-acquires the lock table's
partition LWLock.

The code snippet in question is:

/*
 * Re-acquire the lock table's partition lock.  We have to do this to hold
 * off cancel/die interrupts before we can mess with lockAwaited (else we
 * might have a missed or duplicated locallock update).
 */
LWLockAcquire(partitionLock, LW_EXCLUSIVE);

/*
 * We no longer want LockErrorCleanup to do anything.
 */
lockAwaited = NULL;

/*
 * If we got the lock, be sure to remember it in the locallock table.
 */
if (MyProc->waitStatus == STATUS_OK)
GrantAwaitedLock();

/*
 * We don't have to do anything else, because the awaker did all the
 * necessary update of the lock table and MyProc.
 */
return MyProc->waitStatus;

Questions are:

(1) The comment says that "we might have a missed or duplicated locallock
update"; in what cases would we hit this without holding the LWLock?

(2) The comment says "we have to do this to hold off cancel/die
interrupts", then:

   - why use LWLockAcquire instead of HOLD_INTERRUPTS directly?
   - From the handlers of SIGINT and SIGTERM, it seems nothing serious would be
   processed here, since no CHECK_FOR_INTERRUPTS is called before releasing
   this LWLock. Why should we hold off cancel/die interrupts here?

(3) Before releasing this LWLock, the only shared memory access is
MyProc->waitStatus; since the process has been granted the lock or removed
from the lock's waiting list because of deadlock, is it possible that some
other processes would access this field? If not, then why do we need the
LWLock here? What does this lock protect?

Apologies if I missed anything here; I'd appreciate it if someone could help me
with these questions. Thanks!

Cheers,
Kenan


Re: [HACKERS] Making tab-complete.c easier to maintain

2015-12-17 Thread Thomas Munro
On Thu, Dec 17, 2015 at 7:24 PM, Michael Paquier
 wrote:
> On Thu, Dec 17, 2015 at 3:06 PM, Kyotaro HORIGUCHI
>  wrote:
>> At Mon, 14 Dec 2015 14:13:02 +0900, Michael Paquier 
>>  wrote in 
>> 
>>> On Mon, Dec 14, 2015 at 8:10 AM, Thomas Munro 
>>>  wrote:
>>> Yeah, I guess that's an improvement for those cases, and there is no
>>> immediate need for a per-keyword NOT operator in our cases to allow
>>> things of the type (foo1 OR NOT foo2). Still, in the case of this
>>> patch !foo1|foo2 actually means (NOT foo1 AND NOT foo2). That does not
>>> seem very intuitive. Reading this expression straight, it seems that
>>> !foo1|foo2 actually means (NOT foo1 OR foo2) because of the lack of
>>> parentheses. Thoughts?
>>
>> I used just the same expression as Thomas in my patch since it
>> was intuitive enough in this context, in my view. The expression
>> "(!FOO1)|FOO2" is nonsense in the context of tab completion and
>> won't be used in the future. But it is true that "!(FOO|BAR|BAZ)" is
>> clearer than the version without the parentheses.
>
> Yeah that's my whole point. If we decide to support that I think that
> we had better go all the way through it, with stuff like:
> - & for AND operator
> - Support of parenthesis to prioritize expressions
> - ! for NOT operator
> - | for OR operator
> But honestly at this stage this would just benefit 5 places (citing
> Thomas' quote), hence let's just go with the simplest approach, which
> is the one without all this fancy operator stuff... That will be less
> bug-prone, and still benefits more than 95% of the tab completion code
> paths. At least I think that's the most realistic thing to do if we
> want to get something for this commit fest. If someone wants those
> operators, well I guess that they could add them afterwards. Thomas,
> what do you think?

I agree with both of you that using "!" without parentheses is not
ideal.  I also don't think it's worth trying to make a clever language
here: in future it will be a worthy and difficult project to do
something far cleverer based on the productions of the real grammar as
several people have said.  (Presumably with some extra annotations to
enable/disable productions and mark out points where database object
names are looked up?).  In the meantime, I think we should just do the
simplest thing that will replace the repetitive strcasecmp-based code
with something readable/writable, and avoid the
fully-general-pattern-language rabbit hole.

Kyotaro's suggestion of using a macro NEG x to avoid complicating the
string constants seems good to me.  But perhaps like this?

  TailMatches4("COMMENT", "ON", MatchAny, MatchAnyExcept("IS"))

See attached patch which does it that way.
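For readers who have not opened the attachment, here is a rough sketch of how a
single completion word could be matched against such a pattern; the MatchAny
and MatchAnyExcept() conventions below are assumptions made for illustration,
not necessarily the patch's actual definitions.

#include <stdbool.h>
#include <string.h>
#include <strings.h>            /* strcasecmp */

#define MatchAny            NULL        /* wildcard: match any word */
#define MatchAnyExcept(s)   ("!" s)     /* match anything except s */

static bool
word_matches(const char *pattern, const char *word)
{
    if (pattern == MatchAny)
        return true;                                 /* wildcard matches everything */
    if (pattern[0] == '!')
        return strcasecmp(pattern + 1, word) != 0;   /* negated literal */
    return strcasecmp(pattern, word) == 0;           /* plain keyword comparison */
}

A TailMatches4("COMMENT", "ON", MatchAny, MatchAnyExcept("IS")) test would then
simply call word_matches() on each of the last four words.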

>> In addition to that, I feel that successive "MatchAny"s are a bit
>> bothersome.
>>
>>>  TailMatches6("COMMENT", "ON", MatchAny, MatchAny, MatchAny, MatchAny)) &&
>>>  !TailMatches1("IS")
>>
>> Is MatchAnyN acceptable? One concern is that the two N's
>> (TailMatchesN and MatchAnyN) look a bit confusing.
>>
>>>  TailMatches4("COMMENT", "ON", MatchAny3, "!IS")
>
> Ah, OK, so you would want to be able to have an inner list, MatchAnyN
> meaning actually a list of MatchAny items repeated N times. I am not
> sure that's worth it. I would just keep it simple; we are only
> discussing a couple of places that would benefit from that.

+1 for simple.

-- 
Thomas Munro
http://www.enterprisedb.com


tab-complete-macrology-v11.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Getting sorted data from foreign server for merge join

2015-12-17 Thread Ashutosh Bapat
On Wed, Dec 9, 2015 at 12:14 AM, Robert Haas  wrote:

> On Wed, Dec 2, 2015 at 6:45 AM, Rushabh Lathia 
> wrote:
> > Thanks Ashutosh.
> >
> > Re-reviewed and Re-verified the patch, pg_sort_all_pd_v5.patch
> > looks good to me.
>
> This patch needs a rebase.
>

Done.


>
> It's not going to work to say this is a patch proposed for commit when
> it's still got a TODO comment in it that obviously needs to be
> changed.   And the formatting of that long comment is pretty weird,
> too, and not consistent with other functions in that same file (e.g.
> get_remote_estimate, ec_member_matches_foreign, create_cursor).
>
>
The TODO was present in v4 but not in v5, and is not present in the v6 attached
here. I formatted the comment according to estimate_path_cost_size() and
convert_prep_stmt_params().


> Aside from that, I think before we commit this, somebody should do
> some testing that demonstrates that this is actually a good idea.  Not
> as part of the test case set for this patch, but just in general.
> Merge joins are typically going to be relevant for large tables, but
> the examples in the regression tests are necessarily tiny.  I'd like
> to see some sample data and some sample queries that get appreciably
> faster with this code.  If we can't find any, we don't need the code.
>
>
I tested the patch on my laptop with two types of queries: a join between
two foreign tables on different foreign servers (pointing back to the same
server) and a join between one foreign and one local table. The foreign
tables and servers are created using the attached sort_pd_setup.sql. The
foreign tables pointed to a table with an index useful for the join clause.
Both joining tables had 10M rows. The execution time of each query was measured
over 100 runs, and the average and standard deviation were calculated (using
the function query_execution_stats() in the script sort_pd.sql); they are
presented below.

1. Query between foreign tables
SELECT ft1.val, ft2.val FROM ft1 join ft2 on (ft1.val = ft2.val)

Plan and timings without patch
EXPLAIN (VERBOSE, ANALYSE) :query ;
 QUERY
PLAN
-
 Hash Join  (cost=508510.02..1129945.94 rows=95 width=8) (actual
time=33803.826..82416.342 rows=1000 loops=1)
   Output: ft1.val, ft2.val
   Hash Cond: (ft1.val = ft2.val)
   ->  Foreign Scan on public.ft1  (cost=100.00..344347.31 rows=977
width=4) (actual time=0.624..28531.803 rows=1000 loops=1)
 Output: ft1.val
 Remote SQL: SELECT val FROM public.lt
   ->  Hash  (cost=344347.31..344347.31 rows=977 width=4) (actual
time=33258.025..33258.025 rows=1000 loops=1)
 Output: ft2.val
 Buckets: 131072  Batches: 256  Memory Usage: 2400kB
 ->  Foreign Scan on public.ft2  (cost=100.00..344347.31
rows=977 width=4) (actual time=22.171..28134.970 rows=1000 loops=1)
   Output: ft2.val
   Remote SQL: SELECT val FROM public.lt
 Planning time: 33.155 ms
 Execution time: 82914.607 ms
(14 rows)

 avg_exe_time | std_dev_exe_time | min_exe_time | max_exe_time
--------------+------------------+--------------+--------------
  78750.95487 | 2911.51825687913 |    74314.886 |    89358.464

Plan and timing with patch
EXPLAIN (VERBOSE, ANALYSE) :query ;
 QUERY
PLAN
-
 Merge Join  (cost=200.86..1183070.86 rows=1000 width=8) (actual
time=1.776..73140.219 rows=1000 loops=1)
   Output: ft1.val, ft2.val
   Merge Cond: (ft1.val = ft2.val)
   ->  Foreign Scan on public.ft1  (cost=100.43..504035.43 rows=1000
width=4) (actual time=0.937..30422.457 rows=1000 loops=1)
 Output: ft1.val, ft1.val2
 Remote SQL: SELECT val FROM public.lt ORDER BY val ASC
   ->  Materialize  (cost=100.43..529035.43 rows=1000 width=4) (actual
time=0.826..33448.822 rows=1000 loops=1)
 Output: ft2.val, ft2.val2
 ->  Foreign Scan on public.ft2  (cost=100.43..504035.43
rows=1000 width=4) (actual time=0.818..31035.362 rows=1000 loops=1)
   Output: ft2.val, ft2.val2
   Remote SQL: SELECT val FROM public.lt ORDER BY val ASC
 Planning time: 163.161 ms
 Execution time: 73654.106 ms
(13 rows)

 avg_exe_time | std_dev_exe_time | min_exe_time | max_exe_time
--------------+------------------+--------------+--------------
  71881.15916 | 819.091605498189 |    70197.312 |    74653.314

It can be observed that with the patch, the merge join strategy is used
instead of hash join and the execution time is reduced by approximately 9%. A
desirable side effect is that the deviation in execution time is reduced
heavily (by almost 75%).

2. Join between local and 

Re: [HACKERS] On-demand running query plans using auto_explain and signals

2015-12-17 Thread Shulgin, Oleksandr
On Wed, Dec 16, 2015 at 8:39 PM, Tomas Vondra 
wrote:

> Hi,
>
> On 12/01/2015 10:34 AM, Shulgin, Oleksandr wrote:
>
>>
>> I have the plans to make something from this on top of
>> pg_stat_statements and auto_explain, as I've mentioned last time.  The
>> next iteration will be based on the two latest patches above, so it
>> still makes sense to review them.
>>
>> As for EXPLAIN ANALYZE support, it will require changes to core, but
>> this can be done separately.
>>
>
> I'm re-reading the thread, and I have to admit I'm utterly confused about what
> the current plan is, what features it's supposed to provide and whether it
> will solve the use case I'm most interested in. Oleksandr, could you please
> post a summary explaining that?
>
> My use case for this functionality is debugging of long-running queries,
> particularly getting EXPLAIN ANALYZE for them. In such cases I can't
> run the EXPLAIN ANALYZE manually because it will either run forever (just
> like the query) or may not be the same (e.g. due to a recent ANALYZE). So we
> need to extract the data from the process executing the query.
>
> I'm not essentially opposed to doing this in an extension (better an
> extension than nothing), but I don't quite see how you could to do that
> from pg_stat_statements or auto_explain. AFAIK both extensions merely use
> hooks before/after the executor, and therefore can't do anything in between
> (while the query is actually running).
>
> Perhaps you don't intend to solve this particular use case? Which use
> cases are you aiming to solve, then? Could you explain?
>

Hi Tomas.

Thanks for your interest in this patch!

My motivation is the same as your use case: given a long-running query, to be
able to look inside this exact query run by this exact backend.

I admit the evolution of ideas in this patch can be very confusing: we were
trying a number of different approaches, without thinking deeply about the
implications, just to have a proof of concept.

> Maybe all we need to do is add another hook somewhere in the executor, and
> push the explain analyze into pg_stat_statements once in a while, entirely
> eliminating the need for inter-process communication (signals, queues, ...).
>
> I'm pretty sure we don't need this for "short" queries, because in those
> cases we have other means to get the explain analyze (e.g. running the
> query again or auto_explain). So I can imagine having a rather high
> threshold (say, 60 seconds or even more), and we'd only push the explain
> analyze after crossing it. And then we'd only update it once in a while -
> say, again every 60 seconds.
>
> Of course, this might be configurable by two GUCs:
>
>pg_stat_statements.explain_analyze_threshold = 60  # -1 is "off"
>pg_stat_statements.explain_analyze_refresh = 60
>
> FWIW I'd still prefer having "EXPLAIN ANALYZE" in core, but better this
> than nothing.
>

Yes, this is how the pg_stat_statements idea comes into play: even if we
implement support for online EXPLAIN ANALYZE, enabling it for every query
is overkill, but if you don't enable it from the start of the query, there is
no way to enable it later on once the query has already progressed.  So in
order to know for which queries it makes sense to enable this feature, we
need to know the query's expected run time; thus pg_stat_statements seems
like a natural place to obtain this information and/or trigger the
behavior.
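To make that concrete, here is a tiny sketch of the kind of trigger logic being
discussed, assuming two hypothetical GUC-style settings modeled on the
suggestion quoted above; neither setting exists in pg_stat_statements today.

#include <stdbool.h>

static int explain_analyze_threshold = 60;   /* seconds; -1 means "off" */
static int explain_analyze_refresh   = 60;   /* seconds between report refreshes */

/* decide whether the running backend should publish a fresh plan report */
static bool
should_publish_plan(double elapsed_sec, double since_last_publish_sec)
{
    if (explain_analyze_threshold < 0)
        return false;                               /* feature disabled */
    if (elapsed_sec < explain_analyze_threshold)
        return false;                               /* query is still "short" */
    return since_last_publish_sec >= explain_analyze_refresh;
}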

I'm also all for simplification of the underlying communication mechanism:
shared memory queues are neat, but not necessarily the best way to handle
it.  As for the use of signals: I believe this was a red herring; it will
make the code much less fragile if the progressing backend itself publishes
intermediate EXPLAIN ANALYZE reports for other backends to read.

The EXPLAIN (w/o ANALYZE) part we can do completely as an extension: no core
support required.  Enabling ANALYZE will require a little hacking around the
Instrumentation methods; otherwise the Explain functions just crash when run
in the middle of the query.

Hope that makes it clear.

--
Alex


Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2015-12-17 Thread Shulgin, Oleksandr
On Tue, Dec 15, 2015 at 11:30 PM, Tomas Vondra  wrote:

>
> Attached is a spreadsheet with results for various work_mem values, and
> also with a smaller data set (just 30M rows in the fact table), which
> easily fits into memory. Yet it shows similar gains, shaving off ~40% in
> the best case, suggesting that this is not just thanks to reduction of I/O
> when forcing the temp files to disk.


A neat idea!  Have you possibly tried to also collect statistics about
actual false-positive rates and filter allocation sizes for each of the
collected data points?

--
Alex


Re: [HACKERS] extend pgbench expressions with functions

2015-12-17 Thread Fabien COELHO



(2a) remove double stuff, just keep integer functions.
 I would rather keep min/max, though.


(2a) sounds like a fine plan to get something committable. We could keep 
min/max/abs, and remove sqrt/pi. What's actually the use case for debug? 
I cannot wrap my mind around one.


It was definitely useful for debugging the double/int type stuff within 
expressions when writing a non-trivial pgbench script. It is probably less 
interesting if there are only integers.



(2b) keep the patch with both int & double as is.


Functions returning double are not that useful... pgbench can live 
without them.


Sure.

--
Fabien.




Re: [HACKERS] parallel joins, and better parallel explain

2015-12-17 Thread Dilip Kumar
On Wed, Dec 17, 2015 at 11:03 AM Amit Kapila 
wrote:

> While looking at the plans of Q5 and Q7, I have observed that a Gather is
> pushed below another Gather node, which we don't have an appropriate
> way of dealing with.  I think that could be the reason why you are seeing
> the errors.

Ok

> Also, I think it would be good if you could check the plan/execution
> time with max_parallel_degree=0 once, as that can give us base reference
> data without parallelism. Also, I am wondering whether you have changed
> any other parallel cost related parameter?

Oops. Earlier I had changed the parallel_tuple_cost parameter to 0.01; now I
have changed it back to the default value 0.1 and measured performance again,
with max_parallel_degree=0 and max_parallel_degree=4.

Note: last time I used scale factor 1 for generating the TPC-H data (./dbgen -v
-s 1), but with the default value of parallel_tuple_cost it was not selecting
a parallel join, so I have taken the results with scale factor 5
(./dbgen -v -s 5).

Below are the latest performance data.

1. TPC-H Q2:
max_parallel_degree=0
  Planning time: 2.321 ms
  Execution time: 829.817 ms

max_parallel_degree=4
  Planning time: 2.530 ms
  Execution time: 803.428 ms

2. TPC-H Q5:
max_parallel_degree=0
  Planning time: 1.938 ms
  Execution time: 1062.419 ms

max_parallel_degree=4
  Planning time: 2.950 ms
  Execution time: 487.461 ms

3. TPC-H Q7:
max_parallel_degree=0
  Planning time: 2.515 ms
  Execution time: 1651.763 ms

max_parallel_degree=4
  Planning time: 2.379 ms
  Execution time: 2107.863 ms

Plans for max_parallel_degree=0 and max_parallel_degree=4 are attached to
the mail with file names q*_base.out and q*_parallel.out respectively.

For Q3 it is not selecting a parallel plan.


Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Thu, Dec 17, 2015 at 11:03 AM, Amit Kapila 
wrote:

> On Wed, Dec 16, 2015 at 9:55 PM, Dilip Kumar 
> wrote:
>
>> On Wed, Dec 16, 2015 at 6:20 PM Amit Kapila 
>> wrote:
>>
>> >On Tue, Dec 15, 2015 at 7:31 PM, Robert Haas 
>> wrote:
>> >>
>> >> On Mon, Dec 14, 2015 at 8:38 AM, Amit Kapila 
>> wrote:
>>
>> > In any case,
>> >I have done some more investigation of the patch and found that even
>> >without changing query planner related parameters, it seems to give
>> >bad plans (as in example below [1]).  I think here the costing of rework
>> each
>>
>> I have done some more testing using the TPC-H benchmark (for some of the
>> queries, especially for Parallel Hash Join), and the results summary is as below.
>>
>>
>> Planning Time (ms)
>> Query      Base    Patch
>> TPC-H Q2   2.2     2.4
>> TPC-H Q3   0.67    0.71
>> TPC-H Q5   3.17    2.3
>> TPC-H Q7   2.43    2.4
>>
>> Execution Time (ms)
>> Query      Base    Patch
>> TPC-H Q2   2826    766
>> TPC-H Q3   23473   24271
>> TPC-H Q5   21357   1432
>> TPC-H Q7   6779    1138
>> All test files and detailed plan output are attached to the mail.
>> q2.sql, q3.sql, q5.sql and q7.sql are the TPC-H benchmark's 2nd, 3rd, 5th and
>> 7th queries,
>> and results with base and parallel join are attached as q*_base.out and
>> q*_parallel.out respectively.
>>
>> Summary: With the TPC-H queries, wherever a Hash Join is pushed under a Gather
>> node, a significant improvement is visible;
>> with Q2, using 3 workers, the time consumed is almost 1/3 of the base.
>>
>>
>> I observed one problem: with Q5 and Q7, some relation and snapshot
>> references are leaked and I am getting the warning below; I haven't yet looked
>> into the issue.
>>
>>
> While looking at the plans of Q5 and Q7, I have observed that a Gather is
> pushed below another Gather node, which we don't have an appropriate
> way of dealing with.  I think that could be the reason why you are seeing
> the errors.
>
> Also, I think it would be good if you could check the plan/execution
> time with max_parallel_degree=0 once, as that can give us base reference
> data without parallelism. Also, I am wondering whether you have changed
> any other parallel cost related parameter?
>
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


q2_base.out
Description: Binary data


q2_parallel.out
Description: Binary data


q5_base.out
Description: Binary data


q5_parallel.out
Description: Binary data


q7_base.out
Description: Binary data


q7_parallel.out
Description: Binary data



Re: [HACKERS] pg_tables bug?

2015-12-17 Thread Tom Lane
Gaetano Mendola  writes:
> I'm playing around with tablespaces (PostgreSQL 9.4) and I found what I
> believe is a bug in pg_tables.
> Basically, if you create a database in a tablespace X and then you create a
> table in the database, the table is created correctly in the tablespace X (I
> did a check on the filesystem); however, if you do a select on pg_tables
> the column tablespace for that table is empty, and even worse, if you dump
> the DB there is no report of the database or table being on that
> tablespace.
> Even \d doesn't report that the table is in the tablespace X.

An empty entry in that column means that the table is in the default
tablespace for the database.  Which it sounds like is what you have
here.  I think it's operating as designed, though you might quibble
with the decision that showing default tablespaces explicitly would
have been clutter.

regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2015-12-17 Thread David Rowley
On 17 December 2015 at 19:11, Simon Riggs  wrote:

> On 17 December 2015 at 00:17, Tomas Vondra 
> wrote:
>
>> I'd go with match_first_tuple_only.
>
>
> +1
>
> unique_inner is a state that has been detected, match_first_tuple_only is
> the action we take as a result.
>
>
Ok great. I've made it so in the attached. This means the comment in the
join code where we perform the skip can be a bit less verbose and all the
details can go in where we're actually setting the match_first_tuple_only
to true.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_18-12-2015_843fb71.patch
Description: Binary data



Re: [HACKERS] Clarify vacuum verbose message

2015-12-17 Thread Robert Haas
On Tue, Dec 15, 2015 at 9:19 PM, Jim Nasby  wrote:
> VACUUM VERBOSE spits out two different messages for the heap, one of which
> is rather confusing:
>
> INFO:  "trades": removed 625664 row versions in 20967 pages
> ...
> INFO:  "trades": found 3282 removable, 56891627 nonremovable row versions in
> 1986034 out of 1986034 pages
>
> After discussion with RhodiumToad I think I now understand how this can
> happen:
>
> 20:00 < RhodiumToad> the LP_DEAD slot is where the index entries for the
> deleted row point to, so that has to stay
> 20:01 < RhodiumToad> so for example, if you delete a lot of rows, then try
> and do a lot of updates (which will hint the
>  pages as needing pruning),
> 20:01 < RhodiumToad> then do more updates or a seqscan (to let prune look at
> the pages),
> 20:02 < RhodiumToad> then do a vacuum, the vacuum will see a lot of LP_DEAD
> slots to remove index entries for, but not
>  actual tuples
>
> This example is from a table that was VACUUM FULL'd this weekend and had a
> nightly batch process run last night. That process INSERTs a bunch of rows
> and then does a bunch of UPDATEs on different subsets of those rows. I don't
> believe there would have been a large amount of deletes; I'll check with
> them tomorrow.
>
> IMHO we need to change the messages so they are explicit about line pointers
> vs actual tuples. Trying to obfuscate that just leads to confusion.
> heap_page_prune needs to report only non-rootlp tuples that were pruned.
> (None of the other callers care about the return value.)

Yeah, I've had the thought before that this reporting could be
clearer.  I think it never really got revised when 8.3 invented
HOT.  It might be about time for that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Robert Haas
On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund  wrote:
> On 2015-12-11 17:00:01 +0300, Aleksander Alekseev wrote:
>> The problem is that code between LWLockAcquire (lock.c:881) and
>> LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using
>> old-good gettimeofday and logging method I managed to find a bottleneck:
>>
>> -- proclock = SetupLockInTable [lock.c:892]
>>  `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105]
>>`-- currBucket = get_hash_entry(hashp) [dynahash.c:985]
>>  `-- SpinLockAcquire(&hctl->mutex) [dynahash.c:1187]
>>
>> If my measurements are not wrong (I didn't place gettimeofday between
>> SpinLockAcquire/SpinLockRelease, etc.) we sometimes spend about 3 ms here
>> waiting for a spinlock, which doesn't seem right.
>
> Well, it's quite possible that your process was scheduled out, while
> waiting for that spinlock. Which'd make 3ms pretty normal.
>
> I'd consider using a LWLock instead of a spinlock here. I've seen this
> contended in a bunch of situations, and the queued behaviour, combined
> with directed wakeups on the OS level, ought to improve the worst case
> behaviour measurably.

Amit had the idea a while back of trying to replace the HASHHDR mutex
with something based on atomic ops.  It seems hard to avoid the
attendant A-B-A problems but maybe there's a way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2015-12-17 Thread Tomas Vondra

On 12/17/2015 11:44 AM, Simon Riggs wrote:


My understanding is that the bloom filter would be ineffective in any of
these cases
* Hash table is too small


Yes, although it depends on what you mean by "too small".

Essentially if we can do with a single batch, then it's cheaper to do a 
single lookup in the hash table instead of multiple lookups in the bloom 
filter. The bloom filter might still win if it fits into L3 cache, but 
that seems rather unlikely.



* Bloom filter too large


Too large with respect to what?

One obvious problem is that the bloom filter is built for all batches at 
once, i.e. for all tuples, so it may be so big that it won't fit into work_mem 
(or takes up a significant part of it). Currently it's not accounted for, 
but that'll need to change.



* Bloom selectivity > 50% - perhaps that can be applied dynamically,
so stop using it if it becomes ineffective


Yes. I think doing some preliminary selectivity estimation should not be 
difficult - that's pretty much what calc_joinrel_size_estimate() already 
does.


Doing that dynamically is also possible, but quite tricky. Imagine 
for example that the outer relation is sorted - in that case we may get long 
sequences of the same value (hash), and all of them will either have a 
match in the inner relation, or not have a match. That may easily skew 
the counters used for disabling the bloom filter dynamically.
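As a tiny sketch of what such dynamic disabling could look like (the counter
names and the 50% cutoff below are assumptions for illustration, echoing the
"selectivity > 50%" bullet above; this is not code from the patch):

#include <stdbool.h>

/* keep using the filter only while it eliminates a useful share of probes */
static bool
bloom_still_useful(long lookups, long eliminated)
{
    if (lookups < 1000)
        return true;            /* too little evidence to judge yet */
    return (double) eliminated / (double) lookups >= 0.5;
}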


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Aleksander Alekseev
> I'd really like to see it being replaced by a queuing lock
> (i.e. lwlock) before we go there. And then maybe partition the
> freelist, and make nentries an atomic.

I believe I just implemented something like this (see attachment). The
idea is to partition the PROCLOCK hash table manually into NUM_LOCK_PARTITIONS
smaller, non-partitioned hash tables. Since these tables are non-partitioned,
the spinlock is not used and there is no lock contention.

On a 60-core server we gain 3.5-4 more TPS according to the benchmark
described above. As I understand it, there is no performance degradation in
other cases (different CPUs, traditional pgbench, etc.).

If this patch seems to be OK, I believe we could consider applying the
same change not only to the PROCLOCK hash table.

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 76fc615..1a86f86 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -249,15 +249,56 @@ static volatile FastPathStrongRelationLockData *FastPathStrongRelationLocks;
  * shared memory; LockMethodLocalHash is local to each backend.
  */
 static HTAB *LockMethodLockHash;
-static HTAB *LockMethodProcLockHash;
+static HTAB *LockMethodProcLockHashPartitions[NUM_LOCK_PARTITIONS];
 static HTAB *LockMethodLocalHash;
 
+/*
+ * LockMethodProcLockHash hash table is partitioned into NUM_LOCK_PARTITIONS
+ * usual (non-partitioned, i.e. without HASH_PARTITION flag) hash tables. The
+ * reason for partitioning LockMethodProcLockHash manually instead of just
+ * using single hash table with HASH_PARTITION flag is following. If hash
+ * table has HASH_PARTITION flag it uses a single spinlock to safely
+ * access some of its fields. Turned out in case of this particular hash
+ * table it causes a considerable performance degradation because of lock
+ * contention on servers with tens of cores.
+ */
+#define LockMethodProcLockHash(hashcode) \
+	(LockMethodProcLockHashPartitions[LockHashPartition(hashcode)])
+
+/*
+ * Each partition of LockMethodProcLockHash hash table has an unique name
+ * generated from this template using snprintf.
+ */
+static const char ProcLockHashNameTemplate[] = "PROCLOCK hash, partition %d";
+
+/*
+ * Size of buffer required to store LockMethodProcLockHash partition name.
+ *
+ * 10 is number of digits in 2^32. 2 is length of "%d" string.
+ */
+#define PROC_LOCK_HASH_NAME_BUFF_SIZE (sizeof(ProcLockHashNameTemplate)+10-2)
 
 /* private state for error cleanup */
 static LOCALLOCK *StrongLockInProgress;
 static LOCALLOCK *awaitedLock;
 static ResourceOwner awaitedOwner;
 
+/*
+ * Calculate total number of entries in LockMethodProcLockHash hash table.
+ *
+ * Caller is responsible for having acquired appropriate LWLocks.
+ */
+static long
+GetProcLockHashNumEntries()
+{
+	int			i;
+	long		sum = 0;
+
+	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+		sum += hash_get_num_entries(LockMethodProcLockHashPartitions[i]);
+
+	return sum;
+}
 
 #ifdef LOCK_DEBUG
 
@@ -373,16 +414,16 @@ void
 InitLocks(void)
 {
 	HASHCTL		info;
-	long		init_table_size,
-max_table_size;
+	long		shmem_table_size;
 	bool		found;
+	int			i;
+	char		proclock_hash_name[PROC_LOCK_HASH_NAME_BUFF_SIZE];
 
 	/*
 	 * Compute init/max size to request for lock hashtables.  Note these
 	 * calculations must agree with LockShmemSize!
 	 */
-	max_table_size = NLOCKENTS();
-	init_table_size = max_table_size / 2;
+	shmem_table_size = NLOCKENTS();
 
 	/*
 	 * Allocate hash table for LOCK structs.  This stores per-locked-object
@@ -394,14 +435,13 @@ InitLocks(void)
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
 	LockMethodLockHash = ShmemInitHash("LOCK hash",
-	   init_table_size,
-	   max_table_size,
+	   shmem_table_size,
+	   shmem_table_size,
	   &info,
 	HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
 
 	/* Assume an average of 2 holders per lock */
-	max_table_size *= 2;
-	init_table_size *= 2;
+	shmem_table_size = (shmem_table_size * 2) / NUM_LOCK_PARTITIONS;
 
 	/*
 	 * Allocate hash table for PROCLOCK structs.  This stores
@@ -412,11 +452,17 @@ InitLocks(void)
 	info.hash = proclock_hash;
 	info.num_partitions = NUM_LOCK_PARTITIONS;
 
-	LockMethodProcLockHash = ShmemInitHash("PROCLOCK hash",
-		   init_table_size,
-		   max_table_size,
-		   &info,
- HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+	{
+		snprintf(proclock_hash_name, sizeof(proclock_hash_name),
+ ProcLockHashNameTemplate, i + 1);
+
+		LockMethodProcLockHashPartitions[i] = ShmemInitHash(proclock_hash_name,
+			shmem_table_size,
+			shmem_table_size,
+			&info,
+  HASH_ELEM | HASH_FUNCTION);
+	}
 
 	/*
 	 * Allocate fast-path structures.
@@ -943,7 +989,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
proclock_hashcode = ProcLockHashCode(&proclock->tag, hashcode);
SHMQueueDelete(&proclock->lockLink);
SHMQueueDelete(&proclock->procLink);
-if 

Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2015-12-17 Thread Tomas Vondra

Hi,

On 12/17/2015 10:50 AM, Shulgin, Oleksandr wrote:

On Tue, Dec 15, 2015 at 11:30 PM, Tomas Vondra
> wrote:


Attached is a spreadsheet with results for various work_mem
values, and also with a smaller data set (just 30M rows in the fact
table), which easily fits into memory. Yet it shows similar gains,
shaving off ~40% in the best case, suggesting that this is not just
thanks to reduction of I/O when forcing the temp files to disk.


A neat idea! Have you possibly tried to also collect statistics
about actual false-positive rates and filter allocation sizes in
every of the collected data points?


The patch counts and prints the total number of lookups, and the number 
of eliminated rows. The bloom filter is currently sized for a 5% false 
positive rate, and the numbers I've seen match that.
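For reference, a minimal sketch of the standard sizing formulas behind such a
filter (m = -n*ln(p)/(ln 2)^2 bits for n entries at false-positive rate p, and
k = (m/n)*ln 2 hash functions); the function names are illustrative, not taken
from the patch:

#include <math.h>

/* number of filter bits for nentries keys at the requested false-positive rate */
static long
bloom_filter_nbits(long nentries, double fp_rate)
{
    return (long) ceil(-(double) nentries * log(fp_rate) / (M_LN2 * M_LN2));
}

/* number of hash functions that minimizes the false-positive rate */
static int
bloom_filter_nhashes(long nbits, long nentries)
{
    return (int) round((double) nbits / (double) nentries * M_LN2);
}

For a 5% target this works out to roughly 6.2 bits and 4 hash functions per key.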


I think ultimately we'll need to measure the false positive rate, so 
that we can use it to dynamically disable the bloom filter if it gets 
inefficient. Also maybe put some of that into EXPLAIN ANALYZE.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Using a single standalone-backend run in initdb (was Re: [HACKERS] Bootstrap DATA is a pita)

2015-12-17 Thread Tom Lane
Mark Dilger  writes:
> I use CREATE RULE within startup files in the fork that I maintain.  I have
> lots of them, totaling perhaps 50k lines of rule code.  I don't think any of 
> that
> code would have a problem with the double-newline separation you propose,
> which seems a more elegant solution to me.  Admittedly, the double-newline
> separation would need to be documented at the top of each sql file, otherwise
> it would be quite surprising to those unfamiliar with it.

> You mentioned upthread that introducing a syntax error in one of these files
> results in a not-so-helpful error message that dumps the contents of the
> entire file.  I can confirm that happens, and is hardly useful.

Not having heard any ideas that sounded better than semi-newline-newline,
I went ahead and finished up this patch on that basis.  Attached are two
patch files; the first one redefines the behavior of -j, and the second
one modifies initdb to use only one standalone-backend run.  I present
them this way to emphasize that the -j change doesn't break much of
anything: initdb still works if you apply only the first patch.  And
I didn't change anything in the initdb input files, except for adding
the comment documentation Mark suggests above.

In passing in the first patch, I got rid of the TCOP_DONTUSENEWLINE
#define, which could not have been used by anyone in a very long time
because it would break initdb.  I then realized that
src/include/tcop/tcopdebug.h is completely dead code, because the
other debugging symbol it claims to specify got ripped out long ago.
And to add insult to injury, that file is included noplace.  I imagine
Bruce's pgrminclude tool got rid of the inclusion that must once have
existed in postgres.c, after observing that postgres.c still compiled
without it :-(.  (That tool really requires more adult supervision
than it has gotten.) So anyway, this patch removes tcopdebug.h entirely.

The second patch consists of removing extra backend starts/stops
and converting all of initdb's code to run in -j mode, rather than the
mishmash of -j and not-j behavior that was there before.  I changed
all the semicolon-newlines in initdb's command strings to
semicolon-newline-newlines.  As mentioned before, only a small number of
those changes *had* to be made to get it to work, namely the ones around
VACUUM and CREATE DATABASE statements, but I felt that for consistency
and error localization all of them should be changed.  I also failed to
resist the temptation to const-ify some of the arrays more thoroughly.
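As a side note for readers, here is a minimal sketch of what reading one command
under the revised -j rule could look like, treating ";\n\n" (or EOF) as the
terminator; this is illustrative only and simplifies buffer handling, it is not
the committed code:

#include <stdbool.h>
#include <stdio.h>

/* read characters until ";\n\n" or EOF; returns false if nothing was read */
static bool
read_one_command(FILE *in, char *buf, size_t bufsize)
{
    size_t  len = 0;
    int     c;

    while (len < bufsize - 1 && (c = getc(in)) != EOF)
    {
        buf[len++] = (char) c;
        if (len >= 3 && buf[len - 3] == ';' &&
            buf[len - 2] == '\n' && buf[len - 1] == '\n')
            break;                  /* semicolon-newline-newline terminator */
    }
    buf[len] = '\0';
    return len > 0;
}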

Barring objections I'll push this into HEAD soon.

regards, tom lane

diff --git a/doc/src/sgml/ref/postgres-ref.sgml b/doc/src/sgml/ref/postgres-ref.sgml
index e2e9909..d60d4ff 100644
*** a/doc/src/sgml/ref/postgres-ref.sgml
--- b/doc/src/sgml/ref/postgres-ref.sgml
*** PostgreSQL documentation
*** 529,535 
  
  
  
!  The following options only apply to the single-user mode.
  
  
  
--- 529,537 
  
  
  
!  The following options only apply to the single-user mode
!  (see ).
  
  
  
*** PostgreSQL documentation
*** 558,564 
-E

 
! Echo all commands.
 

   
--- 560,566 
-E

 
! Echo all commands to standard output before executing them.
 

   
*** PostgreSQL documentation
*** 567,573 
-j

 
! Disables use of newline as a statement delimiter.
 

   
--- 569,576 
-j

 
! Use semicolon followed by two newlines, rather than just newline,
! as the command entry terminator.
 

   
*** PostgreSQL documentation
*** 760,767 

   
  
!  
!   Usage
  
 
  To start a single-user mode server, use a command like
--- 763,770 

   
  
!  
!   Single-User Mode
  
 
  To start a single-user mode server, use a command like
*** PostgreSQL documentation
*** 778,807 
  entry terminator; there is no intelligence about semicolons,
  as there is in psql.  To continue a command
  across multiple lines, you must type backslash just before each
! newline except the last one.
 
  
 
! But if you use the -j command line switch, then newline does
! not terminate command entry.  In this case, the server will read the standard input
! until the end-of-file (EOF) marker, then
! process the input as a single command string.  Backslash-newline is not
! treated specially in this case.
 
  
 
  To quit the session, type EOF
  (ControlD, usually).
! If you've
! used -j, two consecutive EOFs are needed to exit.
 
  
 
  Note that the single-user mode server does not provide sophisticated
  line-editing features (no command history, for 

Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Andres Freund
On 2015-12-17 09:47:57 -0500, Robert Haas wrote:
> On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund  wrote:
> > I'd consider using a LWLock instead of a spinlock here. I've seen this
> > contended in a bunch of situations, and the queued behaviour, combined
> > with directed wakeups on the OS level, ought to improve the worst case
> > behaviour measurably.
> 
> Amit had the idea a while back of trying to replace the HASHHDR mutex
> with something based on atomic ops.  It seems hard to avoid the
> attendant A-B-A problems but maybe there's a way.

I'd really like to see it being replaced by a queuing lock
(i.e. lwlock) before we go there. And then maybe partition the freelist,
and make nentries an atomic.  Just doing those might already be good
enough and should be a lot easier.

Andres




Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-12-17 Thread Dean Rasheed
On 10 December 2015 at 20:02, Tom Lane  wrote:
> Robert Haas  writes:
>> It seems to be a loss of 4 digits in every case I've seen.
>
> I wouldn't have a problem with, say, throwing in an extra DEC_DIGITS worth
> of rscale in each of these functions so that the discrepancies tend to
> favor more significant digits out, rather than fewer.  I don't know that
> it's worth trying to guarantee that the result is never fewer digits than
> before, and I certainly wouldn't want to make the rules a lot more complex
> than what's there now.  But perhaps we could cover most cases easily.
>

Looking at this, it appears that those extra digits of precision for
log(0.5) in the old code are an anomaly that only occurs for a certain
range of inputs. According to the code comments these functions
intentionally output at least around 16 significant digits (or more if
the input has greater precision), so that they output at least the
precision of floating point. For example, in both 9.5 and HEAD:

select exp(5::numeric);
exp

 148.41315910257660

select exp(0.5::numeric);
exp

 1.6487212707001281

select ln(5::numeric);
 ln

 1.6094379124341004

select ln(0.5::numeric);
 ln
-
 -0.6931471805599453

select power(0.5::numeric, 0.4::numeric);
   power

 0.7578582832551990


However, the old log() code would occasionally output 4 more digits
than that, due to its mis-estimation of the result weight, which was
used to determine the output scale. So, for example, in 9.5:

select log(0.0005::numeric);
 log
-
 -3.3010299956639812

select log(0.005::numeric);
 log
-
 -2.3010299956639812

select log(0.05::numeric);
   log
-
 -1.30102999566398119521

select log(0.5::numeric);
   log
-
 -0.30102999566398119521

select log(5::numeric);
  log

 0.69897000433601880479

select log(50::numeric);
log

 1.6989700043360188

select log(500::numeric);
log

 2.6989700043360188

i.e., for a certain range of inputs the result precision jumps from 16
to 20 digits after the decimal point, whereas in HEAD the precision of
the results is more consistent across the range:

select log(0.0005::numeric);
 log
-
 -3.3010299956639812

select log(0.005::numeric);
 log
-
 -2.3010299956639812

select log(0.05::numeric);
 log
-
 -1.3010299956639812

select log(0.5::numeric);
 log
-
 -0.3010299956639812

select log(5::numeric);
log

 0.6989700043360188

select log(50::numeric);
log

 1.6989700043360188

select log(500::numeric);
log

 2.6989700043360188


With other inputs, the actual number of significant digits can vary
between 16 and 17, but it's generally better behaved than the old
code, even though it sometimes produces fewer digits. I think it ought
to be sufficient to release note that the number of digits returned by
these functions may have changed, in addition to the results being
more accurate.

Regards,
Dean




Re: Fwd: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Robert Haas
On Thu, Dec 17, 2015 at 12:14 PM, Andres Freund  wrote:
> On 2015-12-17 09:04:25 -0800, Jeff Janes wrote:
>> > But I'm somewhat confused what this has to do with Andres's report.
>>
>> Doesn't it explain the exact situation he is in, where the oldest
>> database is 200 million, but the cluster as a whole is 2 billion?
>
> There were no crashes, so no, I don't think so.

Backing up a step, do we think that the fact that this was running in
a shell rather than a screen is relevant somehow?  Or did something
happen to this particular cluster totally unrelated to that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fwd: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Andres Freund
On 2015-12-17 13:08:15 -0500, Robert Haas wrote:
> On Thu, Dec 17, 2015 at 12:14 PM, Andres Freund  wrote:
> > On 2015-12-17 09:04:25 -0800, Jeff Janes wrote:
> >> > But I'm somewhat confused what this has to do with Andres's report.
> >>
> >> Doesn't it explain the exact situation he is in, where the oldest
> >> database is 200 million, but the cluster as a whole is 2 billion?
> >
> > There were no crashes, so no, I don't think so.
> 
> Backing up a step, do we think that the fact that this was running in
> a shell rather than a screen is relevant somehow?  Or did something
> happen to this particular cluster totally unrelated to that?

I reran the whole thing on a separate, but very similar, VM. Just
checked. Same thing happened. This time I have log files and
everything. No time to investigate right now, but it's reproducible if
you accept running tests for a week or so.




Re: [HACKERS] PATCH: postpone building buckets to the end of Hash (in HashJoin)

2015-12-17 Thread Robert Haas
On Mon, Dec 14, 2015 at 3:04 PM, Tomas Vondra
 wrote:
> attached is v1 of one of the hashjoin improvements mentioned in September in
> the lengthy thread [1].
>
> The main objection against simply removing the MaxAllocSize check (and
> switching to MemoryContextAllocHuge) is that if the number of rows is
> overestimated, we may consume significantly more memory than necessary.
>
> We run into this issue because we allocate the buckets at the very
> beginning, based on the estimate. I've noticed we don't really need to do
> that - we don't really use the buckets until after the Hash node completes,
> and we don't even use it when incrementing the number of batches (we use the
> dense allocation for that).
>
> So this patch removes this - it postpones allocating the buckets to the end
> of MultiExecHash(), and at that point we know pretty well what is the
> optimal number of buckets.
>
> This makes tracking nbuckets_optimal/log2_nbuckets_optimal unnecessary, as
> we can simply use nbuckets/log2_nbuckets for that purpose. I've also removed
> nbuckets_original, but maybe that's not a good idea and we want to keep that
> information (OTOH we never use that number of buckets).
>
> This patch does not change the estimation in ExecChooseHashTableSize() at
> all, because we still need to do that to get nbucket/nbatch. Maybe this is
> another opportunity for improvement in case of overestimates, because in
> that case it may happen that we do batching even when we could do without
> it. So we might start with nbuckets=1024 and nbatches=1, and only switch to
> the estimated number of batches if really needed.
>
> This patch also does not mess with the allocation, i.e. it still uses the
> MaxAllocSize limit (which amounts to ~256MB due to the doubling, IIRC), but
> it should make it easier to do that change.

If this doesn't regress performance in the case where the number of
buckets is estimated accurately to begin with, then I think this is a
great idea.  Can you supply some performance tests results for that
case, and maybe some of the other cases also?
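For readers skimming the thread, here is a rough sketch of the postponed bucket
build described in the quoted text above, with hypothetical structure and
function names; it is not the patch itself, just an illustration of building
the bucket array only once the true tuple count is known.

#include <stdint.h>
#include <stdlib.h>

typedef struct TupleSketch
{
    uint32_t            hashvalue;
    struct TupleSketch *next;       /* chain link, reused for the bucket lists */
} TupleSketch;

typedef struct
{
    long          ntuples;          /* tuples accumulated while hashing */
    int           nbuckets;
    TupleSketch **buckets;
    TupleSketch  *all_tuples;       /* simple list standing in for dense chunks */
} HashTableSketch;

/* pick the next power of two large enough for the actual tuple count */
static int
choose_nbuckets(long ntuples)
{
    int nbuckets = 1024;

    while ((long) nbuckets < ntuples && nbuckets < (1 << 30))
        nbuckets <<= 1;
    return nbuckets;
}

/* called only after the inner side is exhausted, so no estimate is needed */
static void
finalize_hash_table(HashTableSketch *ht)
{
    TupleSketch *tup = ht->all_tuples;

    ht->nbuckets = choose_nbuckets(ht->ntuples);
    ht->buckets = calloc(ht->nbuckets, sizeof(TupleSketch *));

    while (tup != NULL)
    {
        TupleSketch *next = tup->next;              /* save before relinking */
        int          bucketno = tup->hashvalue & (ht->nbuckets - 1);

        tup->next = ht->buckets[bucketno];          /* push onto bucket chain */
        ht->buckets[bucketno] = tup;
        tup = next;
    }
    ht->all_tuples = NULL;
}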

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] psql - -dry-run option

2015-12-17 Thread Pavel Stehule
Hi

When I read the blog post
http://www.depesz.com/2015/12/14/waiting-for-9-6-psql-support-multiple-c-and-f-options-and-allow-mixing-them/
where a dry-run mode is emulated, I thought we could implement it very
simply.

Any notes or comments?

Regards

Pavel


Re: Fwd: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Andres Freund
On 2015-12-17 09:04:25 -0800, Jeff Janes wrote:
> > But I'm somewhat confused what this has to do with Andres's report.
> 
> Doesn't it explain the exact situation he is in, where the oldest
> database is 200 million, but the cluster as a whole is 2 billion?

There were no crashes, so no, I don't think so.




Re: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Jeff Janes
On Thu, Dec 10, 2015 at 1:55 PM, Andres Freund  wrote:

>
> Looking at datfrozenxid:
> postgres=# select datname, datfrozenxid, age(datfrozenxid) FROM pg_database ;
>   datname  | datfrozenxid |    age
> -----------+--------------+-----------
>  template1 |   3357685367 |         0
>  template0 |   3357685367 |         0
>  postgres  |   3159867733 | 197817634
> (3 rows)
> reveals that the launcher doesn't do squat because it doesn't think it
> needs to do anything.
>
> (gdb) p *ShmemVariableCache
> $3 = {nextOid = 24576, oidCount = 0, nextXid = 3357685367, oldestXid = 
> 1211201715, xidVacLimit = 1411201715, xidWarnLimit = 3347685362,
>   xidStopLimit = 3357685362, xidWrapLimit = 3358685362, oldestXidDB = 12380, 
> oldestCommitTs = 0, newestCommitTs = 0,
>   latestCompletedXid = 3357685366}

Do we know how template0 and template1 got frozen with xids that were
5 past the xidStopLimit?  Is that part of the mystery here, or is that
normal?

Cheers,

Jeff




Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

2015-12-17 Thread Robert Haas
On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra
 wrote:
> So we know that we should expect about
>
>   (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)
>
>   (   regular WAL) + (  FPW WAL )
>
> to be produced until the end of the current checkpoint. I don't have a clear
> idea how to transform this into the 'progress' yet, but I'm pretty sure
> tracking the two types of WAL is a key to a better solution. The x^1.5 is
> probably a step in the right direction, but I don't feel particularly
> confident about the 1.5 (which is rather arbitrary).

If it works well empirically, does it really matter that it's
arbitrary?  I mean, the entire planner is full of fairly arbitrary
assumptions about which things to consider in the cost model and which
to ignore.  The proof that we have made good decisions there is in the
query plans it generates.  (The proof that we have made bad decisions
in some cases is in the query plans, too.)

I think a bigger problem for this patch is that Heikki seems to have
almost completely disappeared.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Should TIDs be typbyval = FLOAT8PASSBYVAL to speed up CREATE INDEX CONCURRENTLY?

2015-12-17 Thread Corey Huinker
On Wed, Dec 16, 2015 at 4:24 PM, Peter Geoghegan  wrote:

> On Wed, Dec 16, 2015 at 12:28 PM, Robert Haas 
> wrote:
> >> I seem to be able to produce these sorting patches at a much greater
> >> rate than they can be committed, in part because Robert is the only
> >> one that ever reviews them, and he is only one person.
> >
> > I object to that vicious slander.  I am at least three people, if not
> more!
>
> I was referring only to the Robert that reviews my sorting patches.  :-)
>
> > Meanwhile, I did some simple benchmarking on your latest patch on my
> > MacBook Pro.  I did pgbench -i -s 100 and then tried:
> >
> > create index x on pgbench_accounts (aid);
> > create index concurrently x on pgbench_accounts (aid);
> >
> > The first took about 6.9 seconds.  The second took about 11.3 seconds
> > patched versus 14.6 seconds unpatched.  That's certainly a healthy
> > improvement.
>
> That seems pretty good. It probably doesn't matter, but FWIW it's
> likely that your numbers are not as good as mine because this ends up
> with a perfect logical/physical correlation, which the quicksort
> precheck [1] does very well on when sorting the TIDs (since input is
> *perfectly* correlated, as opposed to 99.99% correlated, a case that
> does poorly [2]).
>
> > I have also reviewed the code, and it looks OK to me, so committed.
>
> Thanks!
>
> [1] Commit a3f0b3d68f9a5357a3f72b40a45bcc714a9e0649
> [2] http://www.postgresql.org/message-id/54eb580c.2000...@2ndquadrant.com
> --
> Peter Geoghegan
>
>
>

My apologies to Peter and all the Roberts, I wasn't able to set up a test
fast enough. Glad it got committed.


Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-12-17 Thread Robert Haas
On Thu, Dec 17, 2015 at 10:38 AM, Dean Rasheed  wrote:
> On 10 December 2015 at 20:02, Tom Lane  wrote:
>> Robert Haas  writes:
>>> It seems to be a loss of 4 digits in every case I've seen.
>>
>> I wouldn't have a problem with, say, throwing in an extra DEC_DIGITS worth
>> of rscale in each of these functions so that the discrepancies tend to
>> favor more significant digits out, rather than fewer.  I don't know that
>> it's worth trying to guarantee that the result is never fewer digits than
>> before, and I certainly wouldn't want to make the rules a lot more complex
>> than what's there now.  But perhaps we could cover most cases easily.
>>
>
> Looking at this, it appears that those extra digits of precision for
> log(0.5) in the old code are an anomaly that only occurs for a certain
> range of inputs. According to the code comments these functions
> intentionally output at least around 16 significant digits (or more if
> the input has greater precision), so that they output at least the
> precision of floating point. For example, in both 9.5 and HEAD:
>
> select exp(5::numeric);
>         exp
> --------------------
>  148.41315910257660
>
> select exp(0.5::numeric);
>         exp
> --------------------
>  1.6487212707001281
>
> select ln(5::numeric);
>          ln
> --------------------
>  1.6094379124341004
>
> select ln(0.5::numeric);
>          ln
> ---------------------
>  -0.6931471805599453
>
> select power(0.5::numeric, 0.4::numeric);
>        power
> --------------------
>  0.7578582832551990
>
>
> However, the old log() code would occasionally output 4 more digits
> than that, due to its mis-estimation of the result weight, which was
> used to determine the output scale. So, for example, in 9.5:
>
> select log(0.0005::numeric);
>          log
> ---------------------
>  -3.3010299956639812
>
> select log(0.005::numeric);
>          log
> ---------------------
>  -2.3010299956639812
>
> select log(0.05::numeric);
>            log
> -------------------------
>  -1.30102999566398119521
>
> select log(0.5::numeric);
>            log
> -------------------------
>  -0.30102999566398119521
>
> select log(5::numeric);
>           log
> ------------------------
>  0.69897000433601880479
>
> select log(50::numeric);
>         log
> --------------------
>  1.6989700043360188
>
> select log(500::numeric);
>         log
> --------------------
>  2.6989700043360188
>
> i.e., for a certain range of inputs the result precision jumps from 16
> to 20 digits after the decimal point, whereas in HEAD the precision of
> the results is more consistent across the range:
>
> select log(0.0005::numeric);
>          log
> ---------------------
>  -3.3010299956639812
>
> select log(0.005::numeric);
>          log
> ---------------------
>  -2.3010299956639812
>
> select log(0.05::numeric);
>          log
> ---------------------
>  -1.3010299956639812
>
> select log(0.5::numeric);
>          log
> ---------------------
>  -0.3010299956639812
>
> select log(5::numeric);
>         log
> --------------------
>  0.6989700043360188
>
> select log(50::numeric);
>         log
> --------------------
>  1.6989700043360188
>
> select log(500::numeric);
>         log
> --------------------
>  2.6989700043360188
>
>
> With other inputs, the actual number of significant digits can vary
> between 16 and 17, but it's generally better behaved than the old
> code, even though it sometimes produces fewer digits. I think it ought
> to be sufficient to release note that the number of digits returned by
> these functions may have changed, in addition to the results being
> more accurate.

Thanks for the additional explanation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Fwd: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Jeff Janes
Sorry, accidentally failed to include the list originally, here it is
for the list:

On Dec 16, 2015 9:52 AM, "Robert Haas"  wrote:
>
> On Fri, Dec 11, 2015 at 1:08 PM, Jeff Janes  wrote:
> > Since changes to datfrozenxid are WAL logged at the time they occur,
> > but the supposedly-synchronous change to ShmemVariableCache is not WAL
> > logged until the next checkpoint, a well timed crash can leave you in
> > the state where the system is in a tizzy about wraparound but each
> > database says "Nope, not me".
>
> ShmemVariableCache is an in-memory data structure, so it's going to
> get blown away and rebuilt on a crash.  But I guess it gets rebuild
> from the contents of the most recent checkpoint record, so that
> doesn't actually help.  However, I wonder if it would be safe to for
> the autovacuum launcher to calculate an updated value and call
> SetTransactionIdLimit() to update ShmemVariableCache.

I was wondering if that should happen either at the end of crash
recovery (but I suppose you can't poll pg_database yet at that
point?), or immediately before throwing the "database is not accepting
commands to avoid wraparound data loss" error.

At which point would it make sense for the launcher do it?  I guess
just after it was started up under PMSIGNAL_START_AUTOVAC_LAUNCHER
conditions?

> But I'm somewhat confused what this has to do with Andres's report.

Doesn't it explain the exact situation he is in, where the oldest
database is 200 million, but the cluster as a whole is 2 billion?
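(The per-database side of that comparison is easy to check with a plain
catalog query, nothing specific to this problem:

    select datname, datfrozenxid, age(datfrozenxid) as xid_age
      from pg_database
     order by age(datfrozenxid) desc;

whereas the cluster-wide limits live in ShmemVariableCache, as in the gdb
output upthread.)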

Cheers,

Jeff




Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Robert Haas
On Thu, Dec 17, 2015 at 11:03 AM, Aleksander Alekseev
 wrote:
>> It'd really like to see it being replaced by a queuing lock
>> (i.e. lwlock) before we go there. And then maybe partition the
>> freelist, and make nentries an atomic.
>
> I believe I just implemented something like this (see attachment). The
> idea is to partition PROCLOCK hash table manually into NUM_LOCK_
> PARTITIONS smaller and non-partitioned hash tables. Since these tables
> are non-partitioned spinlock is not used and there is no lock
> contention.

Oh, that's an interesting idea.  I guess the problem is that if the
freelist is unshared, then users might get an error that the lock
table is full when some other partition still has elements remaining.

> On 60-core server we gain 3.5-4 more TPS according to benchmark
> described above. As I understand there is no performance degradation in
> other cases (different CPU, traditional pgbench, etc).

3.5-4 more TPS, or 3.5 times more TPS?  Can you share the actual numbers?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Freeze avoidance of very large table.

2015-12-17 Thread Masahiko Sawada
On Thu, Dec 17, 2015 at 11:47 AM, Michael Paquier
 wrote:
> On Thu, Dec 10, 2015 at 3:31 AM, Robert Haas  wrote:
>> On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian  wrote:
>>> On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote:
 On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes  wrote:
 > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  
 > wrote:
 >>
 >> Yeah, we need to consider to compute checksum if enabled.
 >> I've changed the patch, and attached.
 >> Please review it.
 >
 > Thanks for the update.  This now conflicts with the updates done to
 > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the
 > conflict in order to do some testing, but I'd like to get an updated
 > patch from the author in case I did it wrong.  I don't want to find
 > bugs that I just introduced myself.
 >

 Thank you for having a look.
>>>
>>> I would not bother mentioning this detail in the pg_upgrade manual page:
>>>
>>> +   Since the format of visibility map has been changed in version 9.6,
>>> +   pg_upgrade creates and rewrite new 
>>> '_vm'
>>> +   file even if upgrading from 9.5 or before to 9.6 or later with link 
>>> mode (-k).
>>
>> Really?  I know we don't always document things like this, but it
>> seems like a good idea to me that we do so.
>
> Just going though v30...
>
> +frozen. The whole-table freezing is occuerred only when all pages happen 
> to
> +require freezing to freeze rows. In other cases such as where
>
> I am not really getting the meaning of this sentence. Shouldn't this
> be reworded something like:
> "Freezing occurs on the whole table once all pages of this relation require 
> it."
>
> +relfrozenxid is more than
> vacuum_freeze_table_age
> +transcations old, where VACUUM's FREEZE
> option is used,
> +VACUUM can skip the pages that all tuples on the page
> itself are
> +marked as frozen.
> +When all pages of table are eventually marked as frozen by
> VACUUM,
> +after it's finished age(relfrozenxid) should be a little more
> +than the vacuum_freeze_min_age setting that was used (more by
> +the number of transcations started since the VACUUM started).
> +If the advancing of relfrozenxid is not happend until
> +autovacuum_freeze_max_age is reached, an autovacuum will soon
> +be forced for the table.
>
> s/transcations/transactions.
>
> + n_frozen_page
> + integer
> + Number of frozen pages
> n_frozen_pages?
>
> make check with pg_upgrade is failing for me:
> Visibility map rewriting test failed
> make: *** [check] Error 1

make check with pg_upgrade completes successfully in my environment.
Could you give me more information about this?

> -GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
> +GetVisibilitymapPins(Relation relation, Buffer buffer1, Buffer buffer2,
> This looks like an unrelated change.
>
>   * Clearing a visibility map bit is not separately WAL-logged.  The callers
>   * must make sure that whenever a bit is cleared, the bit is cleared on WAL
> - * replay of the updating operation as well.
> + * replay of the updating operation as well.  And all-frozen bit must be
> + * cleared with all-visible at the same time.
> This could be reformulated. This is just an addition on top of the
> existing paragraph.
>
> + * The visibility map has the all-frozen bit which indicates all tuples on
> + * corresponding page has been completely frozen, so the visibility map is 
> also
> "have been completely frozen".
>
> -/* Mapping from heap block number to the right bit in the visibility map */
> -#define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
> -#define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) /
> HEAPBLOCKS_PER_BYTE)
> Those two declarations are just noise in the patch: those definitions
> are unchanged.
>
> -   elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
> +   elog(DEBUG1, "vm_clear %s block %d",
> RelationGetRelationName(rel), heapBlk);
> This may be better as a separate patch.

I've attached 001 patch for this separately.

>
> -visibilitymap_count(Relation rel)
> +visibilitymap_count(Relation rel, BlockNumber *all_frozen)
> I think that this routine would gain in clarity if reworked as follows:
> visibilitymap_count(Relation rel, BlockNumber *all_visible,
> BlockNumber *all_frozen)
>
> +   /*
> +* Report ANALYZE to the stats collector, too.
> However, if doing
> +* inherited stats we shouldn't report, because the
> stats collector only
> +* tracks per-table stats.
> +*/
> +   if (!inh)
> +   pgstat_report_analyze(onerel, totalrows,
> totaldeadrows, relallfrozen);
> Here we already know that this is working in the non-inherited code
> path. As a whole, 

Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-12-17 Thread Masahiko Sawada
On Mon, Dec 14, 2015 at 2:57 PM, Kyotaro HORIGUCHI
 wrote:
> Thank you for the new patch.
>
> At Wed, 9 Dec 2015 20:59:20 +0530, Masahiko Sawada  
> wrote in 
>> On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada  
>> wrote:
>> > I agree with #3 way and the s_s_name format you suggested.
>> > I think that It's extensible and is tolerable for future changes.
>> > I'm going to implement the patch based on this idea if other hackers
>> > agree with this design.
>>
>> Please find the attached draft patch which supports multi sync replication.
>> This patch adds a GUC parameter synchronous_replication_method, which
>> represent the method of synchronous replication.
>>
>> [Design of replication method]
>> synchronous_replication_method has two values; 'priority' and
>> '1-priority' for now.
>> We can expand the kind of its value (e.g, 'quorum', 'json' etc) in the 
>> future.
>>
>> * s_r_method = '1-priority'
>> This method is for backward compatibility, so the syntax of s_s_names
>> is same as today.
>> The behavior is same as well.
>>
>> * s_r_method = 'priority'
>> This method is for multiple synchronous replication using priority method.
>> The syntax of s_s_names is,
>>,  [, ...]
>
> Is there anyone opposed to this?
>
>> For example, s_r_method = 'priority' and s_s_names = '2, node1, node2,
>> node3' means that the master waits for  acknowledge from at least 2
>> lowest priority servers.
>> If 4 standbys(node1 - node4) are available, the master server waits
>> acknowledge from 'node1' and 'node2.
>> The each status of wal senders are;
>>
>> =# select application_name, sync_state from pg_stat_replication order
>> by application_name;
>> application_name | sync_state
>> --+
>> node1| sync
>> node2| sync
>> node3| potential
>> node4| async
>> (4 rows)
>>
>> After 'node2' crashed, the master will wait for acknowledge from
>> 'node1' and 'node3'.
>> The each status of wal senders are;
>>
>> =# select application_name, sync_state from pg_stat_replication order
>> by application_name;
>> application_name | sync_state
>> --+
>> node1| sync
>> node3| sync
>> node4| async
>> (3 rows)
>>
>> [Changing replication method]
>> When we want to change the replication method, we have to change the
>> s_r_method  at first, and then do pg_reload_conf().
>> After changing replication method, we can change the s_s_names.

Thank you for reviewing the patch!
Please find attached the latest patch.

> Mmm. I should be able to be changed at once, because s_r_method
> and s_s_names contradict each other during the intermediate
> state.

Sorry to confuse you. I meant the case where we want to change the
replication method using ALTER SYSTEM.


>> [Expanding replication method]
>> If we want to expand new replication method additionally, we need to
>> implement two functions for each replication method:
>> * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys)
>>   This function obtains the list of standbys considered as synchronous
>> at that time, and return its length.
>> * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos)
>>   This function obtains LSNs(write, flush) considered as synced.
>>
>> Also, this patch debug code is remain yet, you can debug this behavior
>> using by enable DEBUG_REPLICATION macro.
>>
>> Please give me feedbacks.
>
> I haven't looked into this fully (sorry) but I'm concerned about
> several points.
>
>
> - I feel that some function names looks too long. For example
>   SyncRepGetSynchronousStandbysOnePriority occupies more than the
>   half of a line. (However, the replication code alrady has many
>   long function names..)

Yeah, it would be better to change 'Synchronous' to 'Sync' at least.

> - The comment below of SyncRepGetSynchronousStandbyOnePriority,
>   >   /* Find lowest priority standby */
>
>   The code where the comment is for is doing the correct
>   thing. Howerver, the comment is confusing. A lower priority
>   *value* means a higher priority.

Fixed.

> - SyncRepGetSynchronousStandbys checks all if()s even when the
>   first one matches. Use switch or "else if" there if you they
>   are exclusive each other.

Fixed.

> - Do you intende the DEBUG_REPLICATION code in
>   SyncRepGetSynchronousStandbys*() to be the final shape?  The
>   same code blocks which can work for both method should be in
>   their common caller but SyncRepGetSyncLsns*() are
>   headache. Although it might need more refactoring, I'm sorry
>   but I don't see a desirable shape for now.

I'm not intending the DEBUG_REPLICATION code to be the final shape.
That code has been removed from this version of the patch.

>   By the way, palloc(20)/free() in such short term looks
>   ineffective.
>
> - SyncRepGetSyncLsnsPriority
>
>   

Re: [HACKERS] psql - -dry-run option

2015-12-17 Thread Pavel Stehule
2015-12-17 20:14 GMT+01:00 Pavel Stehule :

>
>
> 2015-12-17 20:03 GMT+01:00 Tom Lane :
>
>> Pavel Stehule  writes:
>> > when I read a blog
>> >
>> http://www.depesz.com/2015/12/14/waiting-for-9-6-psql-support-multiple-c-and-f-options-and-allow-mixing-them/
>> > where is emulated dry-run mode, I though so we can implement it very
>> > simply.
>>
>> Not one that is actually reliable.  All a script would have to do is
>> include its own begin/commit commands, and it would override what you
>> are talking about.  It's okay, in my opinion, if the -1 switch is just a
>> half-baked "best effort" solution.  It's not okay to provide a --dry-run
>> switch that is equally full of holes, because if someone were to actually
>> rely on it to not execute the script, the possibility of an override would
>> amount to a security bug.
>>
>
> My idea was to enforce the global transaction (-1) option and ensure
> STOP_ON_ERROR mode (which cannot be changed later). Any inner COMMIT or ROLLBACK
> would have to be disallowed (or ignored) - which can be a problem :(
>
> And if all statements from the input stream are processed, then a ROLLBACK is
> emitted, but the result is success.
>
> Pavel
>

or different idea - just enforce syntax check without execution.


>
>
>>
>> regards, tom lane
>>
>
>


Re: [HACKERS] Should TIDs be typbyval = FLOAT8PASSBYVAL to speed up CREATE INDEX CONCURRENTLY?

2015-12-17 Thread Peter Geoghegan
On Thu, Dec 17, 2015 at 9:29 AM, Corey Huinker  wrote:
> My apologies to Peter and all the Roberts, I wasn't able to set up a test
> fast enough. Glad it got committed.

I don't use the term "slam-dunk" casually. :-)

This was the first time I ever referred to a patch of mine that way.
It doesn't happen very often, but it does happen.

-- 
Peter Geoghegan




Re: [HACKERS] psql - -dry-run option

2015-12-17 Thread Alvaro Herrera
Joe Conway wrote:
> On 12/17/2015 11:58 AM, Christopher Browne wrote:
> > On 17 December 2015 at 14:16, Pavel Stehule  > > wrote:
> >> or different idea - just enforce syntax check without execution.
> > 
> > That seems pretty cool...  I'd find "syntax check without execution" to
> > be pretty useful to test SQL (and especially DDL).
> 
> Like this?
> https://github.com/jconway/pgsynck

I thought the idea was to test the execution of the commands themselves,
not just the syntax.  Something like add a column here and see whether
this other complex UPDATE populates it correctly.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Declarative partitioning

2015-12-17 Thread Robert Haas
On Mon, Dec 14, 2015 at 2:14 AM, Amit Langote
 wrote:
> Syntax to create a partitioned table (up to 2 levels of partitioning):
>
> CREATE TABLE foo (
> ...
> )
> PARTITION BY R/L ON (key0)
> SUBPARTITION BY R/L ON (key1)
> [(PARTITION foo_1 FOR VALUES  [] []
> [(SUBPARTITION foo_1_1 FOR VALUES  [] [],
> ...)], ...)];
>
> The above creates two pg_partitioned_rel entries for foo with partlevel 0
> and 1, for key0 and key1, respectively. For foo_1 and foo_1_1, this
> creates pg_partition entries, with foo and foo_1 as partparent,
> respectively.
>
> Why just 2 levels? - it seems commonplace and makes the syntax more
> intuitive? I guess it might be possible to generalize the syntax for
> multi-level partitioning. Ideas? If we want to support the notion of
> sub-partition template in future, that would require some thought, more
> importantly proper catalog organization for the same.

I do not think this is a particularly good idea.  You're going to need
to dump each partition separately at least in --binary-upgrade mode,
because each is going to have its own magic OIDs that need to be
restored, and also because there will most likely be at least some
properties that are going to vary between partitions.  You could
require that every partition have exactly the same set of columns,
constraints, rules, triggers, policies, attribute defaults, comments,
column comments, and everything else that might be different from one
partition to another, and further require that they have exactly
matching indexes.  It would take a fair amount of code to prohibit all
that, but it could be done.  However, do we really want that?  There
may well be some things where we want to enforce that the parent and
the child are exactly identical, but I doubt we want that for
absolutely every property, current and future, of the partition.  And
even if you did, because of the --binary-upgrade stuff, you still need
to be able to dump them separately.

Therefore, I believe it is a whole lot better to make the primary
syntax for table partitioning something where you issue a CREATE
statement for the parent and then a CREATE statement for each child.
If we want to also have a convenience syntax so that people who want
to create a parent and a bunch of children in one fell swoop can do
so, fine.
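To make that concrete, I mean something of this general shape (strawman
syntax only, not a worked-out proposal, and the table and column names are
invented for the example):

    CREATE TABLE measurement (
        logdate    date NOT NULL,
        peaktemp   int
    ) PARTITION BY RANGE (logdate);

    CREATE TABLE measurement_y2015 PARTITION OF measurement
        FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');

i.e. the parent and each child get their own CREATE statement, and
therefore their own entry in a dump.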

I would not choose to model the syntax for creating partitions on
Oracle.  I don't find that syntax particularly nice or easy to
remember.  I say PARTITION BY RANGE, and then inside the parentheses I
use the PARTITION keyword for each partition?  Really?  But I think
copying the style while having the details be incompatible is an even
worse idea.

> What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied
> directly to partitions on case-by-case basis (they are tables under the
> hood after all), we should restrict AT to the master table. Most of the AT
> changes implicitly propagate from the master table to its partitions. Some
> of them could be directly applied to partitions and/or sub-partitions such
> as rename, storage manipulations like - changing tablespace, storage
> parameters (reloptions), etc.:
>
> ALTER TABLE foo
> RENAME PARTITION  TO ;
>
> ALTER TABLE foo
> RENAME SUBPARTITION  TO ;
>
> ALTER TABLE foo
> SET TABLESPACE ... [DEFAULT] FOR PARTITION ;
>
> ALTER TABLE foo
> SET TABLESPACE ... FOR SUBPARTITION ;
>
> ALTER TABLE foo
> SET (storage_parameter = value) [DEFAULT] FOR PARTITION ;
>
> ALTER TABLE foo
> SET (storage_parameter = value) FOR SUBPARTITION ;

I don't think this is a very good idea.  This is basically proposing
that for every DDL command that you can apply to a table, you have to
spell it differently for a partition.  That seems like a lot of extra
work for no additional functionality.

> By the way, should we also allow changing the logging of
> partitions/sub-partitions as follows?

Again, I think you're coming at this from the wrong direction.
Instead of saying we're going to disallow all changes to the
partitions and then deciding we need to allow certain changes after
all, I think we should allow everything that is currently allowed for
an inherited table and then decide which of those things we need to
prohibit, and why.  For example, if you insist that a child table has
to have a tuple descriptor that matches the parent, that can improve
efficiency: Append won't need to project, and so on.  But it now
becomes very difficult to support taking a stand-alone table and
making it a partition of an existing partitioned table, because the
set of dropped columns might not match.  Having to give an error in
that case that amounts to "we're sorry, we can't attach your partition to
the partitioning hierarchy because of some invisible state that you
can't see" isn't very nice.  Now I'm not saying that isn't the right
decision, but I think the design choices here need to be carefully
thought about.

Stepping away from that particular example, a blanket prohibition on
changing any 

Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-12-17 Thread Pavel Stehule
2015-12-14 23:09 GMT+01:00 Daniel Verite :

> Pavel Stehule wrote:
>
> > postgres=# \crosstabview 4 +month label
> >
> > Maybe using optional int order column instead label is better - then you
> can
> > do sort on client side
> >
> > so the syntax can be "\crosstabview VCol [+/-]HCol [[+-]HOrderCol]
>
> In the meantime I've followed a different idea: allowing the
> vertical header to be sorted too, still server-side.
>
> That's because to me, the first impulse for a user noticing that
> it's not sorted vertically would be to write
>  \crosstabview +customer month
> rather than figure out the
>  \crosstabview customer +month_number month_name
> invocation.
> But both ways aren't even mutually exclusive. We could support
>  \crosstabview [+|-]colV[:labelV] [+|-]colH[:labelH]
> it's more complicated to understand, but not  harder to implement.
>
> Also, a non-zero FETCH_COUNT is supported by this version of the patch,
> if the first internal FETCH retrieves less than FETCH_COUNT rows.
> Otherwise a specific error is emitted.
>
> Also there are minor changes in arguments and callers following
> recent code changes for \o
>
> Trying to crosstab with 10k+ distinct values vertically, I've noticed
> that the current code is too slow, spending too much time
> sorting.  I'm currently replacing its simple arrays of distinct values
> with AVL binary trees, which I expect to be much more efficient for
> this.
>

I played with the last version and it is looking good. I have only one note,
but it is subjective - so it can be ignored if you don't like it.

The symbol 'X' in two-column mode should be centred - now it is aligned to
the left, which is not nice. For the unicode line style I prefer some unicode
symbol - your chr(10003) is nice.

Regards

Pavel



>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: Fwd: [HACKERS] Cluster "stuck" in "not accepting commands to avoid wraparound data loss"

2015-12-17 Thread Robert Haas
On Thu, Dec 17, 2015 at 1:20 PM, Andres Freund  wrote:
> On 2015-12-17 13:08:15 -0500, Robert Haas wrote:
>> On Thu, Dec 17, 2015 at 12:14 PM, Andres Freund  wrote:
>> > On 2015-12-17 09:04:25 -0800, Jeff Janes wrote:
>> >> > But I'm somewhat confused what this has to do with Andres's report.
>> >>
>> >> Doesn't it explain the exact situation he is in, where the oldest
>> >> database is 200 million, but the cluster as a whole is 2 billion?
>> >
>> > There were no crashes, so no, I don't think so.
>>
>> Backing up a step, do we think that the fact that this was running in
>> a shell rather than a screen is relevant somehow?  Or did something
>> happen to this particular cluster totally unrelated to that?
>
> I reran the whole thing on a separate, but very similar, VM. Just
> checked. Same thing happened. This time I have log files and
> everything. No time to investigate right now, but it's reproducible if
> you accept running tests for a week or so.

I don't think I'm going to speculate further until you have time to
investigate more.  It seems clear that autovacuum is going wrong
somehow, but it's extremely unclear why.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] psql - -dry-run option

2015-12-17 Thread Pavel Stehule
2015-12-17 20:03 GMT+01:00 Tom Lane :

> Pavel Stehule  writes:
> > when I read a blog
> >
> http://www.depesz.com/2015/12/14/waiting-for-9-6-psql-support-multiple-c-and-f-options-and-allow-mixing-them/
> > where is emulated dry-run mode, I though so we can implement it very
> > simply.
>
> Not one that is actually reliable.  All a script would have to do is
> include its own begin/commit commands, and it would override what you
> are talking about.  It's okay, in my opinion, if the -1 switch is just a
> half-baked "best effort" solution.  It's not okay to provide a --dry-run
> switch that is equally full of holes, because if someone were to actually
> rely on it to not execute the script, the possibility of an override would
> amount to a security bug.
>

My idea was to enforce the global transaction (-1) option and ensure STOP_ON_ERROR
mode (which cannot be changed later). Any inner COMMIT or ROLLBACK would have to be
disallowed (or ignored) - which can be a problem :(

And if all statements from the input stream are processed, then a ROLLBACK is
emitted, but the result is success.
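In other words, roughly the manual emulation from the blog, wrapped up by
psql itself - something like this sketch (hypothetical file name, and as
Tom notes a script that issues its own COMMIT defeats it):

    \set ON_ERROR_STOP on
    BEGIN;
    \i script_under_test.sql
    ROLLBACK;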

Pavel


>
> regards, tom lane
>


Re: [HACKERS] psql - -dry-run option

2015-12-17 Thread Joe Conway
On 12/17/2015 11:58 AM, Christopher Browne wrote:
> On 17 December 2015 at 14:16, Pavel Stehule  > wrote:
>> or different idea - just enforce syntax check without execution.
> 
> That seems pretty cool...  I'd find "syntax check without execution" to
> be pretty useful to test SQL (and especially DDL).

Like this?
https://github.com/jconway/pgsynck

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [HACKERS] psql - -dry-run option

2015-12-17 Thread Tom Lane
Pavel Stehule  writes:
> when I read a blog
> http://www.depesz.com/2015/12/14/waiting-for-9-6-psql-support-multiple-c-and-f-options-and-allow-mixing-them/
> where is emulated dry-run mode, I though so we can implement it very
> simply.

Not one that is actually reliable.  All a script would have to do is
include its own begin/commit commands, and it would override what you
are talking about.  It's okay, in my opinion, if the -1 switch is just a
half-baked "best effort" solution.  It's not okay to provide a --dry-run
switch that is equally full of holes, because if someone were to actually
rely on it to not execute the script, the possibility of an override would
amount to a security bug.

regards, tom lane




[HACKERS] broken autocomplete in head

2015-12-17 Thread Pavel Stehule
Hi

it looks like backslash is ignored, and then autocomplete for backslash
commands is off.

Regards

Pavel


Re: [HACKERS] broken autocomplete in head

2015-12-17 Thread Tom Lane
Pavel Stehule  writes:
> it looks like backslash is ignored, and then autocomplete for backslash
> commands is off.

Turns out adding rl_initialize() did that :-(.  Fixed now.

regards, tom lane




Re: [HACKERS] Remove array_nulls?

2015-12-17 Thread Robert Haas
On Wed, Dec 16, 2015 at 10:48 PM, Jim Nasby  wrote:
> IIUC, that means supporting backwards compat. GUCs for 10 years, which seems
> a bit excessive. Granted, that's about the worse-case scenario for what I
> proposed (ie, we'd still be supporting 8.0 stuff right now).

Not to me.  GUCs like array_nulls don't really cost much - there is no
reason to be in a hurry about removing them that I can see.
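(For anyone who hasn't run into it, all that GUC changes is the pre-8.2
parsing of unquoted NULL in array literals, e.g.:

    SET array_nulls = off;
    SELECT '{a,NULL,b}'::text[];   -- {a,"NULL",b}: the middle element is the string 'NULL'
    SET array_nulls = on;          -- the default since 8.2
    SELECT '{a,NULL,b}'::text[];   -- {a,NULL,b}: the middle element is a null

so keeping it around really is cheap.)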

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PoC] Asynchronous execution again (which is not parallel)

2015-12-17 Thread Mart Kelder
Hi Robert and others,

First, I currently don't know the PostgreSQL code well enough yet. I still
hope my thoughts are useful.

Robert Haas wrote:
> It is unclear to me how useful this is beyond ForeignScan, Gather, and
> Append.  MergeAppend's ordering constraint makes it less useful; we
> can asynchronously kick off the request for the next tuple before
> returning the previous one, but we're going to need to have that tuple
> before we can return the next one.  But it could be done.  It could
> potentially even be applied to seq scans or index scans using some set
> of asynchronous I/O interfaces, but I don't see how it could be
> applied to joins or aggregates, which typically can't really proceed
> until they get the next tuple.  They could be plugged into this
> interface easily enough but it would only help to the extent that it
> enabled asynchrony elsewhere in the plan tree to be pulled up towards
> the root.

As far as I understand, this comes down to a number of subplans. A subplan
can be asked to return a tuple directly or at some later point in time
(asynchronously). Conceptually, the subplan goes into a "tuples wanted" state
and starts the work that needs to be done to produce that tuple. Later, it
either returns the tuple or the message that there are no tuples.

I see opportunities to use this feature to make some query plans less memory
intensive without increasing the total amount of work to be done. I think
the same subplan can be placed at several places in the execution plan. If
the subplan ends with a tuple store, then when a row is requested it can
either return it from the store or generate it in the subplan. If both outputs
of the subplan request tuples at around the same rate, the tuple store stays
small, where currently we need to save all the tuples or generate the
tuples multiple times. I think this can be potentially beneficial for
CTEs.
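A toy example of the plan shape I mean, with one CTE consumed by two
readers (just an illustration, not from any patch):

    WITH t AS (
        SELECT i, md5(i::text) AS h
          FROM generate_series(1, 1000000) AS i
    )
    SELECT count(*)
      FROM t a
      JOIN t b ON a.i = b.i + 1;

Today the CTE is materialized once into a tuple store and both scans read
the whole thing; if both readers advanced at roughly the same pace, the
store would not need to hold all the rows at once.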

I also think the planner can predict what the chances are that a subplan can 
be reused. If both subplans are on a different side of a hash join, it can 
know that one output will be exhausted before the second output will request 
the first row. That would mean that everything needs to be stored. Also, 
not every callback needs to be invoked at the same rate: tuple storage might 
be avoided by choosing the callback to invoke wisely.

I am a little bit worried about the performance as a result of context 
switching. I think it is a good idea to only register the callback if it 
actually hits a point where the tuple can't be generated directly.

> Thoughts?

Regards,

Mart





Re: [HACKERS] PATCH: postpone building buckets to the end of Hash (in HashJoin)

2015-12-17 Thread Tomas Vondra

Hi,

On 12/17/2015 07:20 PM, Robert Haas wrote:
> ...
>
> If this doesn't regress performance in the case where the number of
> buckets is estimated accurately to begin with, then I think this is
> a great idea. Can you supply some performance tests results for that
> case, and maybe some of the other cases also?


I don't see how it could regress performance, and the benchmarks I've 
done confirm that. I'll do more thorough benchmarking and post the 
results here, but not now as this patch is in 2016-01 CF and I want to 
put all my time into reviewing patches from the open commitfest.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] broken autocomplete in head

2015-12-17 Thread Pavel Stehule
2015-12-17 22:56 GMT+01:00 Tom Lane :

> Pavel Stehule  writes:
> > it looks like backslash is ignored, and then autocomplete for backslash
> > commands is off.
>
> Turns out adding rl_initialize() did that :-(.  Fixed now.
>

Thank you

Regards

Pavel


>
> regards, tom lane
>


Re: [HACKERS] Comment typo in pg_upgrade.c

2015-12-17 Thread Amit Langote
On 2015/12/18 14:53, Amit Langote wrote:
> In prepare_new_cluster(), following looks like a typo:
> 
> - * would cause us to lose the frozenids restored by the load. We use
> + * would cause us to lose the frozenxids restored by the load. We use

Or maybe not, because "frozenids" is supposed to refer to both xids and mxids.

Sorry about the noise.

Thanks,
Amit







Re: [HACKERS] Freeze avoidance of very large table.

2015-12-17 Thread Michael Paquier
On Fri, Dec 18, 2015 at 3:17 AM, Masahiko Sawada  wrote:
> On Thu, Dec 17, 2015 at 11:47 AM, Michael Paquier  
> wrote:
>> make check with pg_upgrade is failing for me:
>> Visibility map rewriting test failed
>> make: *** [check] Error 1
>
> make check with pg_upgrade is done successfully on my environment.
> Could you give me more information about this?

Oh, well I see now after digging into your code. You are missing -X
when running psql, and until recently psql -c implied -X all the time.
The reason why it failed for me is that I have \timing enabled in
psqlrc.

Actually test.sh needs to be fixed as well, see the attached, this is
a separate bug. Could a kind committer check whether this is
acceptable?

>> Sawada-san, are you planning to continue working on that? At this
>> stage it seems that this patch is not in commitable state and should
>> at best be moved to next CF, or at worst returned with feedback.
>
> Yes, of course.
> This patch should be marked as "Move to next CF".

OK, done so.
-- 
Michael


Attachment: pgupgrade-fix.patch



Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2015-12-17 Thread Michael Paquier
On Fri, Dec 18, 2015 at 4:31 PM, Peter Geoghegan  wrote:
> On Thu, Dec 17, 2015 at 11:06 PM, Michael Paquier
>  wrote:
>> Now per the two points listed in
>> the first sentence in this paragraph, perhaps this opinion is fine as
>> moot :) I didn't get much involved in the development of this code
>> after all.
>
> I'm concerned that Heikki's apparent unavailability will become a
> blocker to this.

Yeah, not only this patch... The second committer with enough
background on the matter is Andres.
-- 
Michael




[HACKERS] Comment typo in pg_upgrade.c

2015-12-17 Thread Amit Langote
In prepare_new_cluster(), following looks like a typo:

- * would cause us to lose the frozenids restored by the load. We use
+ * would cause us to lose the frozenxids restored by the load. We use

Attached find the patch.

Thanks,
Amit
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 4a7281b..7f8b75f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -238,7 +238,7 @@ prepare_new_cluster(void)
 {
 	/*
 	 * It would make more sense to freeze after loading the schema, but that
-	 * would cause us to lose the frozenids restored by the load. We use
+	 * would cause us to lose the frozenxids restored by the load. We use
 	 * --analyze so autovacuum doesn't update statistics later
 	 */
 	prep_status("Analyzing all rows in the new cluster");



[HACKERS] Typo in the comment above heap_prepare_freeze_tuple()

2015-12-17 Thread Amit Langote
I think the following may be a typo:

  * Caller is responsible for ensuring that no other backend can access the
  * storage underlying this tuple, either by holding an exclusive lock on the
- * buffer containing it (which is what lazy VACUUM does), or by having it by
+ * buffer containing it (which is what lazy VACUUM does), or by having it be
  * in private storage (which is what CLUSTER and friends do).

If so, attached is the patch.

Thanks,
Amit
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ff7a41..3c2878b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6547,7 +6547,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
  *
  * Caller is responsible for ensuring that no other backend can access the
  * storage underlying this tuple, either by holding an exclusive lock on the
- * buffer containing it (which is what lazy VACUUM does), or by having it by
+ * buffer containing it (which is what lazy VACUUM does), or by having it be
  * in private storage (which is what CLUSTER and friends do).
  *
  * Note: it might seem we could make the changes without exclusive lock, since



Re: [HACKERS] pg_stat_replication log positions vs base backups

2015-12-17 Thread Michael Paquier
On Wed, Dec 16, 2015 at 9:35 PM, Michael Paquier
 wrote:
> On Wed, Dec 16, 2015 at 7:14 PM, Magnus Hagander  wrote:
>> On Wed, Dec 16, 2015 at 8:34 AM, Michael Paquier 
>> wrote:
>>> Interesting. I got just today a bug report that is actually a symptom
>>> that people should be careful about: it is possible that
>>> pg_stat_replication reports 1/potential for sync_priority/sync_state
>>> in the case of a WAL sender in "backup" state: a base backup just
>>> needs to reuse the shared memory slot of a standby that was previously
>>> connected. Commit 61c7bee of Magnus fixes the issue, just let's be
>>> careful if there are similar reports that do not include this fix.
>>
>>
>> Hmm. With the fix, it returns "async", right?
>
> Yes, it returns async with priority set at 0 after your commit. That's
> a bit better than the old behavior, still..
>
>> Perhaps it should return either "backup" or NULL, to be even more clear? And
>> with priority set to NULL?
>
> I'd vote for just NULL for both fields. This async/sync status has no
> real sense for a backup. Thoughts?

While looking again at that: treating a WAL sender that has an
invalid WAL flush position as always asynchronous, like a base backup
one, is directly mentioned in walsender.c, per 0c4b468:
+   /*
+    * Treat a standby such as a pg_basebackup background process
+    * which always returns an invalid flush location, as an
+    * asynchronous standby.
+    */
The point is just to be sure that such a node won't be selected as a
sync standby. It is surprising behavior, but after this many
years it may not be worth switching to NULL values.
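(For anyone wanting to check their own setup, the fields in question are
the usual ones in the view:

    SELECT application_name, state, sync_priority, sync_state
      FROM pg_stat_replication;

and a pg_basebackup run shows up there with state = 'backup'.)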

Fujii-san, thoughts? You introduced this behavior after all.
-- 
Michael




Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2015-12-17 Thread Michael Paquier
On Thu, Dec 17, 2015 at 4:55 PM, Peter Geoghegan  wrote:
> On Wed, Dec 16, 2015 at 11:44 PM, Peter Geoghegan  wrote:
>>> In any case, at this point 9.5 is really aimed to be stabilized, so
>>> targeting only master is a far saner approach IMO for this patch.
>>> Pushing that in 9.5 a couple of months back may have given enough
>>> reason to do so... But well life is life.
>>
>> No, this really isn't an optimization at all.
>
> I should add: I think that the chances of this patch destabilizing the
> code are very slim, once it receives the proper review. Certainly, I
> foresee no possible downside to not inserting the doomed IndexTuple,
> since it's guaranteed to have its heap tuple super-deleted immediately
> afterwards.

I am no committer, just a guy giving an opinion about this patch and I
have to admit that I have not enough studied the speculative insertion
code to have a clear technical point of view on the matter, but even
if the chances of destabilizing the code are slim, it does not seem a
wise idea to me to do that post-rc1 and before a GA as there are
people testing the code as it is now. Now per the two points listed in
the first sentence in this paragraph, perhaps this opinion is fine as
moot :) I didn't get much involved in the development of this code
after all.

> That's the only real behavioral change proposed here. So, I would
> prefer it if we got this in before the first stable release of 9.5.

Yeah, I got this one.
-- 
Michael




Re: [HACKERS] Making tab-complete.c easier to maintain

2015-12-17 Thread Michael Paquier
On Thu, Dec 17, 2015 at 6:04 PM, Thomas Munro
 wrote:
> On Thu, Dec 17, 2015 at 7:24 PM, Michael Paquier
>  wrote:
> Kyotaro's suggestion of using a macro NEG x to avoid complicating the
> string constants seems good to me.  But perhaps like this?
>
>   TailMatches4("COMMENT", "ON", MatchAny, MatchAnyExcept("IS"))
>
> See attached patch which does it that way.

Fine for me.

>>> Addition to that, I feel that successive "MatchAny"s are a bit
>>> bothersome.
>>>
  TailMatches6("COMMENT", "ON", MatchAny, MatchAny, MatchAny, MatchAny)) &&
  !TailMatches1("IS")
>>>
>>> Is MachAny acceptable? On concern is the two n's
>>> (TailMatches and MatchAny) looks a bit confising.
>>>
  TailMatches4("COMMENT", "ON", MatchAny3, "!IS")
>>
>> Ah, OK, so you would want to be able to have an inner list, MatchAnyN
>> meaning actually a list of MatchAny items repeated N times. I am not
>> sure if that's worth it.. I would just keep it simple, and we are just
>> discussing about a couple of places only that would benefit from that.
>
> +1 for simple.

+#define CompleteWithList10(s1, s2, s3, s4, s5, s6, s7, s8, s9, s10)\
+do { \
+   static const char *const list[] = { s1, s2, s3, s4, s5, s6,
s7, s8, s9, s10, NULL }; \
+   completion_charpp = list; \
+   completion_case_sensitive = false; \
+   matches = completion_matches(text, complete_from_list); \
+} while (0)
Actually we are not using this one. I am fine if this is kept, just
worth noting.

OK, I am marking that as ready for committer. Let's see what happens next.
Regards,
-- 
Michael




Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2015-12-17 Thread Peter Geoghegan
On Thu, Dec 17, 2015 at 11:06 PM, Michael Paquier
 wrote:
>> I should add: I think that the chances of this patch destabilizing the
>> code are very slim, once it receives the proper review. Certainly, I
>> foresee no possible downside to not inserting the doomed IndexTuple,
>> since it's guaranteed to have its heap tuple super-deleted immediately
>> afterwards.
>
> I am no committer, just a guy giving an opinion about this patch and I
> have to admit that I have not enough studied the speculative insertion
> code to have a clear technical point of view on the matter, but even
> if the chances of destabilizing the code are slim, it does not seem a
> wise idea to me to do that post-rc1 and before a GA as there are
> people testing the code as it is now.

You can express doubt about almost anything. In this case, you haven't
voiced any particular concern about any particular risk. The risk of
not proceeding is that 9.5 will remain in a divergent state for no
reason, with substantial differences in the documentation of the AM
interface, and that has a cost. Why should the risk of destabilizing
9.5 become more palatable when 9.5 has been out for 6 months or a
year? You can take it that this will probably be now-or-never.

This is mostly just comment changes, and changes to documentation. If
it comes down to it, we could leave the existing 9.5 code intact (i.e.
still unnecessarily insert the IndexTuple), while commenting that it
is unnecessary, and still changing everything else. That would have an
unquantifiably tiny risk, certainly smaller than the risk of
committing the patch as-is (which, to reiterate, is a risk that I
think is very small).

FWIW, I tend to doubt that users have tested speculative insertion/ON
CONFLICT much this whole time. There were a couple of bug reports, but
that's it.
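(By speculative insertion I mean the machinery underneath the new-in-9.5
syntax, e.g.:

    INSERT INTO upsert_tab (key, val)
    VALUES (1, 'foo')
    ON CONFLICT (key) DO UPDATE SET val = EXCLUDED.val;

with upsert_tab/key/val obviously just placeholder names.)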

> Now per the two points listed in
> the first sentence in this paragraph, perhaps this opinion is fine as
> moot :) I didn't get much involved in the development of this code
> after all.

I'm concerned that Heikki's apparent unavailability will become a
blocker to this.

-- 
Peter Geoghegan




Re: [HACKERS] Function and view to retrieve WAL receiver status

2015-12-17 Thread Robert Haas
On Mon, Dec 14, 2015 at 7:23 PM, Michael Paquier
 wrote:
> On Tue, Dec 15, 2015 at 5:27 AM, Gurjeet Singh wrote:
>> The function, maybe. But emitting an all-nulls row from a view seems
>> counter-intuitive, at least when looking at it in context of relational
>> database.
>
> OK, noted. Any other opinions?

I wouldn't bother with the view.  If we're going to do it, I'd say
just provide the function and let people SELECT * from it if they want
to.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] extend pgbench expressions with functions

2015-12-17 Thread Michael Paquier
On Thu, Dec 17, 2015 at 10:33 PM, Fabien COELHO  wrote:
>
>>> (2a) remove double stuff, just keep integer functions.
>>>  I would rather keep min/max, though.
>>
>>
>> (2a) sounds like a fine plan to get something committable. We could keep
>> min/max/abs, and remove sqrt/pi. What's actually the use case for debug? I
>> cannot wrap my mind on one?
>
>
> It was definitely useful to debug the double/int type stuff within
> expressions when writing a non trivial pgbench script. It is probably less
> interesting if there are only integers.

OK for me. Let's leave without it as well if you think it is not that
useful. It could still be introduced later on.
-- 
Michael




Re: [HACKERS] extend pgbench expressions with functions

2015-12-17 Thread Michael Paquier
On Fri, Dec 18, 2015 at 9:08 AM, Michael Paquier
 wrote:
> On Thu, Dec 17, 2015 at 10:33 PM, Fabien COELHO  wrote:
>>
 (2a) remove double stuff, just keep integer functions.
  I would rather keep min/max, though.
>>>
>>>
>>> (2a) sounds like a fine plan to get something committable. We could keep
>>> min/max/abs, and remove sqrt/pi. What's actually the use case for debug? I
>>> cannot wrap my mind on one?
>>
>>
>> It was definitely useful to debug the double/int type stuff within
>> expressions when writing a non trivial pgbench script. It is probably less
>> interesting if there are only integers.
>
> OK for me. Let's leave without it as well if you think it is not that
> useful. It could still be introduced later on.

s/leave/live. This morning is a bit hard...
-- 
Michael




[HACKERS] Small fix in pg_rewind (redundant declaration)

2015-12-17 Thread YUriy Zhuravlev
Hello hackers.
I've stumbled upon some strange code.
In src/bin/pg_rewind/datapagemap.h we declare:
extern void datapagemap_destroy(datapagemap_t *map);
but it is implemented nowhere. I think the declaration of this function should be
removed.
I'm not sure whether such a trivial thing needs a patch.

Thanks. 

-- 
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] extend pgbench expressions with functions

2015-12-17 Thread Michael Paquier
On Thu, Dec 17, 2015 at 6:02 PM, Fabien COELHO  wrote:
>> Really, I like this patch, and I think that it is great to be able to
>> use a set of functions and methods within a pgbench script because
>> this clearly can bring more clarity for a developer, still it seems
>> that this patch is trying to do too many things at the same time:
>> 1) Add infrastructure to support function calls and refactor the
>> existing functions to use it. This is the FUNCTION stuff in the
>> expression scanner.
>> 2) Add the concept of double return type, this could be an extension
>> of \set with a data type, or a new command like \setdouble. This
>> should include as well the casting routines I guess. This is the
>> DOUBLE stuff in the expression scanner.
>> 3) Introduce new functions, as needed.
>
>
> Yep. I started with (1), and was *ASKED* to do the others.

That's the beginning of the thread... N developers have (N+1) opinions
as a wise guy watching this mailing list said once.

>> 1) and 2) seem worthwhile to me. 3) may depend on the use cases we'd
>> like to have.. sqrt has for example value if a variable can be set
>> double as a double type.
>
>
> Sure. This is just an example of a double function so that if someone wants
> to add another one they can just update where it make sense. Maybe log/exp
> would make more sense for pgbench.

Perhaps..

>> In conclusion, for this CF the patch doing the documentation fixes is
>> "Ready for committer", the second patch introducing the function
>> infrastructure should focus more on its core structure and should be
>> marked as "Returned with feedback". Opinions are welcome.
>
>
> My opinion is that to do and undo work because of random thoughts on the
> list is tiring.
>
> What I can do is:
>
> (1) fix the doc and bugs if any, obviously.

Thanks.

> (2a) remove double stuff, just keep integer functions.
>  I would rather keep min/max, though.

(2a) sounds like a fine plan to get something committable. We could
keep min/max/abs, and remove sqrt/pi. What's actually the use case for
debug? I cannot wrap my mind on one?

> (2b) keep the patch with both int & double as is.

Functions returning double are not that useful... pgbench can live without them.

> What I will *NOT* do is to add double variables without a convincing use
> case.

I don't actually mean that you should do it. I scrutinized the
thread and this patch and thought that there could perhaps be some
use cases for double return values after seeing the cast functions. It
is not my intention to piss you off; I like this patch a lot, and you
have already proved that if someone wanted those additional
features, the infrastructure being put in place would allow doing
that :)
If the second patch gets into a simplified shape I'll look at it again,
and then let's move on with a committer's opinion.

Regards,
-- 
Michael




Re: [HACKERS] pg_tables bug?

2015-12-17 Thread Michael Paquier
On Thu, Dec 17, 2015 at 4:54 PM, Gaetano Mendola  wrote:
> I'm playing around with tablespaces (PostgreSQL 9.4) and I found out what I
> believe is a bug in pg_tables.
> Basically, if you create a database in a tablespace X and then you create a
> table in the database, the table is created correctly in the tablespace X (I
> did a check on the filesystem); however, if you do a select on pg_tables, the
> column tablespace for that table is empty, and even worse, if you dump the DB
> there is no report about the database or table being on that
> tablespace.
> Even \d doesn't report that the table is in the tablespace X.

Are you sure you created the table in a tablespace? See for example:
=# create tablespace popo location '/to/tbspace/path';
CREATE TABLESPACE
=# create table aa (a int) tablespace popo;
CREATE TABLE
=# \d aa
  Table "public.aa"
 Column |  Type   | Modifiers
+-+---
 a  | integer |
Tablespace: "popo"
=# select tablename, tablespace from pg_tables where tablename = 'aa';
 tablename | tablespace
---+
 aa| popo
(1 row)
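
For what it's worth, what Gaetano describes is probably the
default-tablespace case rather than a bug: when a table merely inherits the
database's default tablespace (even one chosen explicitly at CREATE
DATABASE time), pg_class.reltablespace is 0 and pg_tables shows an empty
tablespace column; only an explicit per-table TABLESPACE clause is
reported. A sketch, reusing the "popo" tablespace above (the database name
is made up):

=# create database tbdb tablespace popo;
CREATE DATABASE
=# \c tbdb
=# create table aa (a int);   -- no TABLESPACE clause, inherits the default
CREATE TABLE
=# \d aa
  Table "public.aa"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
=# select tablename, tablespace from pg_tables where tablename = 'aa';
 tablename | tablespace
-----------+------------
 aa        |
(1 row)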

Regards,
-- 
Michael




Re: [HACKERS] WIP: bloom filter in Hash Joins with batches

2015-12-17 Thread Simon Riggs
On 15 December 2015 at 22:30, Tomas Vondra wrote:

> 3) Currently the bloom filter is used whenever we do batching, but it
>    should really be driven by selectivity too - it'd be good to (a)
>    estimate the fraction of 'fact' tuples having a match in the hash
>    table, and not do bloom if it's over ~60% or so. Also, maybe
>    the code should count the matches at runtime, and disable the
>    bloom filter if we reach some threshold.
>

Cool results.

It seems a good idea to build the bloom filter always, then discard it if
it would be ineffective.

My understanding is that the bloom filter would be ineffective in any of
these cases
* Hash table is too small
* Bloom filter too large
* Bloom selectivity > 50% - perhaps that can be applied dynamically, so
stop using it if it becomes ineffective

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-12-17 Thread Pavel Stehule
2015-12-14 23:09 GMT+01:00 Daniel Verite :

> Pavel Stehule wrote:
>
> > postgres=# \crosstabview 4 +month label
> >
> > Maybe using optional int order column instead label is better - then you
> can
> > do sort on client side
> >
> > so the syntax can be "\crosstabview VCol [+/-]HCol [[+-]HOrderCol]
>
> In the meantime I've followed a different idea: allowing the
> vertical header to be sorted too, still server-side.
>
> That's because to me, the first impulse for a user noticing that
> it's not sorted vertically would be to write
>  \crosstabview +customer month
> rather than figure out the
>  \crosstabview customer +month_number month_name
> invocation.
> But both ways aren't even mutually exclusive. We could support
>  \crosstabview [+|-]colV[:labelV] [+|-]colH[:labelH]
> it's more complicated to understand, but not harder to implement.
>

Yes, I was able to do what I wanted - although the query was a little bit
strange:

 select amount, label, customer
 from (
     select sum(amount) as amount,
            extract(month from closed)::int - 1 as month,
            to_char(date_trunc('month', closed), 'TMmon') as label,
            customer
       from data
      group by customer, to_char(date_trunc('month', closed), 'TMmon'),
               extract(month from closed)
     union
     select sum(amount),
            extract(month from closed)::int - 1 as month,
            to_char(date_trunc('month', closed), 'TMmon') as label,
            ' TOTAL '
       from data
      group by to_char(date_trunc('month', closed), 'TMmon'),
               extract(month from closed)::int - 1
      order by month
 ) x

 \crosstabview +3 2


>
> Also, a non-zero FETCH_COUNT is supported by this version of the patch,
> if the first internal FETCH retrieves less than FETCH_COUNT rows.
> Otherwise a specific error is emitted.
>

good idea

>
> Also there are minor changes in arguments and callers following
> recent code changes for \o
>
> Trying to crosstab with 10k+ distinct values vertically, I've noticed
> that the current code is too slow, spending too much time
> sorting.  I'm currently replacing its simple arrays of distinct values
> with AVL binary trees, which I expect to be much more efficient for
> this.
>

Regards

Pavel


>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: [HACKERS] [patch] Proposal for \rotate in psql

2015-12-17 Thread Pavel Stehule
2015-12-14 23:15 GMT+01:00 Daniel Verite :

> Pavel Stehule wrote:
>
>
> > here is patch - supported syntax: \crosstabview VCol [+/-]HCol
> [HOrderCol]
> >
> > Order column should contain a numeric value. Values are sorted on the
> > client side
>
> If I understand correctly, I see a problem with HOrderCol.
>
> If the vertical header consists of, for example, a series of
> event names, and it should be sorted  by event date, then
> the requirement of HOrderCol being strictly numeric is
> problematic,  in a way that the previous proposal was not, isn't it?
>

I don't think so - if you are able to sort on the server side, then you can
use window functions and attach a number in the correct order.

But the situation is probably simpler - usually you are able to transform
the ordering field to a number, so you don't need window functions; the
transform function is enough.
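
For illustration, a sketch (using the "data" table from earlier in this
thread) of what "the transform function is enough" means here - extract()
already yields a numeric sort key, so no window function is needed to feed
the proposed HOrderCol:

  SELECT customer,
         sum(amount)                     AS amount,
         to_char(closed, 'TMmon')        AS month_name,
         extract(month FROM closed)::int AS month_num   -- numeric order column
    FROM data
   GROUP BY 1, 3, 4;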

Regards

Pavel


>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: [HACKERS] psql - -dry-run option

2015-12-17 Thread Christopher Browne
On 17 December 2015 at 14:16, Pavel Stehule  wrote:
> or different idea - just enforce syntax check without execution.

That seems pretty cool...  I'd find "syntax check without execution" to be
pretty useful to test SQL (and especially DDL).

--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: [HACKERS] psql - -dry-run option

2015-12-17 Thread Tom Lane
Christopher Browne  writes:
> On 17 December 2015 at 14:16, Pavel Stehule  wrote:
>> or different idea - just enforce syntax check without execution.

> That seems pretty cool...  I'd find "syntax check without execution" to be
> pretty useful to test SQL (and especially DDL).

If it didn't execute the DDL at all, I doubt it would be that useful ---
you could not test any statements that depended on earlier statements.
Moreover, people might get surprised by error checks that they expect
to get reported by the "syntax check" but actually are not made until
runtime.  There's a lot of behavior there that's currently just
implementation detail but would become user-visible, depending on just
where the syntax check stops processing.

So what you want I think is something that *does* execute everything,
but within a single transaction that is guaranteed not to get committed.
A bulletproof version of that would likely need to be implemented on the
server side, not with some psql hack.

Whether we really need a feature like that isn't clear though; it's not
like it's hard to test things that way now.  Stick in a BEGIN with no
COMMIT, you're there.  The problem only comes in if you start expecting
the behavior to be bulletproof.  Maybe I'm being too pessimistic about
what people would believe a --dry-run switch to be good for ... but
I doubt it.
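
For illustration, the manual recipe described above as a psql sketch (the
script name is hypothetical, and this is only as trustworthy as the
script's own lack of COMMIT/ROLLBACK statements):

  \set ON_ERROR_STOP on
  BEGIN;
  \i migration_under_test.sql
  ROLLBACK;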

regards, tom lane




Re: [HACKERS] parallel joins, and better parallel explain

2015-12-17 Thread Robert Haas
On Thu, Dec 17, 2015 at 12:33 AM, Amit Kapila  wrote:
> While looking at plans of Q5 and Q7, I have observed that Gather is
> pushed below another Gather node, for which we don't have an appropriate
> way of dealing.  I think that could be the reason why you are seeing
> the errors.

Uh oh.  That's not supposed to happen.  A GatherPath is supposed to
have parallel_safe = false, which should prevent the planner from
using it to form new partial paths.  Is this with the latest version
of the patch?  The plan output suggests that we're somehow reaching
try_partial_hashjoin_path() with inner_path being a GatherPath, but I
don't immediately see how that's possible, because
create_gather_path() sets parallel_safe to false unconditionally, and
hash_inner_and_outer() never sets cheapest_safe_inner to a path unless
that path is parallel_safe.

Do you have a self-contained test case that reproduces this, or any
insight as to how it's happening here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Small fix in pg_rewind (redundant declaration)

2015-12-17 Thread Michael Paquier
On Fri, Dec 18, 2015 at 10:22 AM, Tom Lane  wrote:
> Done.

Thanks.
-- 
Michael




Re: [HACKERS] Small fix in pg_rewind (redundant declaration)

2015-12-17 Thread Michael Paquier
On Thu, Dec 17, 2015 at 9:23 PM, YUriy Zhuravlev
 wrote:
> Hello hackers.
> I've stumbled upon some strange code.
> In src/bin/pg_rewind/datapagemap.h we declare:
> extern void datapagemap_destroy(datapagemap_t *map);
> But it is not implemented anywhere. I think the declaration of this function
> must be removed.
> I'm not sure that such a trivial thing needs a patch.

Nicely noticed. That's a bug present as well in the 9.3 and 9.4
versions of pg_rewind (just fixed it there). datapagemap_create can be
additionally removed.
-- 
Michael




Re: [HACKERS] Small fix in pg_rewind (redundant declaration)

2015-12-17 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Dec 17, 2015 at 9:23 PM, YUriy Zhuravlev
>  wrote:
>> Hello hackers.
>> I've stumbled upon some strange code.
>> In src/bin/pg_rewind/datapagemap.h we declare:
>> extern void datapagemap_destroy(datapagemap_t *map);
>> But it is not implemented anywhere. I think the declaration of this function
>> must be removed.
>> I'm not sure that such a trivial thing needs a patch.

> Nicely noticed. That's a bug present as well in the 9.3 and 9.4
> versions of pg_rewind (just fixed it there). datapagemap_create can be
> additionally removed.

Done.

regards, tom lane




Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates

2015-12-17 Thread Peter Geoghegan
On Wed, Dec 16, 2015 at 12:04 PM, Peter Geoghegan  wrote:
>> What kind of state is that?  Can't we define this in terms of what it
>> is rather than how it gets that way?
>
> It's zeroed.
>
> I guess we can update everything, including existing comments, to reflect 
> that.

Attached revision updates both the main commit (the optimization), and
the backpatch commit (updated the contract).

Changes:

* Changes commit intended for backpatch to 9.5 to use new "zeroed"
terminology (not "void" representation).

* Puts fix within mergerun() (for this patch) in the same backpatch
commit. We might as well keep this consistent, even though the
correctness of 9.5 does not actually hinge upon having this change --
there is zero risk of breaking something in 9.5, and avoiding
backpatching this part can only lead to confusion in the future.

* Some minor tweaks to tuplesort.c. Make sure that
tuplesort_putdatum() zeroes datum1 directly for NULL values. Comment
tweaks.

* Add comment to the datum case, discussing restriction there.

I was wrong when I said that the datum sort case currently tacitly
acknowledges as possible/useful something that the revised SortSupport
contract will forbid (I wrote that e-mail in haste).

What I propose to make the SortSupport contract forbid here is
abbreviated keys that are *themselves* pass-by-reference. It is true
that today, we do not support pass-by-value types with abbreviated
keys for the datum case only (such opclass support could be slightly
useful for float8, for example -- int8 comparisons of an alternative
representation might be decisively cheaper). But that existing
restriction is entirely distinct from the new restriction I propose to
create. It seems like this distinction should be pointed out in
passing, which I've done.

-- 
Peter Geoghegan
From c740eac2a4ba42ac2b5ab34c342534ff8d944666 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Mon, 6 Jul 2015 13:37:26 -0700
Subject: [PATCH 2/2] Reuse abbreviated keys in ordered [set] aggregates

When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.  Only strict abbreviated key binary inequality is considered,
which is safe.
---
 src/backend/catalog/index.c            |  2 +-
 src/backend/executor/nodeAgg.c         | 20 
 src/backend/executor/nodeSort.c        |  2 +-
 src/backend/utils/adt/orderedsetaggs.c | 33 +
 src/backend/utils/sort/tuplesort.c     | 44 --
 src/include/utils/tuplesort.h          |  4 ++--
 6 files changed, 79 insertions(+), 26 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c10be3d..c6737c2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3073,7 +3073,7 @@ validate_index_heapscan(Relation heapRelation,
 			}
 
 			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-  &ts_val, &ts_isnull);
+  &ts_val, &ts_isnull, NULL);
 			Assert(tuplesort_empty || !ts_isnull);
 			if (!tuplesort_empty)
 			{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..2972180 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -539,7 +539,8 @@ fetch_input_tuple(AggState *aggstate)
 
 	if (aggstate->sort_in)
 	{
-		if (!tuplesort_gettupleslot(aggstate->sort_in, true, aggstate->sort_slot))
+		if (!tuplesort_gettupleslot(aggstate->sort_in, true,
+	aggstate->sort_slot, NULL))
 			return NULL;
 		slot = aggstate->sort_slot;
 	}
@@ -894,8 +895,8 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
  * The one-input case is handled separately from the multi-input case
  * for performance reasons: for single by-value inputs, such as the
  * common case of count(distinct id), the tuplesort_getdatum code path
- * is around 300% faster.  (The speedup for by-reference types is less
- * but still noticeable.)
+ * is around 300% faster.  (The speedup for by-reference types without
+ * abbreviated key support is less but still noticeable.)
  *
  * This function handles only one grouping set (already set in
  * aggstate->current_set).
@@ -913,6 +914,8 @@ process_ordered_aggregate_single(AggState *aggstate,
 	MemoryContext workcontext = aggstate->tmpcontext->ecxt_per_tuple_memory;
 	MemoryContext oldContext;
 	bool		isDistinct = (pertrans->numDistinctCols > 0);
+	Datum		newAbbrevVal = (Datum) 0;
+	Datum		oldAbbrevVal = (Datum) 0;
 	FunctionCallInfo fcinfo = &pertrans->transfn_fcinfo;
 	Datum	   *newVal;
 	bool	   *isNull;
@@ -932,7 +935,7 @@ process_ordered_aggregate_single(AggState *aggstate,
 	 */
 
 	while (tuplesort_getdatum(pertrans->sortstates[aggstate->current_set],
-			  true, newVal, isNull))
+			  true, newVal, isNull, &newAbbrevVal))
 	{
 		/*
 		 * Clear and select the working 

[HACKERS] [PATCH] Copy-pasteo in logical decoding

2015-12-17 Thread Craig Ringer
Trivial fix for a copy-and-paste error in a logical decoding error callback.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 1ce9081..2e14a3e 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -683,7 +683,7 @@ filter_by_origin_cb_wrapper(LogicalDecodingContext *ctx, RepOriginId origin_id)
 
/* Push callback + info on the error context stack */
state.ctx = ctx;
-   state.callback_name = "shutdown";
+   state.callback_name = "filter_by_origin";
state.report_location = InvalidXLogRecPtr;
errcallback.callback = output_plugin_error_callback;
errcallback.arg = (void *) &state;



Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2015-12-17 Thread Aleksander Alekseev
> Oh, that's an interesting idea.  I guess the problem is that if the
> freelist is unshared, then users might get an error that the lock
> table is full when some other partition still has elements remaining.

True, but I don't believe it should be a real problem assuming we have
a strong hash function. If a user gets such an error it means that the
database is under heavy load and there are not many free elements left
in the other freelists either. This particular user didn't get lucky,
some other will. Anyway, the administrator should do something about it -
fight a DDoS attack, tune database parameters, etc.

> 3.5-4 more TPS, or 3.5 times more TPS?  Can you share the actual
> numbers?

Oops, 3.5-4 _times_ more TPS, i.e. 2206 TPS vs 546 TPS.




Re: [HACKERS] Relation extension scalability

2015-12-17 Thread Dilip Kumar
On Sun, Jul 19, 2015 at 9:37 PM, Andres wrote:

> The situation the read() protect us against is that two backends try to
> extend to the same block, but after one of them succeeded the buffer is
> written out and reused for an independent page. So there is no in-memory
> state telling the slower backend that that page has already been used.

I was looking into this patch, and have done some performance testing.

Currently I have done the testing on my local machine; later I will run it
on a big machine once I get access to one.

Just wanted to share the current results I get on my local machine.
Machine conf: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 8 cores and 16GB of
RAM.

Test Script:
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
1) g(i)) TO '/tmp/copybinarywide' WITH BINARY";

./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

*Summary of the results:*
1. When the data fits into shared buffers the improvement is not visible,
but when it doesn't, some improvement is visible on my local machine (it
still does not seem to be scaling; maybe we will see different behaviour on
a big machine). That's because in the first case there is no need to flush
buffers out.

2. As per Tom's analysis, since we are doing an extra read it will reduce
performance at lower client counts, where the RelationExtensionLock is not
the bottleneck, and the same is visible in the test results.

As discussed earlier, what about keeping the RelationExtensionLock as it is
and just doing the victim buffer search and buffer flushing outside the
lock? That way we can save the extra read. Correct me if I have missed
something here.


Shared Buffer 512 MB
--------------------
Client   TPS Base   TPS Patch
1             145         126
2             211         246
4             248         302
8             225         234


Shared Buffer 5GB
-----------------
Client   TPS Base   TPS Patch
1             165         156
2             237         244
4             294         296
8             253         247


Also, I observed one problem with the patch:

@@ -433,63 +434,50 @@ RelationGetBufferForTuple(Relation relation, Size len,
+ while(true)
+ {
+ buffer = ExtendRelation(relation, MAIN_FORKNUM, bistate->strategy);

bistate can be NULL in the direct insert case (as opposed to the COPY case).


Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


On Sun, Jul 19, 2015 at 9:37 PM, Andres Freund  wrote:

> On 2015-07-19 11:56:47 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2015-07-19 11:28:25 -0400, Tom Lane wrote:
> > >> At this point session 1 will go and create page 44, won't it, and you
> > >> just wasted a page.
> >
> > > My local code now recognizes that case and uses the page. We just need
> > > to do an PageIsNew().
> >
> > Er, what?  How can you tell whether an all-zero page was or was not
> > just written by another session?
>
> The check is only done while holding the io lock on the relevant page
> (have to hold that anyway), after reading it in ourselves, just before
> setting BM_VALID. As we only can get to that point when there wasn't any
> other entry for the page in the buffer table, that guarantees that no
> other backend is currently expanding into that page. Others might
> wait to read it, but those'll wait behind the IO lock.
>
>
> The situation the read() protect us against is that two backends try to
> extend to the same block, but after one of them succeeded the buffer is
> written out and reused for an independent page. So there is no in-memory
> state telling the slower backend that that page has already been used.
>
> Andres
>
>