Re: [HACKERS] Parallel bitmap heap scan

2016-10-19 Thread Amit Kapila
On Wed, Oct 19, 2016 at 9:23 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Wed, Oct 19, 2016 at 12:39 PM, Andres Freund <and...@anarazel.de> wrote:
>> Try measuring with something more heavy on bitmap scan time
>> itself. E.g.
>> SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= 
>> '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
>> or similar.  The tpch queries don't actually spend that much time in the
>> bitmapscan itself - the parallization of the rest of the query is what
>> matters...
>
> Yeah, I agree.
>
> I have tested with this query; with the exact filter condition it was
> choosing a parallel sequential scan, so I modified the filter a bit and
> tested.
>
> Tested with all default configuration on my local machine. I think I
> will generate more such test cases and do detailed testing on my
> performance machine.
>
>
> Explain Analyze results:
> -
> On Head:
> 
> postgres=# explain analyze  SELECT SUM(l_extendedprice) FROM lineitem
> WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
> '1996-03-31'::date);
>
>  QUERY PLAN
> ---
>  Aggregate  (cost=848805.90..848805.91 rows=1 width=32) (actual
> time=12440.165..12440.166 rows=1 loops=1)
>->  Bitmap Heap Scan on lineitem  (cost=143372.40..834833.25
> rows=5589057 width=8) (actual time=1106.217..11183.722 rows=5678841
> loops=1)
>  Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND
> (l_shipdate <= '1996-03-31'::date))
>  Rows Removed by Index Recheck: 20678739
>  Heap Blocks: exact=51196 lossy=528664
>  ->  Bitmap Index Scan on idx_lineitem_shipdate
> (cost=0.00..141975.13 rows=5589057 width=0) (actual
> time=1093.376..1093.376 rows=5678841 loops=1)
>Index Cond: ((l_shipdate >= '1995-01-01'::date) AND
> (l_shipdate <= '1996-03-31'::date))
>  Planning time: 0.185 ms
>  Execution time: 12440.819 ms
> (9 rows)
>
> After Patch:
> ---
> postgres=# explain analyze  SELECT SUM(l_extendedprice) FROM lineitem
> WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
> '1996-03-31'::date);
>
>QUERY PLAN
> ------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=792751.16..792751.17 rows=1 width=32)
> (actual time=6660.157..6660.157 rows=1 loops=1)
>->  Gather  (cost=792750.94..792751.15 rows=2 width=32) (actual
> time=6659.378..6660.117 rows=3 loops=1)
>  Workers Planned: 2
>  Workers Launched: 2
>  ->  Partial Aggregate  (cost=791750.94..791750.95 rows=1
> width=32) (actual time=6655.941..6655.941 rows=1 loops=3)
>->  Parallel Bitmap Heap Scan on lineitem
> (cost=143372.40..785929.00 rows=2328774 width=8) (actual
> time=1980.797..6204.232 rows=1892947 loops=
> 3)
>  Recheck Cond: ((l_shipdate >= '1995-01-01'::date)
> AND (l_shipdate <= '1996-03-31'::date))
>  Rows Removed by Index Recheck: 6930269
>  Heap Blocks: exact=17090 lossy=176443
>  ->  Bitmap Index Scan on idx_lineitem_shipdate
> (cost=0.00..141975.13 rows=5589057 width=0) (actual
> time=1933.454..1933.454 rows=5678841
> loops=1)
>Index Cond: ((l_shipdate >=
> '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date))
>  Planning time: 0.207 ms
>  Execution time: 6669.195 ms
> (13 rows)
>
>
> Summary:
> -> With the patch, overall execution is 2 times faster compared to head.
> -> Bitmap creation with the patch is a bit slower compared to head, and
> that's because of DHT vs. the efficient hash table.
>

I think the impact of the slowness of the Bitmap Index Scan is not very
visible here, as the time it takes is small compared to the overall time.
However, I think there is an advantage to using DHT, as that will allow
us to build the hash table with multiple workers using a parallel index
scan in the future.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Avoiding pin scan during btree vacuum

2016-10-19 Thread Amit Kapila
On Thu, Oct 20, 2016 at 4:00 AM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> >> This seems like a mighty subtle thing to backpatch. If we really want to
>> >> go there, ISTM that the relevant code should stew in an unreleased
>> >> branch for a while, before being backpatched.
>> >
>> > I'm definitely -1 on back-patching such a thing.  Put it in HEAD for
>> > awhile.  If it survives six months or so then we could discuss it again.
>>
>> I agree with Tom.
>
> Okay, several months have passed with this in the development branch and
> now seems a good time to backpatch this all the way back to 9.4.
>

Are you talking about commit
3e4b7d87988f0835f137f15f5c1a40598dd21f3d?  If yes, do we want to
retain this code in its current form under the UNUSED define, and is
there any advantage to doing so?  Another point: won't this commit make
the information in xl_btree_vacuum redundant, so shouldn't we remove
its usage during WAL writing as well?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Index Scans

2016-10-19 Thread Amit Kapila
On Thu, Oct 20, 2016 at 7:39 AM, Peter Geoghegan <p...@heroku.com> wrote:
> On Mon, Oct 17, 2016 at 8:08 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Create Index  With (parallel_workers = 4);
>>
>> If above syntax looks sensible, then we might need to think what
>> should be used for parallel index build.  It seems to me that parallel
>> tuple sort patch [1] proposed by Peter G. is using above syntax for
>> getting the parallel workers input from user for parallel index
>> builds.
>
> Apparently you see a similar issue with other major database systems,
> where similar storage parameter things are kind of "overloaded" like
> this (they are used by both index creation, and by the optimizer in
> considering whether it should use a parallel index scan). That can be
> a kind of a gotcha for their users, but maybe it's still worth it.
>

I have also checked and found that you are right.  In SQL Server, they
use a max degree of parallelism (MAXDOP) parameter, which I think is
common for all SQL statements.

> In
> any case, the complaints I saw about that were from users who used
> parallel CREATE INDEX with the equivalent of my parallel_workers index
> storage parameter, and then unexpectedly found this also forced the
> use of parallel index scan. Not the other way around.
>

I can understand that it can be confusing to users, so another option
could be to provide separate parameters like parallel_workers_build
and parallel_workers, where the first can be used for index builds and
the second for scans.  My personal opinion is to have one parameter,
so that users have one less thing to learn about parallelism.

> Ideally, the parallel_workers storage parameter will rarely be
> necessary because the optimizer will generally do the right thing in
> all cases.
>

Yeah, we can choose not to provide any parameter for parallel index
scans, but some users might want a parameter similar to the one for
parallel table scans, so it could be handy for them.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-18 Thread Amit Kapila
On Wed, Oct 19, 2016 at 5:57 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Tue, Oct 18, 2016 at 10:52 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>>> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila <amit.kapil...@gmail.com> 
>>>> wrote:
>>>>> I think one way to avoid the risk of deadlock in above scenario is to
>>>>> take the cleanup lock conditionally, if we get the cleanup lock then
>>>>> we will delete the items as we are doing in patch now, else it will
>>>>> just mark the tuples as dead and ensure that it won't try to remove
>>>>> tuples that are moved-by-split.  Now, I think the question is how will
>>>>> these dead tuples be removed.  We anyway need a separate mechanism to
>>>>> clear dead tuples for hash indexes as during scans we are marking the
>>>>> tuples as dead if corresponding tuple in heap is dead which are not
>>>>> removed later.  This is already taken care in btree code via
>>>>> kill_prior_tuple optimization.  So I think clearing of dead tuples can
>>>>> be handled by a separate patch.
>>>>
>>>> That seems like it could work.
>>>
>>> I have implemented this idea and it works for MVCC scans.  However, I
>>> think this might not work for non-MVCC scans.  Consider a case where
>>> in Process-1, the hash scan has returned one row and, before it could
>>> check its validity in the heap, vacuum marks that tuple as dead,
>>> removes the entry from the heap, and some new tuple is placed at that
>>> offset in the heap.
>>
>> Oops, that's bad.
>>
>>> Now when Process-1 checks the validity in the heap, it will check a
>>> different tuple than the one the index tuple was supposed to check.
>>> If we want, we can make it work similarly to what btree does, as being
>>> discussed on thread [1], but for that we need to introduce page-scan
>>> mode in hash indexes as well.  However, do we really want to solve
>>> this problem as part of this patch when it exists for other index AMs
>>> as well?
>>
>> For what other index AM does this problem exist?
>>
>
> By this problem, I mean the deadlocks for suspended scans, which can
> happen in btree for non-MVCC or other types of scans where we don't
> release the pin during the scan.  In my mind, we have the below options:
>
> a. The problem of deadlocks for suspended scans should be tackled as a
> separate patch, as it exists for other indexes (at least for some types
> of scans).
> b. Implement page-scan mode, and then we won't have the deadlock
> problem for MVCC scans.
> c. Let's not care about non-MVCC scans unless we have some way to hit
> those for hash indexes, and proceed with the dead-tuple marking idea.  I
> think even if we don't care about non-MVCC scans, we might hit this
> problem (deadlocks) when the index relation is unlogged.
>

Oops, my last sentence is wrong. What I wanted to say is: "I think
even if we don't care about non-MVCC scans, we might hit the problem of
TID reuse when the index relation is unlogged."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-18 Thread Amit Kapila
On Tue, Oct 18, 2016 at 10:52 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila <amit.kapil...@gmail.com> 
>>> wrote:
>>>> I think one way to avoid the risk of deadlock in above scenario is to
>>>> take the cleanup lock conditionally, if we get the cleanup lock then
>>>> we will delete the items as we are doing in patch now, else it will
>>>> just mark the tuples as dead and ensure that it won't try to remove
>>>> tuples that are moved-by-split.  Now, I think the question is how will
>>>> these dead tuples be removed.  We anyway need a separate mechanism to
>>>> clear dead tuples for hash indexes as during scans we are marking the
>>>> tuples as dead if corresponding tuple in heap is dead which are not
>>>> removed later.  This is already taken care in btree code via
>>>> kill_prior_tuple optimization.  So I think clearing of dead tuples can
>>>> be handled by a separate patch.
>>>
>>> That seems like it could work.
>>
>> I have implemented this idea and it works for MVCC scans.  However, I
>> think this might not work for non-MVCC scans.  Consider a case where
>> in Process-1, the hash scan has returned one row and, before it could
>> check its validity in the heap, vacuum marks that tuple as dead,
>> removes the entry from the heap, and some new tuple is placed at that
>> offset in the heap.
>
> Oops, that's bad.
>
>> Now when Process-1 checks the validity in the heap, it will check a
>> different tuple than the one the index tuple was supposed to check.
>> If we want, we can make it work similarly to what btree does, as being
>> discussed on thread [1], but for that we need to introduce page-scan
>> mode in hash indexes as well.  However, do we really want to solve
>> this problem as part of this patch when it exists for other index AMs
>> as well?
>
> For what other index AM does this problem exist?
>

By this problem, I mean the deadlocks for suspended scans, which can
happen in btree for non-MVCC or other types of scans where we don't
release the pin during the scan.  In my mind, we have the below options:

a. The problem of deadlocks for suspended scans should be tackled as a
separate patch, as it exists for other indexes (at least for some types
of scans).
b. Implement page-scan mode, and then we won't have the deadlock
problem for MVCC scans.
c. Let's not care about non-MVCC scans unless we have some way to hit
those for hash indexes, and proceed with the dead-tuple marking idea.  I
think even if we don't care about non-MVCC scans, we might hit this
problem (deadlocks) when the index relation is unlogged.

Here, even if we want to go with (b), I think we can handle it in a
separate patch, unless you think otherwise.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Index Scans

2016-10-18 Thread Amit Kapila
On Tue, Oct 18, 2016 at 4:08 PM, Rahila Syed <rahilasye...@gmail.com> wrote:
>>Another point which needs some thoughts is whether it is good idea to
>>use index relation size to calculate parallel workers for index scan.
>>I think ideally for index scans it should be based on number of pages
>>to be fetched/scanned from index.
> IIUC, it's not possible to know the exact number of pages scanned from
> an index in advance.

We can't find the exact number of index pages to be scanned, but I
think we can find the estimated number of pages to be fetched (refer to
cost_index).
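
For illustration, below is a rough sketch of the kind of estimate the
planner already makes in genericcostestimate() (which most index AMs'
cost functions rely on); the function and variable names here are
approximations for discussion, not a verbatim copy of the PostgreSQL
code:

#include <math.h>

/*
 * Sketch: estimate the number of index pages fetched by scaling the index
 * size by the fraction of index tuples the scan is expected to visit.
 */
static double
estimate_index_pages_fetched(double numIndexTuples, /* selectivity * index->tuples */
                             double indexTuples,    /* total tuples in the index */
                             double indexPages)     /* total pages in the index */
{
    if (indexPages > 1 && indexTuples > 1)
        return ceil(numIndexTuples * indexPages / indexTuples);
    return 1.0;
}

An estimate along these lines, rather than the raw index relation size,
could then feed the worker-count heuristic.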

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-18 Thread Amit Kapila
On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> I think one way to avoid the risk of deadlock in above scenario is to
>> take the cleanup lock conditionally, if we get the cleanup lock then
>> we will delete the items as we are doing in patch now, else it will
>> just mark the tuples as dead and ensure that it won't try to remove
>> tuples that are moved-by-split.  Now, I think the question is how will
>> these dead tuples be removed.  We anyway need a separate mechanism to
>> clear dead tuples for hash indexes as during scans we are marking the
>> tuples as dead if corresponding tuple in heap is dead which are not
>> removed later.  This is already taken care in btree code via
>> kill_prior_tuple optimization.  So I think clearing of dead tuples can
>> be handled by a separate patch.
>
> That seems like it could work.
>

I have implemented this idea and it works for MVCC scans.  However, I
think this might not work for non-MVCC scans.  Consider a case where
in Process-1, the hash scan has returned one row and, before it could
check its validity in the heap, vacuum marks that tuple as dead,
removes the entry from the heap, and some new tuple is placed at that
offset in the heap.  Now when Process-1 checks the validity in the
heap, it will check a different tuple than the one the index tuple was
supposed to check.  If we want, we can make it work similarly to what
btree does, as being discussed on thread [1], but for that we need to
introduce page-scan mode in hash indexes as well.  However, do we
really want to solve this problem as part of this patch when it exists
for other index AMs as well?


[1]  - 
https://www.postgresql.org/message-id/CACjxUsNtBXe1OfRp%3DacB%2B8QFAVWJ%3Dnr55_HMmqQYceCzVGF4tQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Index Scans

2016-10-17 Thread Amit Kapila
On Thu, Oct 13, 2016 at 8:48 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> As of now, the driving table for parallel query is accessed by
> parallel sequential scan which limits its usage to a certain degree.
> Parallelising index scans would further increase the usage of parallel
> query in many more cases.  This patch enables the parallelism for the
> btree scans.  Supporting parallel index scan for other index types
> like hash, gist, spgist can be done as separate patches.
>

I would like to get input on the method of selecting parallel workers
for scanning an index.  Currently the patch selects the number of
workers based on the size of the index relation, and the upper limit of
parallel workers is max_parallel_workers_per_gather.  This is quite
similar to what we do for parallel sequential scans, except that for a
parallel seq. scan we use the parallel_workers option if it is provided
by the user during Create Table.  The user can provide the
parallel_workers option as below:

Create Table  With (parallel_workers = 4);

Is it desirable to have a similar option for parallel index scans, and
if so, what should the interface be?  One possible way could be to
allow the user to provide it during Create Index as below:

Create Index  With (parallel_workers = 4);

If the above syntax looks sensible, then we might need to think about
what should be used for parallel index builds.  It seems to me that the
parallel tuple sort patch [1] proposed by Peter G. is using the above
syntax to get the parallel workers input from the user for parallel
index builds.

Another point which needs some thought is whether it is a good idea to
use the index relation size to calculate parallel workers for an index
scan.  I think ideally for index scans it should be based on the number
of pages to be fetched/scanned from the index.


[1] - 
https://www.postgresql.org/message-id/CAM3SWZTmkOFEiCDpUNaO4n9-1xcmWP-1NXmT7h0Pu3gM2YuHvg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-17 Thread Amit Kapila
On Thu, Oct 13, 2016 at 12:35 AM, Peter Geoghegan <p...@heroku.com> wrote:
> On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas <robertmh...@gmail.com> wrote:
>
>> On the flip side, what if anything can queries hope to get out of
>> parallel sort that they can't get out of Gather Merge?  One
>> possibility is that a parallel sort might end up being substantially
>> faster than Gather Merge-over-non-parallel sort.  In that case, we
>> obviously should prefer it.
>
> I must admit that I don't know enough about it to comment just yet.
> Offhand, it occurs to me that the Gather Merge sorted input could come
> from a number of different types of paths/nodes, whereas adopting what
> I've done here could only work more or less equivalently to "Gather
> Merge -> Sort -> Seq Scan" -- a special case, really.
>
>> For example, it's possible that you might want to have all
>> workers participate in sorting some data and then divide the result of
>> the sort into equal ranges that are again divided among the workers,
>> or that you might want all workers to sort and then each worker to
>> read a complete copy of the output data.  But these don't seem like
>> particularly mainstream needs, nor do they necessarily seem like
>> problems that parallel sort itself should be trying to solve.
>
> This project of mine is about parallelizing tuplesort.c, which isn't
> really what you want for parallel query -- you shouldn't try to scope
> the problem as "make the sort more scalable using parallelism" there.
> Rather, you want to scope it at "make the execution of the entire
> query more scalable using parallelism", which is really quite a
> different thing, which necessarily involves the executor having direct
> knowledge of partition boundaries.
>

Okay, but what is the proof, or why do you think the second is going to
be better than the first?  One thing which strikes me as a major
difference between your approach and Gather Merge is that in your
approach the leader has to wait till all the workers are done with
their sorting, whereas with Gather Merge, as soon as the first one is
done, the leader starts merging.  I could be wrong here, but if I
understood it correctly, then there is an argument that a Gather
Merge-kind of approach can win in cases where some of the workers can
produce sorted output ahead of others, and I am not sure if we can
dismiss such cases.

+struct Sharedsort
+{
..
+ * Workers increment workersFinished to indicate having finished.  If
+ * this is equal to state.launched within the leader, leader is ready
+ * to merge runs.
+ *
+ * leaderDone indicates if leader is completely done (i.e., was
+ * tuplesort_end called against the state through which parallel output
+ * was consumed?)
+ */
+ int currentWorker;
+ int workersFinished;
..
}

By looking at the 'workersFinished' usage, it looks like you have
devised a new way for the leader to know when workers have finished,
which might be required for this patch.  However, have you tried to
use, or investigated whether, the existing infrastructure that serves
the same purpose could be used for it?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Gather Merge

2016-10-17 Thread Amit Kapila
On Wed, Oct 5, 2016 at 11:35 AM, Rushabh Lathia
<rushabh.lat...@gmail.com> wrote:
> Hi hackers,
>
> Attached is the patch to implement Gather Merge.
>

Couple of review comments:

1.
ExecGatherMerge()
{
..
+ /* No workers? Then never mind. */
+ if (!got_any_worker || node->nreaders < 2)
+ {
+     ExecShutdownGatherMergeWorkers(node);
+     node->nreaders = 0;
+ }

Are you planning to restrict the use of gather merge based on the
number of workers?  If there is a valid reason, then I think the
comments should be updated to explain it.

2.
+ExecGatherMerge(GatherMergeState * node)
{
..
+ if (!node->initialized)
+ {
+     EState     *estate = node->ps.state;
+     GatherMerge *gm = (GatherMerge *) node->ps.plan;
+
+     /*
+      * Sometimes we might have to run without parallelism; but if parallel
+      * mode is active then we can try to fire up some workers.
+      */
+     if (gm->num_workers > 0 && IsInParallelMode())
+     {
+         ParallelContext *pcxt;
+         bool        got_any_worker = false;
+
+         /* Initialize the workers required to execute Gather node. */
+         if (!node->pei)
+             node->pei = ExecInitParallelPlan(node->ps.lefttree,
+                                              estate,
+                                              gm->num_workers);
..
}

There is a lot of common code between ExecGatherMerge and ExecGather.
Do you think it makes sense to have a common function to avoid the
duplication?

I see there are small discrepancies between the two, e.g. I don't see
the use of the single_copy flag, which is present in the Gather node.

3.
+gather_merge_readnext(GatherMergeState * gm_state, int reader, bool force)
{
..
+ tup = gm_readnext_tuple(gm_state, reader, force, NULL);
+
+ /*
+  * try to read more tuple into nowait mode and store it into the tuple
+  * array.
+  */
+ if (HeapTupleIsValid(tup))
+     fill_tuple_array(gm_state, reader);

How is the above read tuple stored in the array?  In any case, the
above interface seems slightly awkward to me.  Basically, I think what
you are trying to do here is, after reading the first tuple in wait
mode, you are trying to fill the array by reading more tuples.  So,
can't we push reading of this first tuple into that function and name
it form_tuple_array()?
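
To make the suggestion concrete, below is a minimal sketch of what such
a combined function could look like.  It reuses gm_readnext_tuple() and
MAX_TUPLE_STORE from the patch excerpts above, but the control flow, the
assumption that force = true means a blocking read, and the
store_tuple_in_array() helper are mine, for illustration only:

/*
 * Sketch only: read the first tuple for 'reader' in wait mode, then keep
 * reading in nowait mode until the per-reader tuple array (MAX_TUPLE_STORE
 * entries) is full or no more tuples are immediately available.
 */
static void
form_tuple_array(GatherMergeState *gm_state, int reader)
{
    bool        done = false;
    HeapTuple   tup;
    int         i;

    /* First tuple: wait for it (assuming force = true means blocking). */
    tup = gm_readnext_tuple(gm_state, reader, true, &done);
    if (!HeapTupleIsValid(tup))
        return;
    store_tuple_in_array(gm_state, reader, tup);    /* hypothetical helper */

    /* Opportunistically fill the rest of the array without blocking. */
    for (i = 1; i < MAX_TUPLE_STORE && !done; i++)
    {
        tup = gm_readnext_tuple(gm_state, reader, false, &done);
        if (!HeapTupleIsValid(tup))
            break;
        store_tuple_in_array(gm_state, reader, tup);
    }
}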

4.
+create_gather_merge_path(..)
{
..
+  0 /* FIXME: pathnode->limit_tuples*/);

What exactly do you want to fix in the above code?

5.
 +/* Tuple array size */
+#define MAX_TUPLE_STORE 10

Have you tried other values of MAX_TUPLE_STORE?  If yes, then what are
the results?  I think it is better to add a comment explaining why this
array size is best for performance.


6.
+/* INTERFACE ROUTINES
+ * ExecInitGatherMerge - initialize the MergeAppend node
+ * ExecGatherMerge - retrieve the next tuple from the node
+ * ExecEndGatherMerge - shut down the MergeAppend node
+ * ExecReScanGatherMerge - rescan the MergeAppend node

typo.  /MergeAppend/GatherMerge

7.
+static TupleTableSlot *gather_merge_getnext(GatherMergeState * gm_state);
+static HeapTuple gm_readnext_tuple(GatherMergeState * gm_state, int nreader,
+                                   bool force, bool *done);

The formatting near GatherMergeState doesn't seem to be appropriate.  I
think you need to add GatherMergeState to typedefs.list and then run
pgindent again.

8.
+ /*
+  * Initialize funnel slot to same tuple descriptor as outer plan.
+  */
+ if (!ExecContextForcesOids(&gm_state->ps, &hasoid))

I think in the above comment, you mean "Initialize GatherMerge slot".

9.
+ /* Does tuple array have any avaiable tuples? */
/avaiable/available

>
> Open Issue:
>
> - Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed the leak in
> tqueue.c by calling gather_readnext() in a per-tuple context. Now for
> gather merge that is not possible, as we are storing the tuple in the
> tuple array and we want the tuple to be freed only once it has passed
> through the merge sort algorithm. One idea is, we can also call
> gm_readnext_tuple() under the per-tuple context (which will fix the
> leak in tqueue.c) and then store a copy of the tuple into the tuple
> array.
>

Won't an extra copy per tuple impact performance?  Was the fix in the
mentioned commit for record or composite types?  If so, does
GatherMerge support such types, and if it supports them, does it
provide any benefit over Gather?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Steps inside ExecEndGather

2016-10-16 Thread Amit Kapila
On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> Hello,
>
> I'm now trying to carry extra performance statistics on CustomScan
> (like DMA transfer rate, execution time of GPU kernels, etc...)
> from parallel workers to the leader process using the DSM segment
> attached by the parallel-context.
> We can require an arbitrary length of DSM using ExecCustomScanEstimate
> hook by extension, then it looks leader/worker can share the DSM area.
> However, we have a problem on this design.
>
> Below is the implementation of ExecEndGather().
>
>   void
>   ExecEndGather(GatherState *node)
>   {
>   ExecShutdownGather(node);
>   ExecFreeExprContext(&node->ps);
>   ExecClearTuple(node->ps.ps_ResultTupleSlot);
>   ExecEndNode(outerPlanState(node));
>   }
>
> It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> The DSM segment shall be released on this call, so child node cannot
> reference the DSM at the time of ExecEndNode().
>

Before releasing the DSM, we do collect all the statistics or
instrumentation information of each node; refer to
ExecParallelFinish()->ExecParallelRetrieveInstrumentation().  So I am
wondering why you can't collect the additional information in the same
way?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-15 Thread Amit Kapila
cking and no_content_lock patches.

So, the crux is that we are seeing more variability in some of the
results because of frequent accesses to different clog pages, which is
not so easy to predict, but I think it is possible at ~100,000 tps.
>
> There's certainly much more interesting stuff in the results, but I
> don't have time for more thorough analysis now - I only intended to do
> some "quick benchmarking" on the patch, and I've already spent days on
> this, and I have other things to do.
>

Thanks a ton for doing such a detailed testing.


[1] - 
https://www.postgresql.org/message-id/CA%2BTgmoahCx6XgprR%3Dp5%3D%3DcF0g9uhSHsJxVdWdUEHN9H2Mv0gkw%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAA4eK1%2BSoW3FBrdZV%2B3m34uCByK3DMPy_9QQs34yvN8spByzyA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Question about behavior of snapshot too old feature

2016-10-14 Thread Amit Kapila
On Fri, Oct 14, 2016 at 1:40 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> Hi all,
>
> I have a question about behavior of snapshot too old feature.
>
> For example, I set old_snapshot_threshold = 1min and prepare a table
> and two terminals.
> And I did the followings steps.
>
> 1. [Terminal 1] Begin transaction and get snapshot data and wait.
>  BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
>  SELECT * FROM test;
>
> 2. [Terminal 2] Another session updates test table in order to make
> snapshot dirty.
>  BEGIN;
>  UPDATE test SET c = c + 100;
>  COMMIT;
>
> 3. [Terminal 1] 1 minute after, read the test table again in same
> transaction opened at #1. I got no error.
> SELECT * FROM test;
>
> 4. [Terminal 2] Another session reads the test table.
>  BEGIN;
>  SELECT * FROM test;
>  COMMIT;
>
> 5. [Terminal 1] 1 minute after, read the test table again, and got
> "snapshot error" error.
>  SELECT * FROM test;
>
> Since #2 makes a snapshot I got at #1 dirty, I expected to get
> "snapshot too old" error at #3 where I read test table again after
> enough time. But I could never get "snapshot too old" error at #3.
>

Here, the basic idea is that until the corresponding page is pruned or
table vacuuming is triggered, this error won't occur.  So, I think what
is happening here is that during step #3 or step #4 the table got
pruned, after which you started getting the error.
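
For reference, a simplified sketch of the check that raises the error
(modeled on the TestForOldSnapshot() logic; this is an illustrative
approximation, not a verbatim copy of the PostgreSQL code):

/*
 * Sketch: "snapshot too old" is raised only when a page the scan touches
 * has been modified (e.g. pruned) after the snapshot was taken, so merely
 * waiting longer than old_snapshot_threshold is not enough to trigger it.
 */
static void
sketch_test_for_old_snapshot(Snapshot snapshot, Relation relation, Page page)
{
    if (old_snapshot_threshold >= 0 &&
        RelationAllowsEarlyPruning(relation) &&
        PageGetLSN(page) > snapshot->lsn)
        ereport(ERROR,
                (errcode(ERRCODE_SNAPSHOT_TOO_OLD),
                 errmsg("snapshot too old")));
}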

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] btree vacuum and suspended scans can deadlock

2016-10-13 Thread Amit Kapila
On Fri, Oct 14, 2016 at 2:52 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Oct 13, 2016 at 6:33 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> If we agree that above is a problematic case, then some of the options
>> to solve it could be (a) Vacuum should not wait for a cleanup lock and
>> instead just give up and start again which I think is a bad idea (b)
>> don't allow to take lock of higher granularity after the scan is
>> suspended, not sure if that is feasible (c) document the above danger,
>> this sounds okay on the ground that nobody has reported the problem
>> till now
>
> I don't think any of these sound particularly good.
>

Tom has suggested something similar to approach (b) in his mail [1]:
basically, rejecting some commands like Truncate, Reindex, etc. if the
current transaction is already holding the table open.  I think if we
want to go that way, it might be better to reject any command that
requires a lock level higher than the one the current transaction
already has on the table.  Tom suggested a few more ways to resolve
such deadlocks in that thread, but I think we never implemented those.

Here, one point to think about is whether we really need to invent ways
to make hash indexes not prone to that problem when it can occur for
other indexes or even for the heap.  Even if we want to do something, I
think the solution has to be common.


[1] - https://www.postgresql.org/message-id/21534.1200956...@sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] btree vacuum and suspended scans can deadlock

2016-10-13 Thread Amit Kapila
On Fri, Oct 14, 2016 at 3:14 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Thu, Oct 13, 2016 at 6:33 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> If we agree that above is a problematic case, then some of the options
>>> to solve it could be (a) Vacuum should not wait for a cleanup lock and
>>> instead just give up and start again which I think is a bad idea (b)
>>> don't allow to take lock of higher granularity after the scan is
>>> suspended, not sure if that is feasible (c) document the above danger,
>>> this sounds okay on the ground that nobody has reported the problem
>>> till now
>
>> I don't think any of these sound particularly good.
>
> Note that it's a mistake to imagine that this is specific to indexes;
> the same type of deadlock risk exists just when considering heap cleanup.
> We've ameliorated the heap case quite a bit by recognizing situations
> where it's okay to skip a page and move on, but it's not gone.
> Unfortunately indexes don't get to decide that deletion is optional.
>
> I was about to suggest that maybe we didn't need cleanup locks in btree
> indexes anymore now that SnapshotNow is gone,
>

Wouldn't it still be a problem for SnapshotAny?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] btree vacuum and suspended scans can deadlock

2016-10-13 Thread Amit Kapila
During vacuum, cleanup in btree needs to acquire a cleanup lock on the
block, and non-MVCC scans or scans on an unlogged index retain the pin
on a buffer while scanning a block.  Now, suppose the scan is suspended
(use a cursor and then fetch one row) after holding a pin on a buffer
in Session-1, and in Session-2 we initiate the vacuum, which will wait
for the suspended scan in Session-1 to finish (by this time vacuum has
a ShareUpdateExclusiveLock on the table).  If the user in Session-1 now
tries to acquire a conflicting lock on the same table, it will create a
deadlock.

Simple test to reproduce deadlock:

create unlogged table ul_t1(c1 int);
create index idx_ul_t1 on ul_t1 (c1);
insert into ul_t1 values(1);
insert into ul_t1 values(2);
delete from ul_t1 where c1=1;

Session-1
set enable_seqscan = off;
begin transaction;
declare c1 cursor for select * from ul_t1 where c1=2;
fetch next from c1; --every thing is good till here

Session-2
vacuum ul_t1;  -- This will wait for session-1 to complete the scan

Session-1
lock table ul_t1 in Access Exclusive Mode;  -- This will wait for
session-2 to complete the vacuum

If we agree that the above is a problematic case, then some of the
options to solve it could be: (a) vacuum should not wait for a cleanup
lock and instead just give up and start again, which I think is a bad
idea; (b) don't allow taking a lock of higher granularity after the
scan is suspended, though I am not sure that is feasible; (c) document
the above danger, which sounds okay on the grounds that nobody has
reported the problem till now.

I found this problem while analysing a similar case for hash index
interaction with vacuum, as is being discussed on thread [1].  We could
think of solving it in the way proposed on that thread, but I think
that, as proposed, it is not safe for non-MVCC scans.

Thoughts?


[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BJM%2BffHgUfW8wf%2BLgn2eJ1fGjyn6b_L5m0fiTEj2_6Pw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] Parallel Index Scans

2016-10-12 Thread Amit Kapila
32.986..15033.093 rows=3 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 ->  Partial Aggregate  (cost=3923772.92..3923772.92 rows=1
width=8) (actual time=15030.354..15030.354 rows=1 loops=3)
   ->  Parallel Index Scan using pgbench_accounts_pkey on
pgbench_accounts  (cost=0.57..3920801.08 rows=1188734 width=4) (actual
time=12476.068..14600.410 rows=330 loops=3)
 Index Cond: ((aid > 1000) AND (aid < 9000))
 Filter: ((bid > 800) AND (bid < 900))
 Rows Removed by Filter: 26699666
 Planning time: 0.244 ms
 Execution time: 15036.081 ms
(11 rows)

The above is the median of 3 runs; all the runs gave almost the same
execution time.  Here, we can notice that the execution time is reduced
by more than half with two workers, and I have tested with four workers
where the time is reduced to one-fourth (9128.420 ms) of the serial
plan.  I think these results are quite similar to what we got for
parallel sequential scans.  Another thing to note is that parallelising
index scans is more beneficial if there is a Filter which removes many
rows fetched from the Index Scan, or if the Filter is costly (for
example, the filter contains a costly function execution).  This
observation is also quite similar to what we have observed with
Parallel Sequential Scans.

I think we can parallelise Index Only Scans as well, but I have not
evaluated that, and certainly it can be done as a separate patch in
the future.

Contributions
----
First patch (parallel_index_scan_v1.patch) implements parallelism at
the IndexAM level - Rahila Syed and Amit Kapila, based on design inputs
and suggestions by Robert Haas.
Second patch (parallel_index_opt_exec_support_v1.patch) provides
optimizer and executor support for parallel index scans - Amit Kapila.

The order to use these patches is: first apply the condition variable
patch [1], then parallel_index_scan_v1.patch, and then
parallel_index_opt_exec_support_v1.patch.

Thoughts?

[1] - 
https://www.postgresql.org/message-id/CAEepm%3D0zshYwB6wDeJCkrRJeoBM%3DjPYBe%2B-k_VtKRU_8zMLEfA%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAGPqQf09oPX-cQRpBKS0Gq49Z%2Bm6KBxgxd_p9gX8CKk_d75HoQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


parallel_index_scan_v1.patch
Description: Binary data


parallel_index_opt_exec_support_v1.patch
Description: Binary data



Re: [HACKERS] Hash Indexes

2016-10-10 Thread Amit Kapila
On Mon, Oct 10, 2016 at 10:07 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Mon, Oct 10, 2016 at 5:55 AM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>>
>> On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>> > On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmh...@gmail.com>
>> > wrote:
>> >
>> >> Another thing I don't quite understand about this algorithm is that in
>> >> order to conditionally lock the target primary bucket page, we'd first
>> >> need to read and pin it.  And that doesn't seem like a good thing to
>> >> do while we're holding a shared content lock on the metapage, because
>> >> of the principle that we don't want to hold content locks across I/O.
>> >>
>> >
>>
>> Aren't we already doing this during BufferAlloc() when the buffer
>> selected by StrategyGetBuffer() is dirty?
>
>
> Right, you probably shouldn't allocate another buffer while holding a
> content lock on a different one, if you can help it.
>

I don't see a problem with that, but I guess the simple rule is that
we should not hold content locks for a long duration, which could
happen if we do I/O or need to allocate a new buffer.

> But, BufferAlloc
> doesn't do that internally, does it?
>

You are right that BufferAlloc() doesn't allocate a new buffer while
holding a content lock on another buffer, but it does perform I/O while
holding a content lock.

>  It is only a problem if you make it be
> one by the way you use it.  Am I missing something?
>
>>
>>
>> > I think we can release metapage content lock before reading the buffer.
>> >
>>
>> On thinking about this again, if we release the metapage content lock
>> before reading and pinning the primary bucket page, then we need to
>> take it again to verify whether a split has happened during the time we
>> don't have a lock on the metapage.  Releasing and re-taking the content
>> lock on the metapage is not good from a performance perspective.  Do
>> you have some other idea for this?
>
>
> Doesn't the relcache patch effectively deal wit hthis?  If this is a
> sticking point, maybe the relcache patch could be incorporated into this
> one.
>

Yeah, the relcache patch would eliminate the need for metapage locking,
but that is not a blocking point.  As this patch is mainly to enable
WAL logging, there is no urgency to incorporate the relcache patch,
even if we have to go with an algorithm where we need to take the
metapage lock twice to verify the splits.  Having said that, I am okay
with it if Robert and/or others are also in favour of combining the two
patches (the patch in this thread and the cache-the-metapage patch).
If we don't want to hold a content lock across another ReadBuffer call,
then another option could be to modify the read algorithm as below:

read the metapage
compute bucket number for target hash key based on metapage contents
read the required block
loop:
    acquire shared content lock on metapage
    recompute bucket number for target hash key based on metapage contents
    if the recomputed block number is not the same as the block number we read
        release metapage content lock
        read the recomputed block number
    else
        break;
if (we can't get a shared content lock on the target bucket without blocking)
    loop:
        release metapage content lock
        take a shared content lock on the target primary bucket page
        take a shared content lock on the metapage
        if (previously-computed target bucket has not been split)
            break;

The basic change here is that first we compute the target block number
*without* locking the metapage, and then, after locking the metapage,
if the two don't match, we need to read the recomputed block number
again.
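
To illustrate, below is a C-style sketch of the recompute-and-re-read
part of the above algorithm (the conditional lock on the target bucket
is omitted, and pin/error handling is simplified); rel and hashkey are
assumed to come from the surrounding scan code:

/* 1. Read the metapage without a content lock and compute the target. */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_NOLOCK, LH_META_PAGE);
metap = HashPageGetMeta(BufferGetPage(metabuf));
bucket = _hash_hashkey2bucket(hashkey, metap->hashm_maxbucket,
                              metap->hashm_highmask, metap->hashm_lowmask);
blkno = BUCKET_TO_BLKNO(metap, bucket);

/* 2. Read the computed bucket page. */
buf = _hash_getbuf(rel, blkno, HASH_NOLOCK, LH_BUCKET_PAGE);

/* 3. Verify the computation under a shared lock on the metapage. */
for (;;)
{
    LockBuffer(metabuf, BUFFER_LOCK_SHARE);
    bucket = _hash_hashkey2bucket(hashkey, metap->hashm_maxbucket,
                                  metap->hashm_highmask, metap->hashm_lowmask);
    if (BUCKET_TO_BLKNO(metap, bucket) == blkno)
        break;                  /* still pointing at the right bucket page */

    /* A split moved our key; drop the old page and read the new one. */
    blkno = BUCKET_TO_BLKNO(metap, bucket);
    LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);
    buf = _hash_getbuf(rel, blkno, HASH_NOLOCK, LH_BUCKET_PAGE);
}
/* ... continue with the conditional shared lock on the target bucket ... */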

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] cygwin64 assertion failure

2016-10-10 Thread Amit Kapila
On Sun, Oct 9, 2016 at 10:23 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> lorikeet seems to be stuck running the parallel tests, after having failed
> an assertion.
>

It looks like this problem has been discussed before [1].  Shall we do
what Tom has suggested there?


[1] - https://www.postgresql.org/message-id/15344.1473974558%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] vacuumdb -f and -j options (was Question / requests.)

2016-10-10 Thread Amit Kapila
On Sun, Oct 9, 2016 at 10:59 PM, Francisco Olarte
<fola...@peoplecall.com> wrote:
> On Sat, Oct 8, 2016 at 2:22 PM, Michael Paquier
> <michael.paqu...@gmail.com> wrote:
>> On Sat, Oct 8, 2016 at 9:12 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
>>> After reading Francisco's proposal [1], I don't think it is directly
>>> trying to make -f and -j work together.  He is proposing to make it
>>> work by providing some new options.  As you are wondering upthread, I
>>> think it seems reasonable to disallow -f with parallel vacuuming if no
>>> tables are specified.
>
> For me -f & -j is not perfect, but better than not having it. It can
> deadlock when given certain sets of catalog tables, either by making
> it go for the full db or by a perverse set of -t options. But any DBA
> needing them together should, IMO, have resources to write ( or have
> someone else write for him ) a 20-liner wrapping and feeding them via
> -t. After all, not every tool/option is for everyone, and everything
> has it prerequisites.
>

Okay, but I think that doesn't mean it should deadlock when used by a
somewhat naive user.  I am not sure every user who wants to use -f and
-j is smart enough to write a script as you are suggesting.  I think
if more people see your proposal as meaningful and want to leave the
current usage of -f and -j as it is, then we should probably issue a
warning indicating such a risk.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-10 Thread Amit Kapila
On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
>
>> Another thing I don't quite understand about this algorithm is that in
>> order to conditionally lock the target primary bucket page, we'd first
>> need to read and pin it.  And that doesn't seem like a good thing to
>> do while we're holding a shared content lock on the metapage, because
>> of the principle that we don't want to hold content locks across I/O.
>>
>

Aren't we already doing this during BufferAlloc() when the buffer
selected by StrategyGetBuffer() is dirty?

> I think we can release metapage content lock before reading the buffer.
>

On thinking about this again, if we release the metapage content lock
before reading and pinning the primary bucket page, then we need to
take it again to verify whether a split has happened during the time we
don't have a lock on the metapage.  Releasing and re-taking the content
lock on the metapage is not good from a performance perspective.  Do
you have some other idea for this?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] vacuumdb -f and -j options (was Question / requests.)

2016-10-08 Thread Amit Kapila
On Sat, Oct 8, 2016 at 5:52 PM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> On Sat, Oct 8, 2016 at 9:12 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Fri, Oct 7, 2016 at 10:16 PM, Alvaro Herrera
>> <alvhe...@2ndquadrant.com> wrote:
>>> Robert Haas wrote:
>>>> On Wed, Oct 5, 2016 at 10:58 AM, Francisco Olarte
>>>> I don't know, but it seems like the documentation for vacuumdb
>>>> currently says, more or less, "Hey, if you use -j with -f, it may not
>>>> work!", which seems unacceptable to me.  It should be the job of the
>>>> person writing the feature to make it work in all cases, not the job
>>>> of the person using the feature to work around the problem when it
>>>> doesn't.
>>>
>>> The most interesting use case of vacuumdb is lazy vacuuming, I think, so
>>> committing that patch as it was submitted previously was a good step
>>> forward even if it didn't handle VACUUM FULL 100%.
>>>
>>> I agree that it's better to have both modes Just Work in parallel, which
>>> is the point of this subsequent patch.  So let's move forward.  I
>>> support Francisco's effort to make -f work with -j.  I don't have a
>>> strong opinion on which of the various proposals presented so far is the
>>> best way to implement it, but let's figure that out and get it done.
>>>
>>
>> After reading Francisco's proposal [1], I don't think it is directly
>> trying to make -f and -j work together.  He is proposing to make it
>> work by providing some new options.  As you are wondering upthread, I
>> think it seems reasonable to disallow -f with parallel vacuuming if no
>> tables are specified.
>
> Instead of restricting completely things, I'd like to think that being
> able to make both of them work together is the right move at the end.
>

Sure, if somebody can come up with a patch which can safely avoid the
deadlock when both -f and -j options are used, then we should go that
way.  Otherwise, we can block those options from being used together
rather than just having a note in the docs.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] vacuumdb -f and -j options (was Question / requests.)

2016-10-08 Thread Amit Kapila
On Fri, Oct 7, 2016 at 10:16 PM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> On Wed, Oct 5, 2016 at 10:58 AM, Francisco Olarte
>> I don't know, but it seems like the documentation for vacuumdb
>> currently says, more or less, "Hey, if you use -j with -f, it may not
>> work!", which seems unacceptable to me.  It should be the job of the
>> person writing the feature to make it work in all cases, not the job
>> of the person using the feature to work around the problem when it
>> doesn't.
>
> The most interesting use case of vacuumdb is lazy vacuuming, I think, so
> committing that patch as it was submitted previously was a good step
> forward even if it didn't handle VACUUM FULL 100%.
>
> I agree that it's better to have both modes Just Work in parallel, which
> is the point of this subsequent patch.  So let's move forward.  I
> support Francisco's effort to make -f work with -j.  I don't have a
> strong opinion on which of the various proposals presented so far is the
> best way to implement it, but let's figure that out and get it done.
>

After reading Francisco's proposal [1], I don't think it is directly
trying to make -f and -j work together.  He is proposing to make it
work by providing some new options.  As you are wondering upthread, I
think it seems reasonable to disallow -f with parallel vacuuming if no
tables are specified.


[1] - 
https://www.postgresql.org/message-id/CA%2BbJJbx8%2BSKBU%3DXUE%2BHxZHysh9226iMfTnA69AznwRTOEGtR7Q%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pgbench more operators & functions

2016-10-08 Thread Amit Kapila
On Sat, Oct 8, 2016 at 12:58 PM, Fabien COELHO <coe...@cri.ensmp.fr> wrote:
>
> Hello Tom,
>
> I comment here on the first part of your remarks. I've answered the second
> part in another mail.
>
>>> (1) The required schema is slightly different : currently the type used
>>> for holding balances is not wide enough per the TCP-B standard, this mean
>>> maybe having an option to do "pgbench -i --standard-tpcb" which would
>>> generate the right schema, probably it should just change a few INTEGER
>>> to
>>> INT8, or maybe use NUMERIC(10). I have not done such a patch yet.
>>
>>
>> The whole question of the database setup is an interesting one.
>> If we were to do anything at all here, I'd want to see not only the table
>> schemas and initial population, but also the hard-wired "vacuum" logic,
>> somehow made not so hard-wired.  I have no good ideas about that. The SQL
>> commands could possibly be taken from scripts, but what of all the work
>> that's gone into friendly progress reporting for table loading?
>
>
> I'm unconvince by the current status, especially the default behaviors. I
> think it should do a good sensible representative job, and not be a minimum
> installation.
>
> For instance, the default setup does not use foreign keys. It should be the
> reverse, foreign keys should be included by default and an option should be
> used to lessen the schema quality.
>
> Also, given the heavy UPDATE nature of the pgbench test, a non 100% default
> fill factor on some tables would make sense.
>

FWIW, some time back I saw that with fillfactor 80, at a somewhat
moderate client count (32) on a 192 hyper-threaded machine, the
performance is 20~30% better, but at higher client counts it was the
same as with fillfactor 100.  I think if we go by your theory, one
could also argue for non-default values of the autovacuum threshold
parameters.  pgbench already has a parameter to specify a non-default
fillfactor, and I think that is sufficient for anyone to do performance
testing.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pgbench vs. wait events

2016-10-08 Thread Amit Kapila
On Fri, Oct 7, 2016 at 11:34 PM, Alfred Perlstein <alf...@freebsd.org> wrote:
>
> Are the batched writes all done before fsync is called?
>

In most cases, yes; but if the WAL segment (16MB in size) is finished
while writing, we do an fsync, then do the remaining writes, and at the
end perform an fsync again.

> Are you sure that A only calls fsync after flushing all the buffers from B,
> C, and D?  Or will it fsync twice?
>

I don't think it is guaranteed; however, if B, C and D have finished
writing to the WAL buffers before A starts writing out buffers, then A
will combine all those writes and call fsync just once.
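
To illustrate the combining behaviour, here is a heavily simplified
sketch; it is not the actual XLogFlush()/XLogWrite() code, and
latest_write_request() and write_and_fsync_up_to() are placeholder
names:

/*
 * Sketch of WAL write combining: whoever gets WALWriteLock flushes up to
 * the newest request it can see, so backends whose requests were already
 * covered return without doing any I/O of their own.
 */
static XLogRecPtr flushedUpTo = 0;      /* really lives in shared memory */

static void
flush_wal_up_to(XLogRecPtr request)
{
    if (request <= flushedUpTo)
        return;                         /* someone already flushed past us */

    LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
    if (request > flushedUpTo)
    {
        /* Pick up requests made meanwhile by other backends (B, C, D). */
        XLogRecPtr  target = latest_write_request();    /* placeholder */

        write_and_fsync_up_to(target);                  /* placeholder */
        flushedUpTo = target;
    }
    LWLockRelease(WALWriteLock);
}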

>  Is there instrumentation to show that?
>

Currently, I don't think we have instrumentation to show that, but you
can add it if you want.  In XLogFlush(), we do print the write request
values (if XLOG_DEBUG is set and we have compiled the code with the
WAL_DEBUG option); you can put one print in XLogWrite once the
write/flush is done and another print when the request is satisfied
without calling XLogWrite.

> I know there's a tremendous level of skill involved in this code, but simply
> asking in case there's some tricks.
>
> Another strategy that may work is actually intentionally waiting/buffering
> some few ms between flushes/fsync,

We do that before attempting to write, if the user has set the
"commit_delay" and "commit_siblings" GUC parameters.

Now here, we can't buffer the fsync requests, as currently we do both
the writes and the fsync under one lock.  However, if we could split
the work such that writes are done under one lock and fsync under a
separate lock, then we could probably buffer fsync requests, and after
fsyncing the current pending requests, recheck whether more pending
requests have arrived and try to flush them.
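
A very rough sketch of the split being suggested, purely to show the
shape of the idea; WALFsyncLock does not exist today and the helper
names are placeholders:

/* Writes and fsyncs protected by different locks, so that fsync requests
 * can be batched independently of the writes. */
LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
write_pending_wal_buffers();                 /* placeholder */
LWLockRelease(WALWriteLock);

LWLockAcquire(WALFsyncLock, LW_EXCLUSIVE);   /* hypothetical new lock */
do
{
    fsync_current_segment();                 /* placeholder */
} while (more_fsync_requests_pending());     /* placeholder: recheck queue */
LWLockRelease(WALFsyncLock);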

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-07 Thread Amit Kapila
On Fri, Oct 7, 2016 at 3:02 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
>
> I got access to a large machine with 72/144 cores (thanks to Oleg and
> Alexander from Postgres Professional), and I'm running the tests on that
> machine too.
>
> Results from Dilip's workload (with scale 300, unlogged tables) look like
> this:
>
> 32  64128 192224 256288
>   master104943  128579  72167  100967  66631   97088  63767
>   granular-locking  103415  141689  83780  120480  71847  115201  67240
>   group-update  105343  144322  92229  130149  81247  126629  76638
>   no-content-lock   103153  140568  80101  119185  70004  115386  66199
>
> So there's some 20-30% improvement for >= 128 clients.
>

So here we see a performance improvement starting at 64 clients, which
is somewhat similar to what Dilip saw in his tests.

> But what I find much more intriguing is the zig-zag behavior. I mean, 64
> clients give ~130k tps, 128 clients only give ~70k but 192 clients jump up
> to >100k tps again, etc.
>

No clear answer.

> FWIW I don't see any such behavior on pgbench, and all those tests were done
> on the same cluster.
>
>>> With 4.5.5, results for the same benchmark look like this:
>>>
>>>                       64     128     192
>>> 
>>>  master            35693   39822   42151
>>>  granular-locking  35370   39409   41353
>>>  no-content-lock   36201   39848   42407
>>>  group-update      35697   39893   42667
>>>
>>> That seems like a fairly bad regression in kernel, although I have not
>>> identified the feature/commit causing it (and it's also possible the
>>> issue
>>> lies somewhere else, of course).
>>>
>>> With regular pgbench, I see no improvement on any kernel version. For
>>> example on 3.19 the results look like this:
>>>
>>>                       64     128     192
>>> 
>>>  master            54661   61014   59484
>>>  granular-locking  55904   62481   60711
>>>  no-content-lock   56182   62442   61234
>>>  group-update      55019   61587   60485
>>>
>>
>> Are the above results with synchronous_commit=off?
>>
>
> No, but I can do that.
>
>>> I haven't done much more testing (e.g. with -N to eliminate
>>> collisions on branches) yet, let's see if it changes anything.
>>>
>>
>> Yeah, let us see how it behaves with -N. Also, I think we could try
>> at higher scale factor?
>>
>
> Yes, I plan to do that. In total, I plan to test combinations of:
>
> (a) Dilip's workload and pgbench (regular and -N)
> (b) logged and unlogged tables
> (c) scale 300 and scale 3000 (both fits into RAM)
> (d) sync_commit=on/off
>

sounds sensible.

Thanks for doing the tests.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-06 Thread Amit Kapila
On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> I think one way to avoid the risk of deadlock in above scenario is to
>> take the cleanup lock conditionally, if we get the cleanup lock then
>> we will delete the items as we are doing in patch now, else it will
>> just mark the tuples as dead and ensure that it won't try to remove
>> tuples that are moved-by-split.  Now, I think the question is how will
>> these dead tuples be removed.  We anyway need a separate mechanism to
>> clear dead tuples for hash indexes as during scans we are marking the
>> tuples as dead if corresponding tuple in heap is dead which are not
>> removed later.  This is already taken care in btree code via
>> kill_prior_tuple optimization.  So I think clearing of dead tuples can
>> be handled by a separate patch.
>
> That seems like it could work.  The hash scan code will need to be
> made smart enough to ignore any tuples marked dead, if it isn't
> already.
>

It already takes care of ignoring killed tuples in the code below,
though in a much less efficient way than btree.  Basically, it fetches
the matched tuple and only then checks whether it is dead, whereas
btree does the same check while matching the key.  It might be more
efficient to do it before matching the hash key, but I think that is a
matter for a separate patch.
hashgettuple()
{
..
/*
* Skip killed tuples if asked to.
*/
if (scan->ignore_killed_tuples)
}
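
For illustration, the reordering hinted at above could look roughly like
this inside the page-scanning loop (identifiers are from the hash AM
sources, but the placement is only a sketch, not an actual patch):

    itemid = PageGetItemId(page, offnum);
    if (scan->ignore_killed_tuples && ItemIdIsDead(itemid))
        continue;       /* skip before fetching and matching the hash key */

    itup = (IndexTuple) PageGetItem(page, itemid);
    if (_hash_checkqual(scan, itup))
        break;          /* found a live, matching tuple */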



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP: Covering + unique indexes.

2016-10-05 Thread Amit Kapila
On Tue, Oct 4, 2016 at 7:50 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Oct 4, 2016 at 9:20 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>>> I'd say that the reason for not using included columns in any
>>>> operations which require comparisons, is that they don't have opclass.
>>>> Let's go back to the example of points.
>>>> This data type don't have any opclass for B-tree, because of fundamental
>>>> reasons.
>>>> And we can not apply _bt_compare() and others to this attribute, so
>>>> we don't include it to scan key.
>>>>
>>>> create table t (i int, i2 int, p point);
>>>> create index idx1 on (i) including (i2);
>>>> create index idx2 on (i) including (p);
>>>> create index idx3 on (i) including (i2, p);
>>>> create index idx4 on (i) including (p, i2);
>>>>
>>>> You can keep tuples ordered in idx1, but not for idx2, partially ordered 
>>>> for
>>>> idx3, but not for idx4.
>>>
>>> Yeah, I think we shouldn't go there.  I mean, once you start ordering
>>> by INCLUDING columns, then you're going to need to include them in
>>> leaf pages because otherwise you can't actually guarantee that they
>>> are in the right order.
>>
>> I am not sure what you mean by above, because patch already stores
>> INCLUDING columns in leaf pages.
>
> Sorry, I meant non-leaf pages.
>

Okay, but in that case I think we don't need to store the INCLUDING
columns in non-leaf pages to get the exact ordering.  As mentioned
upthread, we can use the truncated scan key to reach the leaf level and
then use the complete key to find the exact location to store the key.
This is only possible if there exists an opclass for the columns that
are covered by the INCLUDING clause.  So, we can allow ORDER BY to use
an index scan only if the columns covered by the INCLUDING clause have
a btree opclass.
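
As a sketch of that insertion path (the helper functions below are
hypothetical, introduced only to illustrate the two-step descent; the real
work would go through _bt_search() and _bt_binsrch()):

    /* rel: the index relation, itup: the new index tuple */
    ScanKey      short_key = build_scankey_key_columns(rel, itup);             /* hypothetical */
    ScanKey      full_key  = build_scankey_with_included_columns(rel, itup);   /* hypothetical */
    Buffer       leafbuf   = descend_to_leaf(rel, short_key);                  /* hypothetical */
    OffsetNumber off       = find_insert_offset(rel, leafbuf, full_key);       /* hypothetical */

    /* the second step requires every INCLUDING column to have a btree opclass */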

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-10-05 Thread Amit Kapila
On Wed, Oct 5, 2016 at 12:05 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> Hi,
>
> After collecting a lot more results from multiple kernel versions, I can
> confirm that I see a significant improvement with 128 and 192 clients,
> roughly by 30%:
>
>                        64     128     192
> 
>  master             62482   43181   50985
>  granular-locking   61701   59611   47483
>  no-content-lock    62650   59819   47895
>  group-update       63702   64758   62596
>
> But I only see this with Dilip's workload, and only with pre-4.3.0 kernels
> (the results above are from kernel 3.19).
>

That appears positive.

> With 4.5.5, results for the same benchmark look like this:
>
>                        64     128     192
> 
>  master             35693   39822   42151
>  granular-locking   35370   39409   41353
>  no-content-lock    36201   39848   42407
>  group-update       35697   39893   42667
>
> That seems like a fairly bad regression in kernel, although I have not
> identified the feature/commit causing it (and it's also possible the issue
> lies somewhere else, of course).
>
> With regular pgbench, I see no improvement on any kernel version. For
> example on 3.19 the results look like this:
>
>                        64     128     192
> 
>  master             54661   61014   59484
>  granular-locking   55904   62481   60711
>  no-content-lock    56182   62442   61234
>  group-update       55019   61587   60485
>

Are the above results with synchronous_commit=off?

> I haven't done much more testing (e.g. with -N to eliminate collisions on
> branches) yet, let's see if it changes anything.
>

Yeah, let us see how it behaves with -N.  Also, I think we could try
at higher scale factor?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP: Covering + unique indexes.

2016-10-04 Thread Amit Kapila
On Tue, Sep 27, 2016 at 7:51 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Sep 26, 2016 at 11:17 AM, Anastasia Lubennikova
> <a.lubennik...@postgrespro.ru> wrote:
>>> Is there any fundamental problem in storing them in ordered way?  I
>>> mean to say, you need to anyway store all the column values on leaf
>>> page, so why can't we find the exact location for the complete key.
>>> Basically use truncated key to reach to leaf level and then use the
>>> complete key to find the exact location to store the key.  I might be
>>> missing some thing here, but if we can store them in ordered fashion,
>>> we can use them even for queries containing ORDER BY (where ORDER BY
>>> contains included columns).
>>
>> I'd say that the reason for not using included columns in any
>> operations which require comparisons, is that they don't have opclass.
>> Let's go back to the example of points.
>> This data type don't have any opclass for B-tree, because of fundamental
>> reasons.
>> And we can not apply _bt_compare() and others to this attribute, so
>> we don't include it to scan key.
>>
>> create table t (i int, i2 int, p point);
>> create index idx1 on (i) including (i2);
>> create index idx2 on (i) including (p);
>> create index idx3 on (i) including (i2, p);
>> create index idx4 on (i) including (p, i2);
>>
>> You can keep tuples ordered in idx1, but not for idx2, partially ordered for
>> idx3, but not for idx4.
>
> Yeah, I think we shouldn't go there.  I mean, once you start ordering
> by INCLUDING columns, then you're going to need to include them in
> leaf pages because otherwise you can't actually guarantee that they
> are in the right order.
>

I am not sure what you mean by the above, because the patch already
stores INCLUDING columns in leaf pages.

>  And then you have to wonder why an INCLUDING
> column is any different from a non-INCLUDING column.  It seems best to
> make a firm rule that INCLUDING columns are there only for the values,
> not for ordering.  That rule is simple and clear, which is a good
> thing.
>

Okay, we can make that a firm rule, but I think the reasoning behind it
should be clear.  As far as I understand from reading some of the mails
in this thread, it is because some other databases don't seem to support
ordering for included columns, or because supporting it would complicate
the code.  One point we should keep in mind is that the advice from
other databases to put many extra columns in the INCLUDING clause to get
index-only scans might not hold equally well for PostgreSQL, because it
can turn many otherwise-HOT updates into non-HOT updates.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-03 Thread Amit Kapila
On Tue, Oct 4, 2016 at 10:06 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>>
>>> As I was looking at the old text regarding deadlock risk, I realized
>>> what may be a serious problem.  Suppose process A is performing a scan
>>> of some hash index.  While the scan is suspended, it attempts to take
>>> a lock and is blocked by process B.  Process B, meanwhile, is running
>>> VACUUM on that hash index.  Eventually, it will do
>>> LockBufferForCleanup() on the hash bucket on which process A holds a
>>> buffer pin, resulting in an undetected deadlock. In the current
>>> coding, A would hold a heavyweight lock and B would attempt to acquire
>>> a conflicting heavyweight lock, and the deadlock detector would kill
>>> one of them.  This patch probably breaks that.  I notice that that's
>>> the only place where we attempt to acquire a buffer cleanup lock
>>> unconditionally; every place else, we acquire the lock conditionally,
>>> so there's no deadlock risk.  Once we resolve this problem, the
>>> paragraph about deadlock risk in this section should be revised to
>>> explain whatever solution we come up with.
>>>
>>> By the way, since VACUUM must run in its own transaction, B can't be
>>> holding arbitrary locks, but that doesn't seem quite sufficient to get
>>> us out of the woods.  It will at least hold ShareUpdateExclusiveLock
>>> on the relation being vacuumed, and process A could attempt to acquire
>>> that same lock.
>>>
>>
>> Right, I think there is a danger of deadlock in above situation.
>> Needs some more thoughts.
>>
>
> I think one way to avoid the risk of deadlock in above scenario is to
> take the cleanup lock conditionally, if we get the cleanup lock then
> we will delete the items as we are doing in patch now, else it will
> just mark the tuples as dead and ensure that it won't try to remove
> tuples that are moved-by-split.  Now, I think the question is how will
> these dead tuples be removed.  We anyway need a separate mechanism to
> clear dead tuples for hash indexes as during scans we are marking the
> tuples as dead if corresponding tuple in heap is dead which are not
> removed later.  This is already taken care in btree code via
> kill_prior_tuple optimization.  So I think clearing of dead tuples can
> be handled by a separate patch.
>

I think we can also remove the dead tuples the next time vacuum visits
the bucket and is able to acquire the cleanup lock.  Right now we just
check whether the corresponding heap tuple is dead; we can add a check
so that if the current item is already marked dead in the index, it is
also considered in the list of deletable items.
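
A minimal sketch of the conditional-cleanup-lock idea described above (the
deletion helper and the surrounding variables are assumed context; only the
locking and item-marking calls are existing APIs):

    if (ConditionalLockBufferForCleanup(bucket_buf))
    {
        /* no other scan holds a pin: safe to physically remove the items */
        _hash_delete_items(rel, bucket_buf, deletable, ndeletable);    /* hypothetical */
        LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
    }
    else
    {
        /* only mark tuples dead; leave moved-by-split tuples alone */
        LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
        for (i = 0; i < ndeletable; i++)
            ItemIdMarkDead(PageGetItemId(page, deletable[i]));
        MarkBufferDirtyHint(bucket_buf, true);
        LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
    }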


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-03 Thread Amit Kapila
On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>> As I was looking at the old text regarding deadlock risk, I realized
>> what may be a serious problem.  Suppose process A is performing a scan
>> of some hash index.  While the scan is suspended, it attempts to take
>> a lock and is blocked by process B.  Process B, meanwhile, is running
>> VACUUM on that hash index.  Eventually, it will do
>> LockBufferForCleanup() on the hash bucket on which process A holds a
>> buffer pin, resulting in an undetected deadlock. In the current
>> coding, A would hold a heavyweight lock and B would attempt to acquire
>> a conflicting heavyweight lock, and the deadlock detector would kill
>> one of them.  This patch probably breaks that.  I notice that that's
>> the only place where we attempt to acquire a buffer cleanup lock
>> unconditionally; every place else, we acquire the lock conditionally,
>> so there's no deadlock risk.  Once we resolve this problem, the
>> paragraph about deadlock risk in this section should be revised to
>> explain whatever solution we come up with.
>>
>> By the way, since VACUUM must run in its own transaction, B can't be
>> holding arbitrary locks, but that doesn't seem quite sufficient to get
>> us out of the woods.  It will at least hold ShareUpdateExclusiveLock
>> on the relation being vacuumed, and process A could attempt to acquire
>> that same lock.
>>
>
> Right, I think there is a danger of deadlock in above situation.
> Needs some more thoughts.
>

I think one way to avoid the risk of deadlock in the above scenario is
to take the cleanup lock conditionally: if we get the cleanup lock, then
we delete the items as the patch does now; otherwise we just mark the
tuples as dead and ensure that we don't try to remove tuples that are
moved-by-split.  Now, I think the question is how these dead tuples will
be removed.  We anyway need a separate mechanism to clear dead tuples
for hash indexes, as during scans we mark tuples as dead if the
corresponding heap tuple is dead, and they are not removed later.  This
is already taken care of in the btree code via the kill_prior_tuple
optimization.  So I think clearing of dead tuples can be handled by a
separate patch.

I have also thought about using the page-scan-at-a-time idea which has
been discussed upthread [1], but I think we can't completely eliminate
the need to out-wait scans (the cleanup-lock requirement) for scans that
are started while a split is in progress, or for non-MVCC scans, as
described in that e-mail [1].  We might be able to find some way to
solve the problem with this approach, but I think it will be somewhat
complicated and requires much more work compared to the previous
approach.

What is your preference among the above approaches to resolve this
problem?  Or let me know if you have a better idea to solve it.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1Jj1UqneTXrywr%3DGg87vgmnMma87LuscN_r3hKaHd%3DL2g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-30 Thread Amit Kapila
On 30-Sep-2016 10:26 PM, "Peter Geoghegan" <p...@heroku.com> wrote:
>
> On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas <robertmh...@gmail.com>
wrote:
> > I would just be very disappointed if, after the amount of work that
> > Amit and others have put into this project, the code gets rejected
> > because somebody thinks a different project would have been more worth
> > doing.
>
> I wouldn't presume to tell anyone else how to spend their time, and am
> not concerned about this patch making the hash index code any less
> useful from the user's perspective. If this is how we remove the wart
> of hash indexes not being WAL-logged, that's fine by me. I'm trying to
> be helpful.
>

If that is fine, then I think we should do that.  I want to bring to
your notice that we have already seen and reported that, with the
proposed set of patches, hash indexes are a good bit faster than btree,
so that adds additional value to making them WAL-logged.

> > As Tom said upthread: "But to kick the hash AM as such to the
> > curb is to say
> > 'sorry, there will never be O(1) index lookups in Postgres'."  I
> > think that's correct and a sufficiently-good reason to pursue this
> > work, regardless of the merits (or lack of merits) of hash-over-btree.
>
> I don't think that "O(1) index lookups" is a useful guarantee with a
> very expensive constant factor.

The constant factor doesn't play much of a role when the data has no
duplicates, or only a few.

> Amit seemed to agree with this, since
> he spoke of the importance of both theoretical performance benefits
> and practically realizable performance benefits.
>

No, I don't agree with that; rather, I think hash indexes are
theoretically faster than btree, and we have seen that practically as
well for quite a few cases (for read workloads, when used with unique
data and also in nested loops).

With Regards,
Amit Kapila


Re: [HACKERS] Hash Indexes

2016-09-30 Thread Amit Kapila
On 30-Sep-2016 6:24 AM, "Robert Haas" <robertmh...@gmail.com> wrote:
>
> On Thu, Sep 29, 2016 at 8:29 PM, Andres Freund <and...@anarazel.de> wrote:
> > On September 29, 2016 5:28:00 PM PDT, Robert Haas <robertmh...@gmail.com>
wrote:
> >>On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund <and...@anarazel.de>
> >>wrote:
> >>>> Well, I, for one, find it frustrating.  It seems pretty unhelpful to
> >>>> bring this up only after the code has already been written.
> >>>
> >>> I brought this up in person at pgcon too.
> >>
> >>To whom?  In what context?
> >
> > Amit, over dinner.
>
> OK, well, I can't really comment on that, then, except to say that if
> you waited three months to follow up on the mailing list, you probably
> can't blame Amit if he thought that it was more of a casual suggestion
> than a serious objection.  Maybe it was?  I don't know.
>

Both of them have talked about hash indexes with me offline.  Peter
mentioned that it would be better to improve btree rather than hash
indexes.  IIRC, Andres asked me mainly about what use cases I have in
mind for hash indexes, and then we had some further discussion on the
same topic, where he was not convinced that there is any big use case
for hash indexes, even though there may be some.  In that discussion, as
he says (and I don't doubt him), he would have told me the alternative,
but it was not apparent to me that he was expecting some sort of
comparison.

What I got from both discussions was a friendly gesture that it might be
a better use of my time to work on some other problem.  I really respect
suggestions from both of them, but it was nowhere clear to me that
either of them was expecting any comparison with the other approach.

Considering that I have missed the real intention of their suggestions,
I think such a serious objection to any work should be discussed on the
list.  To answer the actual objection, I have already mentioned upthread
that we can deduce from the current tests done by Jesper and Mithun that
there are some cases where a hash index will be better than
hash-over-btree (tests done over integer columns).  I think any
discussion on whether we should consider not improving the current hash
indexes is only meaningful if someone has code which can prove, both
theoretically and practically, that it is better than hash indexes for
all usages.

Note - excuse me for the formatting of this email, as I am travelling
and using my phone.

With Regards,
Amit Kapila.


Re: [HACKERS] Hash Indexes

2016-09-29 Thread Amit Kapila
> do while we're holding a shared content lock on the metapage, because
> of the principle that we don't want to hold content locks across I/O.
>

I think we can release the metapage content lock before reading the buffer.
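
Something along these lines, using the standard buffer-manager calls (an
illustration of the ordering only, not the actual patch):

    LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);   /* drop content lock, keep the pin */
    buf = ReadBuffer(rel, blkno);              /* possible I/O, no content lock held */
    LockBuffer(buf, BUFFER_LOCK_SHARE);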

>  -- then, per read request:
> release pin on metapage
> -read current page of bucket and take shared buffer content lock
> -step to next page if necessary (no chaining of locks)
> +if the split is in progress for current bucket and this is a new bucket
> +release the buffer content lock on current bucket page
> +pin and acquire the buffer content lock on old bucket in shared mode
> +release the buffer content lock on old bucket, but not pin
> +retake the buffer content lock on new bucket
> +mark the scan such that it skips the tuples that are marked
> as moved by split
>
> Aren't these steps done just once per scan?  If so, I think they
> should appear before "-- then, per read request" which AIUI is
> intended to imply a loop over tuples.
>

As per the code, there is no such intention (no loop over tuples).  It
is about reading the page and getting the tuple.

> +step to next page if necessary (no chaining of locks)
> +if the scan indicates moved by split, then move to old bucket
> after the scan
> +of current bucket is finished
>  get tuple
>  release buffer content lock and pin on current page
>  -- at scan shutdown:
> -release bucket share-lock
>
> Don't we have a pin to release at scan shutdown in the new system?
>

Yes, it is mentioned in line below:

+ release any pin we hold on current buffer, old bucket buffer, new
bucket buffer
+


> Well, I was hoping to get through the whole patch in one email, but
> I'm not even all the way through the README.  However, it's late, so
> I'm stopping here for now.
>

Thanks for the review!



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-29 Thread Amit Kapila
On Thu, Sep 29, 2016 at 12:56 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Thu, Sep 29, 2016 at 6:40 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> Yes, definitely - we're missing something important, I think. One difference
>> is that Dilip is using longer runs, but I don't think that's a problem (as I
>> demonstrated how stable the results are).
>>
>> I wonder what CPU model is Dilip using - I know it's x86, but not which
>> generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer
>> model and it makes a difference (although that seems unlikely).
>
> I am using "Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz "
>

Another difference is that the machine on which Dilip is doing the tests has 8 sockets.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] wal_segment size vs max_wal_size

2016-09-26 Thread Amit Kapila
On Mon, Sep 26, 2016 at 4:00 PM, Kuntal Ghosh
<kuntalghosh.2...@gmail.com> wrote:
> On Wed, Sep 21, 2016 at 8:33 PM, Peter Eisentraut
> <peter.eisentr...@2ndquadrant.com> wrote:
>> There is apparently some misbehavior if max_wal_size is less than 5 *
>> wal_segment_size.
>>
>
>> This should probably be made friendlier in some way.  But it also shows
>> that bigger WAL segment sizes are apparently not well-chartered
>> territory lately.
>>
> Well, there can be multiple solutions to this problem.
> 1. If somebody intends to increase wal segment size, he should
> increase max_wal_size accordingly.
> 2. In recovery test, we can add some delay before taking backup so
> that the pending logs in the buffer
> gets flushed. (Not a good solution)
> 3. In CreateRestartPoint() method, we can force a XLogFlush to update
> minRecoveryPoint.
>

IIRC, there is already a patch [1] to update the minRecoveryPoint
correctly; can you check whether that solves the problem for you?

[1] - 
https://www.postgresql.org/message-id/20160609.215558.118976703.horiguchi.kyotaro%40lab.ntt.co.jp

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Possibly too stringent Assert() in b-tree code

2016-09-25 Thread Amit Kapila
On Mon, Sep 26, 2016 at 8:54 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Sun, Sep 25, 2016 at 9:31 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Amit Kapila <amit.kapil...@gmail.com> writes:
>>> On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>> This is clearly an oversight in Simon's patch fafa374f2, which introduced
>>>> this code without any consideration for the possibility that the page
>>>> doesn't have a valid special area. ...
>>>> but I'm not very clear on whether this is a safe fix, mainly because
>>>> I don't understand what the reuse WAL record really accomplishes.
>>>> Maybe we need to instead generate a reuse record with some special
>>>> transaction ID indicating worst-case assumptions?
>>
>>> I think it is basically to ensure that the queries on standby should
>>> not be considering the page being recycled as valid and if there is
>>> any pending query to which this page is visible, cancel it.  If this
>>> understanding is correct, then I think the fix you are proposing is
>>> correct.
>>
>> After thinking about it some more, I do not understand why we are emitting
>> WAL here at all in *any* case, either for a new page or for a deleted one
>> that we're about to recycle.  Why wouldn't the appropriate actions have
>> been taken when the page was deleted, or even before that when its last
>> tuple was deleted?
>>
>
> It seems to me that we do take actions for conflict resolution during
> the page deletion (that looks to be covered by XLOG_HEAP2_CLEANUP_INFO
> which we emit in vacuum), but not sure if that is sufficient.
> Consider a case where the new transaction is started on standby after
>

Here, by new transaction, I mean some newer snapshot with a valid
MyPgXact->xmin.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Possibly too stringent Assert() in b-tree code

2016-09-25 Thread Amit Kapila
On Sun, Sep 25, 2016 at 9:31 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>> On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> This is clearly an oversight in Simon's patch fafa374f2, which introduced
>>> this code without any consideration for the possibility that the page
>>> doesn't have a valid special area. ...
>>> but I'm not very clear on whether this is a safe fix, mainly because
>>> I don't understand what the reuse WAL record really accomplishes.
>>> Maybe we need to instead generate a reuse record with some special
>>> transaction ID indicating worst-case assumptions?
>
>> I think it is basically to ensure that the queries on standby should
>> not be considering the page being recycled as valid and if there is
>> any pending query to which this page is visible, cancel it.  If this
>> understanding is correct, then I think the fix you are proposing is
>> correct.
>
> After thinking about it some more, I do not understand why we are emitting
> WAL here at all in *any* case, either for a new page or for a deleted one
> that we're about to recycle.  Why wouldn't the appropriate actions have
> been taken when the page was deleted, or even before that when its last
> tuple was deleted?
>

It seems to me that we do take actions for conflict resolution during
the page deletion (that looks to be covered by XLOG_HEAP2_CLEANUP_INFO,
which we emit during vacuum), but I am not sure that is sufficient.
Consider a case where a new transaction is started on the standby after
the page deletion and its xmin still precedes the xact value stored on
the page; such transactions must be cancelled before we can reuse the
page.  I think the fact that, before recycling the page, we ensure on
the master that no transaction is interested in that page indicates
that something similar is required on the standby as well.
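
For reference, replaying the reuse record on the standby does roughly the
following (a simplified sketch based on my reading of the 9.6-era
nbtxlog.c; the structure and function names are from memory and should be
treated as assumptions):

    static void
    btree_xlog_reuse_page(XLogReaderState *record)
    {
        xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);

        /* cancel standby queries whose snapshot could still see the old page */
        if (InHotStandby)
            ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
                                                xlrec->node);
    }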

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-24 Thread Amit Kapila
On Sun, Sep 25, 2016 at 10:30 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>
>> I think here I am slightly wrong.  For the full page writes, it do use
>> RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not
>> doing page verification check and rather blindly setting the page to
>> zero and then overwrites it with full page image.  So after my fix,
>> you will not see the error of checksum failure.  I have a fix ready,
>> but still doing some more verification.  If everything passes, I will
>> share the patch in a day or so.
>>
>
> Attached patch fixes the problem, now we do perform full page writes
> for bitmap pages.  Apart from that, I have rebased the patch based on
> latest concurrent index patch [1].  I have updated the README as well
> to reflect the WAL logging related information for different
> operations.
>
> With attached patch, all the review comments or issues found till now
> are addressed.
>

I forgot to mention that Ashutosh has tested this patch for a day using
Jeff's tool and didn't find any problem.  He has also found a way to
easily reproduce the problem.  Ashutosh, can you share your changes to
the script with which you reproduced the problem?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-24 Thread Amit Kapila
On Sat, Sep 24, 2016 at 10:49 PM, Greg Stark <st...@mit.edu> wrote:
> On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> But to kick the hash AM as such to the curb is to say
>> "sorry, there will never be O(1) index lookups in Postgres".
>
> Well there's plenty of halfway solutions for that. We could move hash
> indexes to contrib or even have them in core as experimental_hash or
> unlogged_hash until the day they achieve their potential.
>
> We definitely shouldn't discourage people from working on hash indexes
>

Okay, but to me it appears that naming it experimental_hash or moving it
to contrib could discourage people, or at the very least make them less
motivated.  Thinking along those lines a year or so back might have been
a wise direction, but now, when a lot of work has already been done for
hash indexes (patches to make them WAL-logged, more concurrent and
performant, plus a pageinspect module are available) and still more is
in progress, that sounds like a step backward rather than forward.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Possibly too stringent Assert() in b-tree code

2016-09-24 Thread Amit Kapila
On Fri, Sep 23, 2016 at 12:21 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Mon, Sep 19, 2016 at 7:07 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> I think you have a valid point.  It seems we don't need to write WAL
>>> for reuse page (aka don't call _bt_log_reuse_page()), if the page is
>>> new, as the only purpose of that log is to handle conflict based on
>>> transaction id stored in special area which will be anyway zero.
>
>> +1.
>
> This is clearly an oversight in Simon's patch fafa374f2, which introduced
> this code without any consideration for the possibility that the page
> doesn't have a valid special area.  We could prevent the crash by
> doing nothing if PageIsNew, a la
>
> if (_bt_page_recyclable(page))
> {
> /*
>  * If we are generating WAL for Hot Standby then create a
>  * WAL record that will allow us to conflict with queries
>  * running on standby.
>  */
> -   if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
> +   if (XLogStandbyInfoActive() && RelationNeedsWAL(rel) &&
> +   !PageIsNew(page))
> {
> BTPageOpaque opaque = (BTPageOpaque) 
> PageGetSpecialPointer(page);
>
> _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
> }
>
> /* Okay to use page.  Re-initialize and return it */
>
> but I'm not very clear on whether this is a safe fix, mainly because
> I don't understand what the reuse WAL record really accomplishes.
> Maybe we need to instead generate a reuse record with some special
> transaction ID indicating worst-case assumptions?
>

I think it is basically to ensure that queries on the standby do not
consider the page being recycled as valid, and that any pending query to
which this page is still visible gets cancelled.  If this understanding
is correct, then I think the fix you are proposing is correct.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] raised checkpoint limit & manual checkpoint

2016-09-24 Thread Amit Kapila
On Sat, Sep 24, 2016 at 12:12 PM, Fabien COELHO <coe...@cri.ensmp.fr> wrote:
>
> Hello,
>
> The checkpoint time limit has just been raised to one day after a discussion
> started by Andres Freund:
>
> https://www.postgresql.org/message-id/20160202001320.GP8743%40awork2.anarazel.de
>
> I would have gone further up, say one week or even one month, but I think
> that this new limit is an improvement over the previous 1 hour maximum.
>
> Now ISTM that there is a possible use case which arises with this new
> setting and is not well addressed by postgresql:
>
> Let us say that an application has periods of high and low usage, say over a
> day, so that I want to avoid a checkpoint from 8 to 20, but I'm okay after
> that. I could raise the size and time limits so that they do not occur
> during these hours and plan to do a manual CHECKPOINT once a day when I see
> fit.
>
> Now the problem I see is that CHECKPOINT means "do a CHECKPOINT right now as
> fast as possible", i.e. there is no throttling whatsoever, which leads to
> heavy IO and may result in a very unresponsive database.
>
> I would suggest that a good complementary feature would be to allow a manual
> checkpoint to run over a period of time, say something like:
>
>   CHECKPOINT OVER '10 hours';
>
> That would target to complete after this period (whether it succeeds or not
> is another issue) instead of going as fast as possible, thus avoiding
> some performance degradation.
>

Doesn't it somewhat overlap with the existing parameter
checkpoint_completion_target?  You can use checkpoint_completion_target
to throttle checkpoints.  The option you are suggesting seems more
straightforward, but how will the user decide how much time he wants the
checkpoint to take?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WIP: Covering + unique indexes.

2016-09-24 Thread Amit Kapila
On Wed, Sep 21, 2016 at 6:51 PM, Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
> 20.09.2016 08:21, Amit Kapila:
>
> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova
> <a.lubennik...@postgrespro.ru> wrote:
>
> 28.08.2016 09:13, Amit Kapila:
>
>
> The problem seems really tricky, but the answer is simple.
> We store included columns unordered. It was mentioned somewhere in
> this thread.
>

Is there any fundamental problem in storing them in an ordered way?  I
mean to say, you anyway need to store all the column values on the leaf
page, so why can't we find the exact location for the complete key?
Basically, use the truncated key to reach the leaf level and then use
the complete key to find the exact location to store the key.  I might
be missing something here, but if we can store them in an ordered
fashion, we can use them even for queries containing ORDER BY (where the
ORDER BY contains included columns).

> Let me give you an example:
>
> create table t (i int, p point);
> create index on (i) including (p);
> "point" data type doesn't have any opclass for btree.
> Should we insert (0, '(0,2)') before (0, '(1,1)') or after?
> We have no idea what is the "correct order" for this attribute.
> So the answer is "it doesn't matter". When searching in index,
> we know that only key attrs are ordered, so only them can be used
> in scankey. Other columns are filtered after retrieving data.
>
> explain select i,p from t where i =0 and p <@ circle '((0,0),2)';
> QUERY PLAN
> ---
>  Index Only Scan using idx on t  (cost=0.14..4.20 rows=1 width=20)
>Index Cond: (i = 0)
>Filter: (p <@ '<(0,0),2>'::circle)
>

I think the reason for using a Filter here is that we don't keep the
included columns in the scan keys.  Can't we think of having them in the
scan keys, but use only the key columns to descend to the leaf level and
then use the complete scan key at the leaf level?

>
> The same approach is used for included columns of any type, even if
> their data types have opclass.
>
> Is this truncation concept of high key needed for correctness of patch
> or is it just to save space in index?   If you need this, then I think
> nbtree/Readme needs to be updated.
>
>
> Now it's done only for space saving. We never check included attributes
> in non-leaf pages, so why store them? Especially if we assume that included
> attributes can be quite long.
> There is already a note in documentation:
>
> +It's the same with other constraints (PRIMARY KEY and EXCLUDE).
> This can
> +also can be used for non-unique indexes as any columns which are
> not required
> +for the searching or ordering of records can be included in the
> +INCLUDING clause, which can slightly reduce the size of
> the index,
> +due to storing included attributes only in leaf index pages.
>

Okay, thanks for the clarification.

> What should I add to README (or to documentation),
> to make it more understandable?
>

Maybe add the data representation details, e.g. that only leaf pages
contain all the columns, and how the scan works.  I think you can see
whether you can extend the "Notes About Data Representation" and/or
"Other Things That Are Handy to Know" sections in the existing README.

> -- I am getting Assertion failure when I use this patch with database
> created with a build before this patch.  However, if I create a fresh
> database it works fine.  Assertion failure details are as below:
>
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> TRAP: unrecognized TOAST vartag("((bool) 1)", File:
> "src/backend/access/common/h
> eaptuple.c", Line: 532)
> LOG:  server process (PID 1404) was terminated by exception 0x8003
> HINT:  See C include file "ntstatus.h" for a description of the hexadecimal
> valu
> e.
> LOG:  terminating any other active server processes
>
>
> That is expected behavior, because catalog versions are not compatible.
> But I wonder why there was no message about that?
> I suppose, that's because CATALOG_VERSION_NO was outdated in my
> patch. As well as I know, committer will change it before the commit.
> Try new patch with updated value. It should fail with a message about
> incompatible versions.
>

Yeah, that must be the reason, but let's not change it now; otherwise we
will face conflicts while applying the patch.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] store narrow values in hash indexes?

2016-09-23 Thread Amit Kapila
On Sat, Sep 24, 2016 at 1:02 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Currently, hash indexes always store the hash code in the index, but
> not the actual Datum.  It's recently been noted that this can make a
> hash index smaller than the corresponding btree index would be if the
> column is wide.  However, if the index is being built on a fixed-width
> column with a typlen <= sizeof(Datum), we could store the original
> value in the hash index rather than the hash code without using any
> more space.  That would complicate the code, but I bet it would be
> faster: we wouldn't need to set xs_recheck, we could rule out hash
> collisions without visiting the heap, and we could support index-only
> scans in such cases.
>

What exactly do you mean by Datum?  Is it for data types that fit into
64 bits, like integer?  I think if we are able to support index-only
scans for hash indexes for some data types, that will be a huge plus.
Surely, there is some benefit without index-only scans as well, namely
that we can avoid the recheck, but I am not sure that alone can give us
any big performance boost.  As you say, it might lead to some
complication in the code, but I think it is worth trying.

Won't it add some requirements for pg_upgrade as well?
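
If I understand the proposal correctly, the choice at index-tuple-formation
time would look roughly like this (a sketch only, with assumed variable
names, not code from any patch):

    /* att: Form_pg_attribute of the indexed column, value: the column Datum */
    Datum   stored;

    if (att->attbyval && att->attlen > 0 && att->attlen <= sizeof(Datum))
        stored = value;                         /* keep the original value */
    else
        stored = UInt32GetDatum(_hash_datum2hashkey(rel, value));   /* current behaviour */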

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 8:22 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 09/23/2016 03:07 PM, Amit Kapila wrote:
>>
>> On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>>
>>> On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>>>>
>>>>
>>>> ...
>>>> The 4.5 kernel clearly changed the results significantly:
>>>>
>>> ...
>>>>
>>>>
>>>>
>>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>>> eliminated the fluctuations in the results. For example when 3.2.80
>>>> produced this results (10 runs with the same parameters):
>>>>
>>>> 12118 11610 27939 11771 18065
>>>> 12152 14375 10983 13614 11077
>>>>
>>>> we get this on 4.5.5
>>>>
>>>> 37354 37650 37371 37190 37233
>>>> 38498 37166 36862 37928 38509
>>>>
>>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>>
>>>
>>> The more I think about these random spikes in pgbench performance on
>>> 3.2.80,
>>> the more I find them intriguing. Let me show you another example (from
>>> Dilip's workload and group-update patch on 64 clients).
>>>
>>> This is on 3.2.80:
>>>
>>>   44175  34619  51944  38384  49066
>>>   37004  47242  36296  46353  36180
>>>
>>> and on 4.5.5 it looks like this:
>>>
>>>   34400  35559  35436  34890  34626
>>>   35233  35756  34876  35347  35486
>>>
>>> So the 4.5.5 results are much more even, but overall clearly below
>>> 3.2.80.
>>> How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
>>> randomly do something right, but what is it and why doesn't it happen on
>>> the
>>> new kernel? And how could we do it every time?
>>>
>>
>> As far as I can see you are using default values of min_wal_size,
>> max_wal_size, checkpoint related params, have you changed default
>> shared_buffer settings, because that can have a bigger impact.
>
>
> Huh? Where do you see me using default values?
>

I was referring to one of your scripts at http://bit.ly/2doY6ID.  I
hadn't noticed that you had changed the default values in
postgresql.conf.

> There are settings.log with a
> dump of pg_settings data, and the modified values are
>
> checkpoint_completion_target = 0.9
> checkpoint_timeout = 3600
> effective_io_concurrency = 32
> log_autovacuum_min_duration = 100
> log_checkpoints = on
> log_line_prefix = %m
> log_timezone = UTC
> maintenance_work_mem = 524288
> max_connections = 300
> max_wal_size = 8192
> min_wal_size = 1024
> shared_buffers = 2097152
> synchronous_commit = on
> work_mem = 524288
>
> (ignoring some irrelevant stuff like locales, timezone etc.).
>
>> Using default values of mentioned parameters can lead to checkpoints in
>> between your runs.
>
>
> So I'm using 16GB shared buffers (so with scale 300 everything fits into
> shared buffers), min_wal_size=16GB, max_wal_size=128GB, checkpoint timeout
> 1h etc. So no, there are no checkpoints during the 5-minute runs, only those
> triggered explicitly before each run.
>

Thanks for the clarification.  Do you think we should try different
settings for the *_flush_after parameters, as those can help in reducing
spikes in writes?

>> Also, I think instead of 5 mins, read-write runs should be run for 15
>> mins to get consistent data.
>
>
> Where does the inconsistency come from?

That's what I am also curious to know.

> Lack of warmup?

Can't say, but at least we should try to rule out the possibilities.  I
think one way to rule that out is to do slightly longer runs for Dilip's
test cases, and for pgbench we might need to drop and re-create the
database after each reading.

> Considering how
> uniform the results from the 10 runs are (at least on 4.5.5), I claim this
> is not an issue.
>

It is quite possible that it is some kernel regression which might be
fixed in a later version.  For example, we do most tests on cthulhu,
which has a 3.10 kernel, and we generally get consistent results.  I am
not sure whether a later kernel version, say 4.5.5, is a net win,
because there is a considerable dip in performance in that version, even
though it produces quite stable results.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared m

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 7:41 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>> Now, it is quite possible
>> that error code is set to 0 on success in my windows environment
>> (Win7) and doesn't work in some other environment.  In any case, if we
>> want to go ahead and don't want to rely on CreateFileMapping(), then
>> attached patch should suffice the need.
>
> Yeah, seeing that this is not mentioned in MS' documentation I don't
> think we should rely on it.  The patch looks fine to me, pushed.
>

Thanks.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 6:50 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> I don't dare to suggest rejecting the patch, but I don't see how we could
>> commit any of the patches at this point. So perhaps "returned with feedback"
>> and resubmitting in the next CF (along with analysis of improved workloads)
>> would be appropriate.
>
> I think it would be useful to have some kind of theoretical analysis
> of how much time we're spending waiting for various locks.  So, for
> example, suppose we one run of these tests with various client counts
> - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run "select
> wait_event from pg_stat_activity" once per second throughout the test.
> Then we see how many times we get each wait event, including NULL (no
> wait event).  Now, from this, we can compute the approximate
> percentage of time we're spending waiting on CLogControlLock and every
> other lock, too, as well as the percentage of time we're not waiting
> for lock.  That, it seems to me, would give us a pretty clear idea
> what the maximum benefit we could hope for from reducing contention on
> any given lock might be.
>

As mentioned earlier, such an activity makes sense; however, today,
while rereading this thread, I noticed that Dilip has already posted
some analysis of lock contention upthread [1].  It is clear that the
patch has reduced LWLock contention from ~28% to ~4% (where the major
contributor was TransactionIdSetPageStatus, which has dropped from ~53%
to ~3%).  Isn't that in line with what you are looking for?


[1] - 
https://www.postgresql.org/message-id/CAFiTN-u-XEzhd%3DhNGW586fmQwdTy6Qy6_SXe09tNB%3DgBcVzZ_A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 6:29 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
> On Fri, Sep 23, 2016 at 6:05 PM, Tomas Vondra <tomas.von...@2ndquadrant.com>
> wrote:
>>
>> On 09/23/2016 05:10 AM, Amit Kapila wrote:
>>>
>>> On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
>>> <tomas.von...@2ndquadrant.com> wrote:
>>>>
>>>> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>>>>
>>>>>
>>>>
>>>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>>>> eliminated the fluctuations in the results. For example when 3.2.80
>>>> produced
>>>> this results (10 runs with the same parameters):
>>>>
>>>> 12118 11610 27939 11771 18065
>>>> 12152 14375 10983 13614 11077
>>>>
>>>> we get this on 4.5.5
>>>>
>>>> 37354 37650 37371 37190 37233
>>>> 38498 37166 36862 37928 38509
>>>>
>>>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>>>
>>>
>>> how long each run was?  Generally, I do half-hour run to get stable
>>> results.
>>>
>>
>> 10 x 5-minute runs for each client count. The full shell script driving
>> the benchmark is here: http://bit.ly/2doY6ID and in short it looks like
>> this:
>>
>> for r in `seq 1 $runs`; do
>> for c in 1 8 16 32 64 128 192; do
>> psql -c checkpoint
>> pgbench -j 8 -c $c ...
>> done
>> done
>
>
>
> I see couple of problems with the tests:
>
> 1. You're running regular pgbench, which also updates the small tables. At
> scale 300 and higher clients, there is going to heavy contention on the
> pgbench_branches table. Why not test with pgbench -N? As far as this patch
> is concerned, we are only interested in seeing contention on
> ClogControlLock. In fact, how about a test which only consumes an XID, but
> does not do any write activity at all? Complete artificial workload, but
> good enough to tell us if and how much the patch helps in the best case. We
> can probably do that with a simple txid_current() call, right?
>

Right, that is why in the initial tests done by Dilip, he used
SELECT ... FOR UPDATE.  I think using txid_current will generate a lot
of contention on XidGenLock, which will mask the contention around
CLogControlLock; in fact, we have tried that.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 6:16 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 09/23/2016 01:44 AM, Tomas Vondra wrote:
>>
>> ...
>> The 4.5 kernel clearly changed the results significantly:
>>
> ...
>>
>>
>> (c) Although it's not visible in the results, 4.5.5 almost perfectly
>> eliminated the fluctuations in the results. For example when 3.2.80
>> produced this results (10 runs with the same parameters):
>>
>> 12118 11610 27939 11771 18065
>> 12152 14375 10983 13614 11077
>>
>> we get this on 4.5.5
>>
>> 37354 37650 37371 37190 37233
>> 38498 37166 36862 37928 38509
>>
>> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>>
>
> The more I think about these random spikes in pgbench performance on 3.2.80,
> the more I find them intriguing. Let me show you another example (from
> Dilip's workload and group-update patch on 64 clients).
>
> This is on 3.2.80:
>
>   44175  34619  51944  38384  49066
>   37004  47242  36296  46353  36180
>
> and on 4.5.5 it looks like this:
>
>   34400  35559  35436  34890  34626
>   35233  35756  34876  35347  35486
>
> So the 4.5.5 results are much more even, but overall clearly below 3.2.80.
> How does 3.2.80 manage to do ~50k tps in some of the runs? Clearly we
> randomly do something right, but what is it and why doesn't it happen on the
> new kernel? And how could we do it every time?
>

As far as I can see you are using default values of min_wal_size,
max_wal_size and the checkpoint-related parameters; have you changed the
default shared_buffers setting?  That can have a bigger impact.  Using
default values of the mentioned parameters can lead to checkpoints in
between your runs.  Also, I think read-write runs should be run for 15
minutes instead of 5 to get consistent data.  For Dilip's workload,
where he is using only SELECT ... FOR UPDATE, I think it is okay, but
otherwise you need to drop and re-create the database between runs;
otherwise data bloat could impact the readings.

I think in general the impact should be the same for both kernels
because you are using the same parameters, but if you use appropriate
parameters, then you can get consistent results on 3.2.80.  I have also
seen variation in read-write tests, but the variation you are showing is
really a matter of concern, because it will be difficult to rely on the
final data.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pageinspect: Hash index support

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 6:11 PM, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 9/23/16 1:56 AM, Amit Kapila wrote:
>> which comment are you referring here?  hashm_mapp contains block
>> numbers of bitmap pages.
>
> The comment I'm referring to says
>
> The blknos of these bitmap pages are kept in bitmaps[]; nmaps is the
> number of currently existing bitmaps.
>
> But there is no "bitmaps" field anywhere.
>

Okay.  You are right, it should be hashm_mapp.

>> In the above code, it appears that you are trying to calculate
>> max_avail space for all pages in same way.  Don't you need to
>> calculate it differently for bitmap page or meta page as they don't
>> share the same format as other type of pages?
>
> Is this even useful for hash indexes?
>

I think so.  It will be helpful for bucket and overflow pages, which
store index tuples similarly to btree.  Is there a reason why you
think it won't be useful?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-23 Thread Amit Kapila
On Thu, Sep 22, 2016 at 10:16 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>>
>>
>> Correct.  But any torn page write must be covered by the restoration of a
>> full page image during replay, shouldn't it?  And that restoration should
>> happen blindly, without first reading in the old page and verifying the
>> checksum.
>>
>
> Probably, but I think this is not currently the way it is handled and
> I don't want to change it.  AFAIU, what happens now is that we first
> read the old page (and we do page verification while reading old page
> and display *warning* if checksum doesn't match) and then restore the
> image.  The relevant code path is
> XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().
>
> Now, here the point is that why instead of WARNING we are seeing FATAL
> for the bitmap page of hash index.  The reason as explained in my
> previous e-mail is that for bitmap page we are not logging full page
> image and we should fix that as explained there.  Once the full page
> image is logged, then first time it reads the torn page, it will use
> flag RBM_ZERO_AND_LOCK which will make the FATAL error you are seeing
> to WARNING.
>

I think I was slightly wrong here.  For full page writes, we do use
RBM_ZERO_AND_LOCK mode to read the page, and in that mode we skip the
page verification check: the page is blindly zeroed and then
overwritten with the full page image.  So after my fix, you will not
see the checksum failure error.  I have a fix ready, but am still
doing some more verification.  If everything passes, I will share the
patch in a day or so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment

2016-09-23 Thread Amit Kapila
On Thu, Sep 22, 2016 at 10:40 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>> On Tue, Sep 20, 2016 at 10:33 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> ISTM both the previous coding and this version can fail for no good
>>> reason, that is what if GetLastError() happens to return one of these
>>> error codes as a result of some unrelated failure from before this
>>> subroutine is entered?  That is, wouldn't it be a good idea to
>>> do SetLastError(0) before calling CreateFileMapping?
>
>> Yes, that seems like a good idea.  Do you need a patch with some
>> testing on windows environment?
>
> Please; I can't test it.
>

The attached patch tightens the error handling.  However, while
debugging, I found that CreateFileMapping() always sets the error code
to 0 on success.  Basically, before calling CreateFileMapping() I set
the error code to 10 (SetLastError(10)), and after CreateFileMapping()
it is set to 0 on success and to an appropriate error code on failure.
I also verified, by calling GetLastError() before CreateFileMapping(),
that the error code was indeed 10 at that point.  Now, it is quite
possible that the error code is set to 0 on success only in my Windows
environment (Win7) and not in some other environment.  In any case, if
we want to go ahead and not rely on CreateFileMapping()'s behaviour,
the attached patch should suffice.
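
A minimal standalone sketch of the pattern (the mapping name and size
below are made up for illustration; this is not the attached patch).
The point is simply that clearing the last-error code first means a
later ERROR_ALREADY_EXISTS can only have come from this call, not from
some unrelated earlier failure:

#include <windows.h>
#include <stdio.h>

int
main(void)
{
    HANDLE  hmap;

    SetLastError(0);            /* don't trust a stale error code */
    hmap = CreateFileMappingA(INVALID_HANDLE_VALUE,
                              NULL,         /* default security */
                              PAGE_READWRITE,
                              0, 65536,     /* 64 kB segment */
                              "PostgreSQL.dsm.example");
    if (hmap == NULL)
    {
        fprintf(stderr, "CreateFileMapping failed: %lu\n", GetLastError());
        return 1;
    }
    if (GetLastError() == ERROR_ALREADY_EXISTS)
    {
        /* name collision: caller should close and retry with another name */
        CloseHandle(hmap);
        return 2;
    }
    printf("created new mapping\n");
    CloseHandle(hmap);
    return 0;
}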


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


clear_errorcode_dsm-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/PostgreSQL.851401618

2016-09-23 Thread Amit Kapila
On Fri, Sep 23, 2016 at 1:21 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Tue, Sep 20, 2016 at 12:53 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> I'd be the first to agree that this point is inadequately documented
>>> in the code, but PostmasterRandom should be reserved for its existing
>>> security-related uses, not exposed to the world for (ahem) random other
>>> uses.
>
>> So, we could have dsm_postmaster_startup() seed the random number
>> generator itself, and then let PostmasterRandom() override the seed
>> later.  Like maybe:
>
> Works for me, at least as a temporary solution.  The disturbing thing
> here is that this still only does what we want if dsm_postmaster_startup
> happens before the first PostmasterRandom call --- which is OK today but
> seems pretty fragile.  Still, redesigning PostmasterRandom's seeding
> technique is not something to do right before 9.6 release.  Let's revert
> the prior patch and do it as you have here:
>
>> struct timeval tv;
>> gettimeofday(&tv, NULL);
>> srandom(tv.tv_sec);
>> ...
>> dsm_control_handle = random();
>
> for the time being.
>

Isn't it better to use the same technique in dsm_create() as well,
which uses random() for the handle?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2016-09-22 Thread Amit Kapila
On Fri, Sep 23, 2016 at 9:40 AM, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 9/21/16 9:30 AM, Jesper Pedersen wrote:
>> Attached is v5, which add basic page verification.
>
> There are still some things that I found that appear to follow the old
> style (btree) conventions instead the new style (brin, gin) conventions.
>
> - Rename hash_metap to hash_metapage_info.
>
> - Don't use BuildTupleFromCStrings().  (There is nothing inherently
> wrong with it, but why convert numbers to strings and back again when it
> can be done more directly.)
>
> - Documentation for hash_page_stats still has blkno argument.
>
> - In hash_metap, the magic field could be type bytea, so that it
> displays in hex.  Or text like the brin functions.
>
> Some other comments:
>
> - hash_metap result fields spares and mapp should be arrays of integer.
>

How would you like to see both those arrays in the tuple?  Right now,
I think Jesper's code is showing them as strings.

> (Incidentally, the comment in hash.h talks about bitmaps[] but I think
> it means hashm_mapp[].)
>

Which comment are you referring to here?  hashm_mapp contains the
block numbers of bitmap pages.

While looking at the patch, I noticed the below code, which seems somewhat problematic:

+ stat->max_avail = BLCKSZ - (BLCKSZ - phdr->pd_special + SizeOfPageHeaderData);
+
+ /* page type (flags) */
+ if (opaque->hasho_flag & LH_META_PAGE)
+ stat->type = 'm';
+ else if (opaque->hasho_flag & LH_OVERFLOW_PAGE)
+ stat->type = 'v';
+ else if (opaque->hasho_flag & LH_BUCKET_PAGE)
+ stat->type = 'b';
+ else if (opaque->hasho_flag & LH_BITMAP_PAGE)
+ stat->type = 'i';

In the above code, it appears that you are trying to calculate
max_avail space for all pages in the same way.  Don't you need to
calculate it differently for bitmap and meta pages, as they don't
share the same format as the other page types?
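
To make the suggestion concrete, a hypothetical variant (reusing the
variables from the snippet above; the exact value reported for meta
and bitmap pages is of course open for discussion) could look like:

/* hypothetical: only bucket and overflow pages hold index tuples */
if (opaque->hasho_flag & (LH_BUCKET_PAGE | LH_OVERFLOW_PAGE))
    stat->max_avail = BLCKSZ - (BLCKSZ - phdr->pd_special + SizeOfPageHeaderData);
else
    stat->max_avail = 0;    /* meta and bitmap pages have no tuple space */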

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tracking wait event for latches

2016-09-22 Thread Amit Kapila
On Fri, Sep 23, 2016 at 7:02 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Sep 22, 2016 at 7:10 PM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>
>> I was thinking about suggesting a category "Replication" to cover the
>> waits for client IO relating to replication, as opposed to client IO
>> waits relating to regular user connections.  Then you could put sync
>> rep into that category instead of IPC, even though technically it is
>> waiting for IPC from walsender process(es), on the basis that it's
>> more newsworthy to a DBA that it's really waiting for a remote replica
>> to respond.  But it's probably pretty clear what's going on from the
>> the wait point names, so maybe it's not worth a category.  Thoughts?
>
> I thought about a replication category but either it will only have
> SyncRep in it, which is odd, or it will pull in other things that
> otherwise fit nicely into the Activity category, and then that
> boundaries of all the categories become mushy: is the subsystem that
> causes the wait that we are trying to document, or the kind of thing
> for which we are waiting?
>

I also think it can add some confusion in defining boundaries, and
might not be consistent with the way we have defined the waits so far.
However, there is some value in using the subsystem, namely that users
can identify bottlenecks more easily.  I think this applies to both
Replication and Parallel Query.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tracking wait event for latches

2016-09-22 Thread Amit Kapila
On Thu, Sep 22, 2016 at 7:19 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Sep 21, 2016 at 5:49 PM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>
> So, I tried to classify these.  Here's what I came up with.
>
> Activity: ArchiverMain, AutoVacuumMain, BgWriterMain,
> BgWriterHibernate, CheckpointerMain, PgStatMain, RecoveryWalAll,
> RecoveryWalStream, SysLoggerMain, WalReceiverMain, WalWriterMain
>
> Client: SecureRead, SecureWrite, SSLOpenServer, WalSenderMain,
> WalSenderWaitForWAL, WalSenderWriteData, WalReceiverWaitStart
>
> Timeout: BaseBackupThrottle, PgSleep, RecoveryApplyDelay
>
> IPC: BgWorkerShutdown, BgWorkerStartup, ExecuteGather,
> MessageQueueInternal, MessageQueuePutMessage, MessageQueueReceive,
> MessageQueueSend, ParallelFinish, ProcSignal, ProcSleep, SyncRep
>
> Extension: Extension
>

We already report LWLock waits from an extension as "extension", so I
think naming this category Extension as well might create some confusion.

> I classified all of the main loop waits as waiting for activity; all
> of those are background processes that are waiting for something to
> happen and are more or less happy to sleep forever until it does.  I
> also included the RecoveryWalAll and RecoveryWalStream events in
> there; those don't have the sort of "main loop" flavor of the others
> but they are happy to wait more or less indefinitely for something to
> occur.  Likewise, it was pretty easy to find all of the events that
> were waiting for client I/O, and I grouped those all under "Client".
> A few of the remaining wait events seemed like they were clearly
> waiting for a particular timeout to expire, so I gave those their own
> "Timeout" category.
>
> I believe these categorizations are actually useful for users.  For
> example, you might want to see all of the waits in the system but
> exclude the "Client", "Activity", and "Timeout" categories because
> those are things that aren't signs of a problem.  A "Timeout" wait is
> one that you explicitly requested, a "Client" wait isn't the server's
> fault, and an "Activity" wait just means nothing is happening.  In
> contrast, a "Lock" or "LWLock" or "IPC" wait shows that something is
> actually delaying work that we'd ideally prefer to have get done
> sooner.
>
> I grouped the rest of this stuff as "IPC" because all of these events
> are cases where one server process is waiting for another server
> processes .  That could be further subdivided, of course: most of
> those events are only going to occur in relation to parallel query,
> but I didn't want to group it that way explicitly because both
> background workers and shm_mq have other potential uses.  ProcSignal
> and ProcSleep are related to heavyweight locks and SyncRep is of
> course related to synchronous replication.   But they're all related
> in that one server process is waiting for another server process to
> tell it that a certain state has been reached, so IPC seems like a
> good categorization.
>
> Finally, extensions got their own category in this taxonomy, though I
> wonder if it would be better to instead have
> Activity/ExtensionActivity, Client/ExtensionClient,
> Timeout/ExtensionTimeout, and IPC/ExtensionIPC instead of making it a
> separate toplevel category.
>

+1. It can avoid confusion.

> To me, this seems like a pretty solid toplevel categorization and a
> lot more useful than just throwing all of these in one bucket and
> saying "good luck".

Agreed.  This categorisation is very good and can help the patch author proceed.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel sec scan in plpgsql

2016-09-22 Thread Amit Kapila
On Thu, Sep 22, 2016 at 7:32 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Sep 22, 2016 at 8:36 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> I think for certain cases like into clause, the rows passed will be
>> equal to actual number of rows, otherwise it will generate error.  So
>> we can pass that information in executor layer.  Another kind of cases
>> which are worth considering are when from plpgsql we fetch limited
>> rows at-a-time, but we fetch till end like the case of
>> exec_stmt_return_query().
>
> Yes, I think that those cases can be considered.  Some careful code
> inspection will be needed to make sure the cases we want to enable are
> safe, and some testing will be needed to make sure they behave
> properly.  But it doesn't sound like an unsolvable problem.  I hope
> someone who isn't me will decide to work on it.  :-)
>

Makes sense.  Along with that, we can also evaluate whether we can
enable parallel query from the other PLs.  If we can enable
parallelism from all PLs in 10.0, that will be a good step forward.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-22 Thread Amit Kapila
On Fri, Sep 23, 2016 at 5:14 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 09/21/2016 08:04 AM, Amit Kapila wrote:
>>
>
> (c) Although it's not visible in the results, 4.5.5 almost perfectly
> eliminated the fluctuations in the results. For example when 3.2.80 produced
> this results (10 runs with the same parameters):
>
> 12118 11610 27939 11771 18065
> 12152 14375 10983 13614 11077
>
> we get this on 4.5.5
>
> 37354 37650 37371 37190 37233
> 38498 37166 36862 37928 38509
>
> Notice how much more even the 4.5.5 results are, compared to 3.2.80.
>

How long was each run?  Generally, I do half-hour runs to get stable results.

> (d) There's no sign of any benefit from any of the patches (it was only
> helpful >= 128 clients, but that's where the tps actually dropped on 3.2.80
> - apparently 4.5.5 fixes that and the benefit is gone).
>
> It's a bit annoying that after upgrading from 3.2.80 to 4.5.5, the
> performance with 32 and 64 clients dropped quite noticeably (by more than
> 10%). I believe that might be a kernel regression, but perhaps it's a price
> for improved scalability for higher client counts.
>
> It of course begs the question what kernel version is running on the machine
> used by Dilip (i.e. cthulhu)? Although it's a Power machine, so I'm not sure
> how much the kernel matters on it.
>

cthulhu is an x86 machine and the kernel version is 3.10.  Seeing the
above results, I think the kernel version does matter, but does that
mean we should ignore the benefits we are seeing on a somewhat older
kernel version?  I think the right answer here is to do some
experiments that can show the actual contention, as suggested by
Robert and you.

> I'll ask someone else with access to this particular machine to repeat the
> tests, as I have a nagging suspicion that I've missed something important
> when compiling / running the benchmarks. I'll also retry the benchmarks on
> 3.2.80 to see if I get the same numbers.
>
>>
>> Okay, but I think it is better to see the results between 64~128
>> client count and may be greater than128 client counts, because it is
>> clear that patch won't improve performance below that.
>>
>
> There are results for 64, 128 and 192 clients. Why should we care about
> numbers in between? How likely (and useful) would it be to get improvement
> with 96 clients, but no improvement for 64 or 128 clients?
>

The only point was to see where we start seeing improvement; saying
that TPS has improved from a client count of >= 72 is different from
saying that it has improved from >= 128.

>> No issues, I have already explained why I think it is important to
>> reduce the remaining CLOGControlLock contention in yesterday's and
>> this mail. If none of you is convinced, then I think we have no
>> choice but to drop this patch.
>>
>
> I agree it's useful to reduce lock contention in general, but considering
> the last set of benchmarks shows no benefit with recent kernel, I think we
> really need a better understanding of what's going on, what workloads /
> systems it's supposed to improve, etc.
>
> I don't dare to suggest rejecting the patch, but I don't see how we could
> commit any of the patches at this point. So perhaps "returned with feedback"
> and resubmitting in the next CF (along with analysis of improved workloads)
> would be appropriate.
>

Agreed with your conclusion; I have changed the status of the patch in the CF accordingly.

Many thanks for doing the tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-22 Thread Amit Kapila
On Fri, Sep 23, 2016 at 7:17 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 09/23/2016 03:20 AM, Robert Haas wrote:
>>
>> On Thu, Sep 22, 2016 at 7:44 PM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>>
>>> I don't dare to suggest rejecting the patch, but I don't see how
>>> we could commit any of the patches at this point. So perhaps
>>> "returned with feedback" and resubmitting in the next CF (along
>>> with analysis of improvedworkloads) would be appropriate.
>>
>>
>> I think it would be useful to have some kind of theoretical analysis
>> of how much time we're spending waiting for various locks. So, for
>> example, suppose we one run of these tests with various client
>> counts - say, 1, 8, 16, 32, 64, 96, 128, 192, 256 - and we run
>> "select wait_event from pg_stat_activity" once per second throughout
>> the test. Then we see how many times we get each wait event,
>> including NULL (no wait event). Now, from this, we can compute the
>> approximate percentage of time we're spending waiting on
>> CLogControlLock and every other lock, too, as well as the percentage
>> of time we're not waiting for lock. That, it seems to me, would give
>> us a pretty clear idea what the maximum benefit we could hope for
>> from reducing contention on any given lock might be.
>>
>
> Yeah, I think that might be a good way to analyze the locks in general, not
> just got these patches. 24h run with per-second samples should give us about
> 86400 samples (well, multiplied by number of clients), which is probably
> good enough.
>
> We also have LWLOCK_STATS, that might be interesting too, but I'm not sure
> how much it affects the behavior (and AFAIK it also only dumps the data to
> the server log).
>

Right, I think LWLOCK_STATS gives us the count of how many times we
have blocked on a particular lock, as in the line below, where *blk*
gives that number.

PID 164692 lwlock main 11: shacq 2734189 exacq 146304 blk 73808
spindelay 73 dequeue self 57241

I think doing some experiments with both techniques can help us take a
call on these patches.
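
A rough standalone sketch of the sampling approach (the connection
string and the one-second cadence are only illustrative), polling
pg_stat_activity and tallying wait_event values:

#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    int         sample;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (sample = 0; sample < 3600; sample++)   /* one hour of samples */
    {
        PGresult   *res = PQexec(conn,
                     "SELECT coalesce(wait_event, '<none>'), count(*) "
                     "FROM pg_stat_activity GROUP BY 1");

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            int         i;

            for (i = 0; i < PQntuples(res); i++)
                printf("%d\t%s\t%s\n", sample,
                       PQgetvalue(res, i, 0), PQgetvalue(res, i, 1));
        }
        PQclear(res);
        sleep(1);
    }

    PQfinish(conn);
    return 0;
}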

Do we want these experiments on different kernel versions, are we okay
with the current version on cthulhu (3.10), or do we want to consider
only the results with the latest kernel?

>>
>>
>> Now, we could also try that experiment with various patches. If we
>> can show that some patch reduces CLogControlLock contention without
>> increasing TPS, they might still be worth committing for that
>> reason. Otherwise, you could have a chicken-and-egg problem. If
>> reducing contention on A doesn't help TPS because of lock B and
>> visca-versa, then does that mean we can never commit any patch to
>> reduce contention on either lock? Hopefully not. But I agree with you
>> that there's certainly not enough evidence to commit any of these
>> patches now. To my mind, these numbers aren't convincing.
>>
>
> Yes, the chicken-and-egg problem is why the tests were done with unlogged
> tables (to work around the WAL lock).
>

Yeah, but I suspect there was still an impact due to ProcArrayLock.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel sec scan in plpgsql

2016-09-22 Thread Amit Kapila
On Tue, Sep 20, 2016 at 8:31 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Sep 20, 2016 at 9:24 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> I think here point is that for any case where there is count of rows
>> to be selected, we disable parallelism.  There are many genuine cases
>> like select count(*) into cnt ... which will run to completion, but as
>> plpgsql passes row count to be 1 or 2, it doesn't enter parallel mode.
>> There are couple other cases like that.  Do you see a reason for not
>> enabling parallelism for such cases?
>
> If we can somehow know that the rowcount which is passed is greater
> than or equal to the actual number of rows which will be generated,
> then it's fine to enable parallelism.
>

I think for certain cases, like the INTO clause, the row count passed
will be equal to the actual number of rows (otherwise it generates an
error), so we can pass that information to the executor layer.
Another kind of case worth considering is when plpgsql fetches a
limited number of rows at a time but fetches until the end, as in
exec_stmt_return_query().


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] README of hash index

2016-09-22 Thread Amit Kapila
On Fri, Sep 16, 2016 at 9:56 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Fri, Sep 16, 2016 at 4:20 AM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>>
>> Currently README of hash module contain algorithms written in below form.
>>
>> The insertion algorithm is rather similar:
>>
>> pin meta page and take buffer content lock in shared mode
>> loop:
>> compute bucket number for target hash key
>> release meta page buffer content lock
>> if (correct bucket page is already locked)
>> break
>> release any existing bucket page lock (if a concurrent split happened)
>> take heavyweight bucket lock in shared mode
>> retake meta page buffer content lock in shared mode
>> -- (so far same as reader)
>> release pin on metapage
>> ..
>> ..
>>
>> I have mostly updated them in the patches I have proposed to improve
>> hash index.  However, each time I try to update them, I find that it
>> is easy to follow the code than to read and understand the existing
>> algorithm written in above form from README.
>>
>> Do others find it useful to maintain the algorithms in above form?
>
>
> I think that having them all condensed into one place makes it easier to
> think through the locking models to decide if there might be races or
> deadlocks.  If you only care about the algorithm for inserting in isolation,
> I agree reading the code might be better.
>


Thanks, Jeff and Ken, for the inputs.  I will keep the algorithms
updated in the hash index README.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-21 Thread Amit Kapila
On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>>
>> On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>> > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapil...@gmail.com>
>> > wrote:
>> >>
>> >>
>> >> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
>> >> that, I have changed _hash_alloc_buckets() to initialize the page
>> >> instead of making it completely Zero because of problems discussed in
>> >> another related thread [1].  I have also updated README.
>> >>
>> >
>> > with v7 of the concurrent has patch and v4 of the write ahead log patch
>> > and
>> > the latest relcache patch (I don't know how important that is to
>> > reproducing
>> > this, I suspect it is not), I once got this error:
>> >
>> >
>> > 38422  0 2016-09-19 16:25:50.055 PDT:LOG:  database system was
>> > interrupted; last known up at 2016-09-19 16:25:49 PDT
>> > 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
>> > properly shut down; automatic recovery in progress
>> > 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at
>> > 3F/2200DE90
>> > 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification
>> > failed,
>> > calculated checksum 65067 but expected 21260
>> > 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at
>> > 3F/22053B50
>> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>> > 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9
>> > of
>> > relation base/16384/17334
>> > 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at
>> > 3F/22053B50
>> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>> >
>> >
>> > The original page with the invalid checksum is:
>> >
>>
>> I think this is a example of torn page problem, which seems to be
>> happening because of the below code in your test.
>>
>> ! if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
>> !RecoveryInProgress()) {
>> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
>> ! ereport(FATAL,
>> ! (errcode(ERRCODE_DISK_FULL),
>> !  errmsg("could not write block %u of relation %s: wrote only %d of %d
>> bytes",
>> ! blocknum,
>> ! relpath(reln->smgr_rnode, forknum),
>> ! nbytes, BLCKSZ),
>> !  errhint("JJ is screwing with the database.")));
>> ! } else {
>> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
>> ! }
>>
>> If you are running the above test by disabling JJ_torn_page, then it
>> is a different matter and we need to investigate it, but l assume you
>> are running by enabling it.
>>
>> I think this could happen if the actual change in page is in 2/3 part
>> of page which you are not writing in above code.  The checksum in page
>> header which is written as part of partial page write (1/3 part of
>> page) would have considered the actual change you have made whereas
>> after restart when it again read the page to apply redo, the checksum
>> calculation won't include the change being made in 2/3 part.
>
>
> Correct.  But any torn page write must be covered by the restoration of a
> full page image during replay, shouldn't it?  And that restoration should
> happen blindly, without first reading in the old page and verifying the
> checksum.
>

Probably, but I think this is not the way it is currently handled, and
I don't want to change that.  AFAIU, what happens now is that we first
read the old page (and we do page verification while reading the old
page, displaying a *warning* if the checksum doesn't match) and then
restore the image.  The relevant code path is
XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().

Now, the question here is why we are seeing FATAL instead of WARNING
for the bitmap page of the hash index.  The reason, as explained in my
previous e-mail, is that for the bitmap page we are not logging a full
page image, and we should fix that as explained there.  Once the full
page image is logged, then the first time the torn page is read it
will use the RBM_ZERO_AND_LOCK flag, which will turn the FATAL error
you are seeing into a WARNING.

I think this is the reason why Ashutosh is seeing WARNING for heap
page whereas you are seeing FATAL for bitmap page of hash index.

I don

Re: [HACKERS] Hash Indexes

2016-09-21 Thread Amit Kapila
On Thu, Sep 22, 2016 at 8:03 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-09-21 22:23:27 -0400, Tom Lane wrote:
>> Andres Freund <and...@anarazel.de> writes:
>> > Sure. But that can be addressed, with a lot less effort than fixing and
>> > maintaining the hash indexes, by adding the ability to do that
>> > transparently using btree indexes + a recheck internally.  How that
>> > compares efficiency-wise is unclear as of now. But I do think it's
>> > something we should measure before committing the new code.
>>
>> TBH, I think we should reject that argument out of hand.  If someone
>> wants to spend time developing a hash-wrapper-around-btree AM, they're
>> welcome to do so.  But to kick the hash AM as such to the curb is to say
>> "sorry, there will never be O(1) index lookups in Postgres".
>
> Note that I'm explicitly *not* saying that. I just would like to see
> actual comparisons being made before investing significant amounts of
> code and related effort being invested in fixing the current hash table
> implementation. And I haven't seen a lot of that.
>

I think it can be deduced from the testing done till now.  Basically,
having an index (btree or hash) on an integer column allows a fair
comparison, since the key size will be the same in both the hash and
the btree index.  In such a case, if we know that the hash index
performs better in certain cases, that is an indication it will also
perform better than the scheme you are suggesting, because the hash
index doesn't have the extra recheck in the btree code, which would
further worsen the case for btree.

>  If the result of that
> comparison is that hash-indexes actually perform very well: Great!
>


>
>> always be superior, I don't see how it follows that we should refuse to
>> commit work that's already been done.  Is committing it somehow going to
>> prevent work on the btree-wrapper approach?
>
> The necessary work seems a good bit from finished.
>

Are you saying this about the WAL patch?  If yes, then even though it
is still some way from being in shape to commit, a lot of effort has
gone into getting it to its current stage, and it is not in bad shape
either.  It has survived a lot of testing; there are still some bugs,
which we are fixing.

One more thing I want to say: don't assume that all the people
involved in the current development of hash indexes, or in further
development on them, will run away once the code is committed, leaving
the responsibility of maintenance to other senior members of the
community.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Re: [HACKERS] Windows service is not starting so there’s message in log: FATAL: "could not create shared memory segment “Global/Postgr

2016-09-21 Thread Amit Kapila
On Tue, Sep 20, 2016 at 10:33 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Thu, May 26, 2016 at 7:44 PM, Michael Paquier
>> <michael.paqu...@gmail.com> wrote:
>>> Yeah, it is definitely better to register it. And I have switched the
>>> patch as ready for committer just now.
>
>> Committed and back-patched to 9.4, where DSM was introduced.
>
> ISTM both the previous coding and this version can fail for no good
> reason, that is what if GetLastError() happens to return one of these
> error codes as a result of some unrelated failure from before this
> subroutine is entered?  That is, wouldn't it be a good idea to
> do SetLastError(0) before calling CreateFileMapping?
>

Yes, that seems like a good idea.  Do you need a patch, with some
testing in a Windows environment?

>  Or is the latter
> guaranteed to do that on success?
>

I don't see any such guarantee on the MSDN page, and it appears from
the GetLastError()/SetLastError() specs [1][2] that functions which
set the last error code to zero on success do mention it in their
documentation.


[1] - 
https://msdn.microsoft.com/en-us/library/windows/desktop/ms679360(v=vs.85).aspx
[2] - 
https://msdn.microsoft.com/en-us/library/windows/desktop/ms680627(v=vs.85).aspx



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-21 Thread Amit Kapila
On Wed, Sep 21, 2016 at 3:48 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
>
> I have no idea what's causing this - it might be related to the kernel, but
> I'm not sure why it should affect the patches differently. Let's see how the
> new kernel affects this.
>
>  dilip / sync=off         16      32      64     128     192
>  -------------------------------------------------------------
>  master                26198   37901   37211   14441    8315
>  granular-locking      25829   38395   40626   14299    8160
>  no-content-lock       25872   38994   41053   14058    8169
>  group-update          26503   38911   42993   19474    8325
>
>  dilip / sync=on          16      32      64     128     192
>  -------------------------------------------------------------
>  master                26138   37790   38492   13653    8337
>  granular-locking      25661   38586   40692   14535    8311
>  no-content-lock       25653   39059   41169   14370    8373
>  group-update          26472   39170   42126   18923    8366
>
>  pgbench / sync=off       16      32      64     128     192
>  -------------------------------------------------------------
>  master                23001   35762   41202   31789    8005
>  granular-locking      23218   36130   42535   45850    8701
>  no-content-lock       23322   36553   42772   47394    8204
>  group-update          23129   36177   41788   46419    8163
>
>  pgbench / sync=on        16      32      64     128     192
>  -------------------------------------------------------------
>  master                22904   36077   41295   35574    8297
>  granular-locking      23323   36254   42446   43909    8959
>  no-content-lock       23304   36670   42606   48440    8813
>  group-update          23127   36696   41859   46693    8345
>
>
> So there is some improvement due to the patches for 128 clients (+30% in
> some cases), but it's rather useless as 64 clients either give you
> comparable performance (pgbench workload) or way better one (Dilip's
> workload).
>

I think these results are somewhat similar to what Dilip has reported.
In both cases, the performance improvement is seen when the client
count is greater than the number of cores (including HT).  As far as I
know, the machine on which Dilip is running the tests also has 64
hardware threads.  The point here is that the CLOGControlLock
contention is noticeable only at that client count, so it is not the
fault of the patch that it does not improve things at lower client
counts.  I guess we would see a performance improvement between 64 and
128 clients as well.


> Also, pretty much no difference between synchronous_commit on/off, probably
> thanks to running on unlogged tables.
>

Yeah.

> I'll repeat the test on the 4-socket machine with a newer kernel, but that's
> probably the last benchmark I'll do for this patch for now.
>

Okay, but I think it is better to see the results for client counts
between 64 and 128, and maybe greater than 128, because it is clear
that the patch won't improve performance below that.

> I agree with
> Robert that the cases the patch is supposed to improve are a bit contrived
> because of the very high client counts.
>

No issues.  I have already explained, in yesterday's mail and this
one, why I think it is important to reduce the remaining
CLOGControlLock contention.  If none of you is convinced, then I think
we have no choice but to drop this patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-20 Thread Amit Kapila
On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>>
>>
>> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
>> that, I have changed _hash_alloc_buckets() to initialize the page
>> instead of making it completely Zero because of problems discussed in
>> another related thread [1].  I have also updated README.
>>
>
> with v7 of the concurrent has patch and v4 of the write ahead log patch and
> the latest relcache patch (I don't know how important that is to reproducing
> this, I suspect it is not), I once got this error:
>
>
> 38422  0 2016-09-19 16:25:50.055 PDT:LOG:  database system was
> interrupted; last known up at 2016-09-19 16:25:49 PDT
> 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
> properly shut down; automatic recovery in progress
> 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
> 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification failed,
> calculated checksum 65067 but expected 21260
> 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9 of
> relation base/16384/17334
> 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>
>
> The original page with the invalid checksum is:
>

I think this is an example of the torn-page problem, which seems to be
happening because of the below code in your test.

! if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
!RecoveryInProgress()) {
!   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! ereport(FATAL,
! (errcode(ERRCODE_DISK_FULL),
!  errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
! blocknum,
! relpath(reln->smgr_rnode, forknum),
! nbytes, BLCKSZ),
!  errhint("JJ is screwing with the database.")));
! } else {
!   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! }

If you are running the above test with JJ_torn_page disabled, then it
is a different matter and we need to investigate it, but I assume you
are running with it enabled.

I think this could happen if the actual change to the page is in the
last 2/3 of the page, which the above code does not write.  The
checksum in the page header, written as part of the partial page write
(the first 1/3 of the page), would have been computed over the actual
change you made, whereas after restart, when the page is read again to
apply redo, the checksum calculation won't include the change made in
the 2/3 part.
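
A self-contained illustration of that failure mode (using a toy
checksum, not the real page checksum routine, and made-up offsets):
the checksum stored in the header covers the whole updated page, but
after the torn write only the first third reaches disk, so
re-verification fails:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

static uint16_t
toy_checksum(const uint8_t *page)
{
    uint32_t    sum = 0;
    int         i;

    /* skip the first two bytes, where the checksum itself is stored */
    for (i = 2; i < BLCKSZ; i++)
        sum = (sum * 31 + page[i]) & 0xFFFF;
    return (uint16_t) sum;
}

int
main(void)
{
    uint8_t     on_disk[BLCKSZ];
    uint8_t     in_memory[BLCKSZ];
    uint16_t    stored;
    uint16_t    recomputed;

    memset(on_disk, 0xAA, BLCKSZ);          /* old page contents on disk */
    memcpy(in_memory, on_disk, BLCKSZ);
    memset(in_memory + 6000, 0x55, 100);    /* change lands in the last 2/3 */
    stored = toy_checksum(in_memory);
    memcpy(in_memory, &stored, sizeof(stored)); /* "page header" checksum */

    /* simulate the torn write: only the first third reaches disk */
    memcpy(on_disk, in_memory, BLCKSZ / 3);

    memcpy(&stored, on_disk, sizeof(stored));
    recomputed = toy_checksum(on_disk);
    printf("stored %u, recomputed %u: %s\n", stored, recomputed,
           stored == recomputed ? "page verifies" : "checksum failure");
    return 0;
}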

Today, Ashutosh shared the logs of his test run, where he has shown a
similar problem for a heap page.  I think this could happen, though
rarely, for any page with the above kind of tests.

Does this explanation cover the reason for the problem you are seeing?

>
> If I ignore the checksum failure and re-start the system, the page gets
> restored to be a bitmap page.
>

Okay, but have you ensured in some way that redo is applied to the bitmap page?


Today, while thinking about this problem, I realized that currently
the patch uses REGBUF_NO_IMAGE for the bitmap page, to address one of
the problems you reported [1].  That change fixes that problem, but it
exposes bitmap pages to torn-page hazards.  I think the right fix
there is to make pd_lower equal to pd_upper for the bitmap page, so
that full page writes don't exclude the data in the bitmap page.
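
A hypothetical fragment of what that could look like while
initializing a hash bitmap page (variable names as in
_hash_initbitmap(); bmsize stands for the bitmap size in bytes — this
is only a sketch of the idea, not the actual fix):

freep = HashPageGetBitmap(pg);
MemSet(freep, 0xFF, bmsize);

/*
 * Advance pd_lower past the bitmap words so the page reports no "hole";
 * full page images omit the area between pd_lower and pd_upper, so this
 * keeps the bitmap bits in the FPI.
 */
((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;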

Thoughts?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel sec scan in plpgsql

2016-09-20 Thread Amit Kapila
On Mon, Sep 19, 2016 at 11:39 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Sep 17, 2016 at 11:54 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> In general, I think we should support the cases as required (or
>> written) by you from plpgsql or sql functions.  We need more work to
>> support such cases. There are probably two ways of supporting such
>> cases, we can build some intelligence in plpgsql execution such that
>> it can recognise such queries and allow to use parallelism or we need
>> to think of enabling parallelism for cases where we don't run the plan
>> to completion.  Most of the use cases from plpgsql or sql function
>> fall into later category as they don't generally run the plan to
>> completion.
>
> I think there's certainly more work that could be done to teach
> PL/pgsql about cases where the query will run to completion.  I didn't
> work very hard to make sure we covered all of those; there are
> probably several different cases where parallelism could be safely
> enabled but currently isn't.  Also, I didn't do anything at all to
> update the other PLs, and that would be good, too.
>
> However, I think the chances of supporting parallel query for queries
> not executed to completion any time in the near future are very poor.
>

I think the point here is that for any case where there is a count of
rows to be selected, we disable parallelism.  There are many genuine
cases, like SELECT count(*) INTO cnt ..., which will run to
completion, but as plpgsql passes a row count of 1 or 2, they don't
enter parallel mode.  There are a couple of other cases like that.  Do
you see a reason for not enabling parallelism for such cases?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "Some tests to cover hash_index"

2016-09-20 Thread Amit Kapila
On Mon, Sep 19, 2016 at 8:44 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> On Sat, Aug 6, 2016 at 9:41 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> I wonder why you have included a new file for these tests, why can't be
>> these added to existing hash_index.sql.
> tests in hash_index.sql did not cover overflow pages, above tests were for
> mainly for them.
>

The name of the file, hash_index_split, suggests it focuses on splits.
Are the tests focused more on overflow pages or on hash index splits?

> So I thought having a separate test file can help
> enabling/disabling them in schedule files, when we do not want them running
> as it take slightly high time. If you think otherwise I will reconsider will
> add tests to hash_index.sql.
>

I think you have a point, but I am not sure it is worth adding a
separate file.  It might be tricky to choose which file to add new
hash index tests to.  Does anybody else have an opinion on this point?

Can you check how much time it takes compared to the btree or brin index tests?

I am seeing the below diff with your new patch.

***
*** 1,4 
! --
  -- Cause some overflow insert and splits.
  --
  CREATE TABLE hash_split_heap (keycol INT);
--- 1,4 
! --
  -- Cause some overflow insert and splits.
  --
  CREATE TABLE hash_split_heap (keycol INT);

==

There is an extra space in the expected file which is leading to the above failure.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Covering + unique indexes.

2016-09-20 Thread Amit Kapila
On Tue, Sep 20, 2016 at 10:51 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova
> <a.lubennik...@postgrespro.ru> wrote:
>> 28.08.2016 09:13, Amit Kapila:
>>
>> On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova
>> <a.lubennik...@postgrespro.ru> wrote:
>>
>>
>> So the patch is correct.
>> We can go further and remove this index_truncate_tuple() call, because
>> the first key of any inner (or root) page doesn't need any key at all.
>>
>
> Anyway, I think truncation happens if the page is at leaf level and
> that is ensured by check, so I think we can't remove this:
> + if (indnkeyatts != indnatts && P_ISLEAF(pageop))
>
>
> -- I have one more question regarding this truncate high-key concept.
> I think if high key is truncated, then during insertion, for cases
> like below it move to next page, whereas current page needs to be
> splitted.
>
> Assume index on c1,c2,c3 and c2,c3 are including columns.
>
> Actual high key on leaf Page X -
> 3, 2 , 2
> Truncated high key on leaf Page X
> 3
>
> New insertion key
> 3, 1, 2
>
> Now, I think for such cases during insertion if the page X doesn't
> have enough space, it will move to next page whereas ideally, it
> should split current page.  Refer function _bt_findinsertloc() for
> this logic.
>

Basically, what I wanted to know here is whether we maintain ordering
for keys with respect to the included columns while storing them (in
the above example, do we ensure that 3,1,2 is always stored before 3,2,2)?

>
>
> -- I am getting Assertion failure when I use this patch with database
> created with a build before this patch.  However, if I create a fresh
> database it works fine.  Assertion failure details are as below:
>

I have tried this test on my Windows m/c only.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Covering + unique indexes.

2016-09-19 Thread Amit Kapila
On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
> 28.08.2016 09:13, Amit Kapila:
>
> On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova
> <a.lubennik...@postgrespro.ru> wrote:
>
>
> So the patch is correct.
> We can go further and remove this index_truncate_tuple() call, because
> the first key of any inner (or root) page doesn't need any key at all.
>

Anyway, I think truncation happens only when the page is at leaf
level, and that is ensured by the check, so I think we can't remove this:


-- I have one more question regarding this truncated high-key concept.
I think if the high key is truncated, then during insertion, for cases
like the one below, we move to the next page, whereas the current page
needs to be split.

Assume index on c1,c2,c3 and c2,c3 are including columns.

Actual high key on leaf Page X -
3, 2 , 2
Truncated high key on leaf Page X
3

New insertion key
3, 1, 2

Now, I think for such cases during insertion, if page X doesn't have
enough space, we will move to the next page, whereas ideally we should
split the current page.  Refer to _bt_findinsertloc() for this logic.

Is this high-key truncation needed for the correctness of the patch,
or is it just to save space in the index?  If you need it, then I
think nbtree/README needs to be updated.


-- I am getting an assertion failure when I use this patch with a
database created by a build from before this patch.  However, if I
create a fresh database, it works fine.  Assertion failure details are below:

LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
TRAP: unrecognized TOAST vartag("((bool) 1)", File: "src/backend/access/common/h
eaptuple.c", Line: 532)
LOG:  server process (PID 1404) was terminated by exception 0x8003
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal valu
e.
LOG:  terminating any other active server processes

--
@@ -1260,14 +1262,14 @@ RelationInitIndexAccessInfo(Relation relation)
  * Allocate arrays to hold data
  */
  relation->rd_opfamily = (Oid *)
- MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
+ MemoryContextAllocZero(indexcxt, indnkeyatts * sizeof(Oid));
  relation->rd_opcintype = (Oid *)
- MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
+ MemoryContextAllocZero(indexcxt, indnkeyatts * sizeof(Oid));

  amsupport = relation->rd_amroutine->amsupport;
  if (amsupport > 0)
  {
- int nsupport = natts * amsupport;
+ int nsupport = indnatts * amsupport;

  relation->rd_support = (RegProcedure *)
  MemoryContextAllocZero(indexcxt, nsupport * sizeof(RegProcedure));
@@ -1281,10 +1283,10 @@ RelationInitIndexAccessInfo(Relation relation)
  }

  relation->rd_indcollation = (Oid *)
- MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
+ MemoryContextAllocZero(indexcxt, indnatts * sizeof(Oid));

Can you add a comment in the above code, or some other related place,
as to why some arrays in the relcache entry need to be of size
indnkeyatts and others of size indnatts?

--
@@ -63,17 +63,26 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 {
  ScanKey skey;
  TupleDesc itupdesc;
- int natts;
+ int indnatts,
+ indnkeyatts;
  int16   *indoption;
  int i;

  itupdesc = RelationGetDescr(rel);
- natts = RelationGetNumberOfAttributes(rel);
+ indnatts = IndexRelationGetNumberOfAttributes(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
  indoption = rel->rd_indoption;

- skey = (ScanKey) palloc(natts * sizeof(ScanKeyData));
+ Assert(indnkeyatts != 0);
+ Assert(indnkeyatts <= indnatts);

Here I think you need to declare indnatts as PG_USED_FOR_ASSERTS_ONLY;
otherwise it will give a warning on some platforms.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-19 Thread Amit Kapila
On Tue, Sep 20, 2016 at 12:40 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sun, Sep 18, 2016 at 5:11 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> IMHO in the ideal case the first message in this thread would provide a test
>> case, demonstrating the effect of the patch. Then we wouldn't have the issue
>> of looking for a good workload two years later.
>>
>> But now that I look at the first post, I see it apparently used a plain
>> tpc-b pgbench (with synchronous_commit=on) to show the benefits, which is
>> the workload I'm running right now (results sometime tomorrow).
>>
>> That workload clearly uses no savepoints at all, so I'm wondering why you
>> suggested to use several of them - I know you said that it's to show
>> differences between the approaches, but why should that matter to any of the
>> patches (and if it matters, why I got almost no differences in the
>> benchmarks)?
>>
>> Pardon my ignorance, CLOG is not my area of expertise ...
>
> It's possible that the effect of this patch depends on the number of
> sockets.  EDB test machine cthulhu as 8 sockets, and power2 has 4
> sockets.  I assume Dilip's tests were run on one of those two,
>

I think it is the former (the 8-socket machine).

> although he doesn't seem to have mentioned which one.  Your system is
> probably 2 or 4 sockets, which might make a difference.  Results might
> also depend on CPU architecture; power2 is, unsurprisingly, a POWER
> system, whereas I assume you are testing x86.  Maybe somebody who has
> access should test on hydra.pg.osuosl.org, which is a community POWER
> resource.  (Send me a private email if you are a known community
> member who wants access for benchmarking purposes.)
>
> Personally, I find the results so far posted on this thread thoroughly
> unimpressive.  I acknowledge that Dilip's results appear to show that
> in a best-case scenario these patches produce a rather large gain.
> However, that gain seems to happen in a completely contrived scenario:
> astronomical client counts, unlogged tables, and a test script that
> maximizes pressure on CLogControlLock.
>

You are right that the scenario is somewhat contrived, but he hasn't
posted the results for simple-update or tpc-b style pgbench scenarios,
so we can't conclude that those won't show a benefit.  I think we can
see benefits with synchronous_commit=off as well, though perhaps not
as large as with unlogged tables.  The other thing to keep in mind is
that reducing contention on one lock (in this case CLOGControlLock)
also gives benefits when we reduce contention on other locks (like
ProcArrayLock, WALWriteLock, ...).  Last time we verified this effect
with Andres's patch (cache the snapshot), which reduces the remaining
contention on ProcArrayLock.  We saw that individually that patch
gives some benefit, but in combination with removing the contention on
CLOGControlLock via the patches discussed in this thread (increasing
the clog buffers and the grouping stuff, each of which helps), it
gives a much bigger benefit.

Your point about high client counts is valid, and I am sure the patch
won't give a noticeable benefit at lower client counts, as the
CLOGControlLock contention starts to matter only at high client
counts.  However, I am not sure it is a good idea to reject a patch
which helps stabilise performance (i.e., helps avoid falling off the
cliff) when the number of processes exceeds the number of cores (or
hardware threads).
>  If you have to work that hard
> to find a big win, and tests under more reasonable conditions show no
> benefit, it's not clear to me that it's really worth the time we're
> all spending benchmarking and reviewing this, or the risk of bugs, or
> the damage to the SLRU abstraction layer.

I agree with you that unless it shows a benefit in somewhat more usual
scenarios, we should not accept it.  So shouldn't we wait for results
of other workloads, like simple-update or tpc-b, on bigger machines
before reaching a conclusion?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.

2016-09-19 Thread Amit Kapila
On Sat, Sep 10, 2016 at 5:02 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Tue, Aug 30, 2016 at 8:01 AM, Andres Freund <and...@anarazel.de> wrote:
>> On 2016-08-30 07:57:19 +0530, Amit Kapila wrote:
>>> I will write such a test case either in this week or early next week.
>>
>> Great.
>>
>
Okay, the attached patch adds some simple tests for pg_stat_statements.
One thing to note is that we can't run the pg_stat_statements tests
under installcheck, as this module requires shared_preload_libraries
to be set.  Hope this suffices.
>

Registered this patch for next CF.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Possibly too stringent Assert() in b-tree code

2016-09-19 Thread Amit Kapila
On Mon, Sep 19, 2016 at 1:32 PM, Antonin Houska <a...@cybertec.at> wrote:
> I've recently seen this when using 9.6:
>
> #0  0x7f147892f0c7 in raise () from /lib64/libc.so.6
> #1  0x7f1478930478 in abort () from /lib64/libc.so.6
> #2  0x009683a1 in ExceptionalCondition (conditionName=0x9f2ef8 
> "!(((PageHeader) (page))->pd_special >= (__builtin_offsetof (PageHeaderData, 
> pd_linp)))", errorType=0x9f2e8a "FailedAssertion",
> fileName=0x9f2e60 "../../../../src/include/storage/bufpage.h", 
> lineNumber=314) at assert.c:54
> #3  0x004ee3d2 in PageValidateSpecialPointer (page=0x7f14700f3480 "") 
> at ../../../../src/include/storage/bufpage.h:314
> #4  0x004ef913 in _bt_getbuf (rel=0x7f14795a1a50, blkno=11, access=2) 
> at nbtpage.c:629
> #5  0x004eae8c in _bt_split (rel=0x7f14795a1a50, buf=136, cbuf=0, 
> firstright=367, newitemoff=408, newitemsz=16, newitem=0x2608330, 
> newitemonleft=0 '\000') at nbtinsert.c:986
> #6  0x004ea563 in _bt_insertonpg (rel=0x7f14795a1a50, buf=136, 
> cbuf=0, stack=0x26090c8, itup=0x2608330, newitemoff=408, split_only_page=0 
> '\000') at nbtinsert.c:781
> #7  0x004e954c in _bt_doinsert (rel=0x7f14795a1a50, itup=0x2608330, 
> checkUnique=UNIQUE_CHECK_YES, heapRel=0x25aa0a0) at nbtinsert.c:204
> #8  0x004f3b73 in btinsert (rel=0x7f14795a1a50, 
> values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0 "", ht_ctid=0x260927c, 
> heapRel=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES) at nbtree.c:279
> #9  0x004e7964 in index_insert (indexRelation=0x7f14795a1a50, 
> values=0x7ffe46c6f7f0, isnull=0x7ffe46c6f7d0 "", heap_t_ctid=0x260927c, 
> heapRelation=0x25aa0a0, checkUnique=UNIQUE_CHECK_YES)
> at indexam.c:204
>
> Of course, the database could have been corrupted after having encountered
> many crashes during my experiments. Neverthelesss, even without in-depth
> knowledge of the b-tree code I suspect that this stack trace might reflect a
> legal situation. In partcular, if _bt_page_recyclable() returned on this
> condition:
>
> if (PageIsNew(page))
> return true;
>

I think you have a valid point.  It seems we don't need to write WAL
for page reuse (i.e., we don't need to call _bt_log_reuse_page()) if
the page is new, as the only purpose of that record is to resolve
recovery conflicts based on the transaction id stored in the special
area, which will be zero anyway.
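For clarity, a rough sketch of the kind of guard I have in mind (this
follows the shape of the reuse path in _bt_getbuf(), but it is only an
illustration, not the actual code):

if (_bt_page_recyclable(page))
{
    /*
     * A brand-new (all-zeroes) page has no special area and hence no
     * stored transaction id to raise a recovery conflict for; touching
     * its special pointer is also likely what trips the Assert shown
     * above, so skip the reuse-page WAL record in that case.
     */
    if (!PageIsNew(page) &&
        XLogStandbyInfoActive() && RelationNeedsWAL(rel))
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
    }

    /* ... then reinitialize and reuse the page as before ... */
}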


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-19 Thread Amit Kapila
On Mon, Sep 19, 2016 at 11:20 AM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:
>
>
> On 17/09/16 06:38, Andres Freund wrote:
>>
>> On 2016-09-16 09:12:22 -0700, Jeff Janes wrote:
>>>
>>> On Thu, Sep 15, 2016 at 7:23 AM, Andres Freund <and...@anarazel.de>
>>> wrote:
>>>>
>>>> One earlier question about this is whether that is actually a worthwhile
>>>> goal.  Are the speed and space benefits big enough in the general case?
>>>> Could those benefits not be achieved in a more maintainable manner by
>>>> adding a layer that uses a btree over hash(columns), and adds
>>>> appropriate rechecks after heap scans?
>>>>
>>>> Note that I'm not saying that hash indexes are not worthwhile, I'm just
>>>> doubtful that question has been explored sufficiently.
>>>
>>> I think that exploring it well requires good code.  If the code is good,
>>> why not commit it?
>>
>> Because getting there requires a lot of effort, debugging it afterwards
>> would take effort, and maintaining it would also takes a fair amount?
>> Adding code isn't free.
>>
>> I'm rather unenthused about having a hash index implementation that's
>> mildly better in some corner cases, but otherwise doesn't have much
>> benefit. That'll mean we'll have to step up our user education a lot,
>> and we'll have to maintain something for little benefit.
>>
>
> While I see the point of what you are saying here, I recall all previous
> discussions about has indexes tended to go a bit like this:
>
> - until WAL logging of hash indexes is written it is not worthwhile trying
> to make improvements to them
> - WAL logging will be a lot of work, patches 1st please
>
> Now someone has done that work, and we seem to be objecting that because
> they are not improved then the patches are (maybe) not worthwhile.
>

I think saying hash indexes are not improved by the proposed set of
patches rather undersells them.  Read performance has improved by more
than 80% as compared to HEAD [1] (refer to the data in Mithun's mail).
Also, tests by Mithun and Jesper have indicated that in multiple
workloads they are better than BTREE by 30~60% (in fact Jesper
mentioned that he is seeing a 40~60% benefit on a production database;
Jesper, correct me if I am wrong).  I agree that when the index column
is updated they are much worse than btree as of now, but no work has
been done to improve that yet, and I am sure it can be improved for
those cases as well.

In general, I thought the tests done till now are sufficient to prove
the importance of this work, but if Andres and others still have doubts
and want to test some specific cases, then sure, we can do more
performance benchmarking.

Mark, thanks for supporting the case for improving Hash Indexes.


[1] - 
https://www.postgresql.org/message-id/CAD__OugX0aOa7qopz3d-nbBAoVmvSmdFJOX4mv5tFRpijqH47A%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-17 Thread Amit Kapila
On Sat, Sep 17, 2016 at 11:25 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 09/17/2016 07:05 AM, Amit Kapila wrote:
>>
>> On Sat, Sep 17, 2016 at 9:17 AM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>>
>>> On 09/14/2016 05:29 PM, Robert Haas wrote:
>
> ...
>>>>
>>>> Sure, but you're testing at *really* high client counts here.
>>>> Almost nobody is going to benefit from a 5% improvement at 256
>>>> clients. You need to test 64 clients and 32 clients and 16
>>>> clients and 8 clients and see what happens there. Those cases are
>>>> a lot more likely than these stratospheric client counts.
>>>>
>>>
>>> Right. My impression from the discussion so far is that the patches
>>> only improve performance with very many concurrent clients - but as
>>> Robert points out, almost no one is running with 256 active
>>> clients, unless they have 128 cores or so. At least not if they
>>> value latency more than throughput.
>>>
>>
>> See, I am also not in favor of going with any of these patches, if
>> they doesn't help in reduction of contention. However, I think it is
>> important to understand, under what kind of workload and which
>> environment it can show the benefit or regression whichever is
>> applicable.
>
>
> Sure. Which is why I initially asked what type of workload should I be
> testing, and then done the testing with multiple savepoints as that's what
> you suggested. But apparently that's not a workload that could benefit from
> this patch, so I'm a bit confused.
>
>> Just FYI, couple of days back one of EDB's partner who was doing the
>> performance tests by using HammerDB (which is again OLTP focussed
>> workload) on 9.5 based code has found that CLogControlLock has the
>> significantly high contention. They were using synchronous_commit=off
>> in their settings. Now, it is quite possible that with improvements
>> done in 9.6, the contention they are seeing will be eliminated, but
>> we have yet to figure that out. I just shared this information to you
>> with the intention that this seems to be a real problem and we should
>> try to work on it unless we are able to convince ourselves that this
>> is not a problem.
>>
>
> So, can we approach the problem from this direction instead? That is,
> instead of looking for workloads that might benefit from the patches, look
> at real-world examples of CLOG lock contention and then evaluate the impact
> on those?
>

Sure, we can go that way as well, but I thought that instead of testing
with a new benchmark kit (HammerDB), it is better to start with some
simple statements first.

> Extracting the workload from benchmarks probably is not ideal, but it's
> still better than constructing the workload on our own to fit the patch.
>
> FWIW I'll do a simple pgbench test - first with synchronous_commit=on and
> then with synchronous_commit=off. Probably the workloads we should have
> started with anyway, I guess.
>

Here, the synchronous_commit = off case could be interesting.  Do you
see any problem with first trying a workload where Dilip is seeing a
benefit?  I am not suggesting we should not do any other testing, but
let's first try to reproduce the performance gain seen in Dilip's
tests.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel sec scan in plpgsql

2016-09-17 Thread Amit Kapila
On Fri, Sep 16, 2016 at 8:48 PM, Alex Ignatov <a.igna...@postgrespro.ru> wrote:
>
> On 16.09.2016 16:50, Amit Kapila wrote:
>>
>>
>> Can you try by setting force_parallel_mode = off;?  I think it is
>> sending the whole function execution to worker due to
>> force_parallel_mode.
>>
>>
>
> No changes:
>

Okay, it just slipped my mind that we don't support parallel queries
for SQL statement execution (or statements executed via
exec_stmt_execsql) from plpgsql.  For a detailed explanation of why
that is not feasible, you can refer to one of my earlier e-mails [1] on
a similar topic.  I think if we could somehow get the results via a
PERFORM statement, then it could be possible to use parallelism via
plpgsql.

However, you can use it via SQL functions; an example is below:

set min_parallel_relation_size =0;
set parallel_tuple_cost=0;
set parallel_setup_cost=0;

Load 'auto_explain';
set auto_explain.log_min_duration = 0;
set auto_explain.log_analyze = true;
set auto_explain.log_nested_statements = true;

create table test_plpgsql(c1 int, c2 char(1000));
insert into test_plpgsql values(generate_series(1,10),'aaa');

create or replace function parallel_test_set_sql() returns
setof bigint as $$
select count(*) from test_plpgsql;
$$language sql PARALLEL SAFE STRICT STABLE;

Then execute the function as: select * from parallel_test_set_sql();
You can see the plan below if the auto_explain module is loaded.

Finalize Aggregate  (cost=14806.85..14806.86 rows=1 width=8) (actual time=1094.966..1094.967 rows=1 loops=1)
  ->  Gather  (cost=14806.83..14806.84 rows=2 width=8) (actual time=472.216..1094.943 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=14806.83..14806.84 rows=1 width=8) (actual time=177.867..177.868 rows=1 loops=3)
              ->  Parallel Seq Scan on test_plpgsql  (cost=0.00..14702.67 rows=41667 width=0) (actual time=0.384..142.565 rows=3 loops=3)
CONTEXT:  SQL function "parallel_test_set_sql" statement 1
LOG:  duration: 2965.040 ms  plan:
Query Text: select * from parallel_test_set_sql();
Function Scan on parallel_test_set_sql  (cost=0.25..10.25 rows=1000 width=8) (actual time=2538.620..2776.955 rows=1 loops=1)


In general, I think we should support the cases you require (or have
written) from plpgsql or SQL functions.  We need more work to support
such cases.  There are probably two ways of supporting them: we can
build some intelligence into plpgsql execution so that it can recognise
such queries and allow them to use parallelism, or we need to think
about enabling parallelism for cases where we don't run the plan to
completion.  Most of the use cases from plpgsql or SQL functions fall
into the latter category, as they don't generally run the plan to
completion.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1K8kaO_jRk42-o2rmhSRbKV-3mR%2BiNVcONLdbcSXW5TfQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Rename max_parallel_degree?

2016-09-17 Thread Amit Kapila
On Fri, Sep 16, 2016 at 11:54 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
> +   /*
> +* We need a write barrier to make sure the update of
> +* parallel_terminate_count is done before the store to in_use
> +*/
>
> Does the order actually matter here?
>

I think so.  If the store to slot->in_use is reordered before the check
of is_parallel_worker, then it is possible that a concurrent
registration of a worker marks is_parallel_worker as false before we
check the flag here.  See the explanation in the previous e-mail [1].


[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BQ_DdJ28qXYSHZiBKNf2MY1QFrv5XAOAw4ZXHw4TPMxA%40mail.gmail.com
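
For illustration only, a minimal sketch of the ordering in question
(the struct and field names below are simplified for the example and
are not the actual bgworker.c definitions):

typedef struct
{
    uint32      parallel_terminate_count;   /* bumped when a parallel worker's slot is freed */
    bool        in_use;                     /* slot may be reused once this is false */
} WorkerSlotSketch;

static void
release_slot_sketch(WorkerSlotSketch *slot, bool is_parallel_worker)
{
    if (is_parallel_worker)
        slot->parallel_terminate_count++;

    /*
     * Make the counter update visible to other backends before the slot
     * is published as free; without the barrier, a concurrent
     * registration could reuse the slot before the terminate count had
     * been accounted for.
     */
    pg_write_barrier();

    slot->in_use = false;
}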

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-16 Thread Amit Kapila
On Sat, Sep 17, 2016 at 9:17 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 09/14/2016 05:29 PM, Robert Haas wrote:
>>
>> On Wed, Sep 14, 2016 at 12:55 AM, Dilip Kumar <dilipbal...@gmail.com>
>> wrote:
>>>
>>> 2. Results
>>> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f
>>> script.sql
>>> scale factor: 300
>>> Clients   head(tps)   grouplock(tps)   granular(tps)
>>> -------   ---------   --------------   -------------
>>> 128       29367       39326            37421
>>> 180       29777       37810            36469
>>> 256       28523       37418            35882
>>>
>>>
>>> grouplock --> 1) Group mode to reduce CLOGControlLock contention
>>> granular  --> 2) Use granular locking model
>>>
>>> I will test with 3rd approach also, whenever I get time.
>>>
>>> 3. Summary:
>>> 1. I can see on head we are gaining almost ~30 % performance at higher
>>> client count (128 and beyond).
>>> 2. group lock is ~5% better compared to granular lock.
>>
>>
>> Sure, but you're testing at *really* high client counts here.  Almost
>> nobody is going to benefit from a 5% improvement at 256 clients.  You
>> need to test 64 clients and 32 clients and 16 clients and 8 clients
>> and see what happens there.  Those cases are a lot more likely than
>> these stratospheric client counts.
>>
>
> Right. My impression from the discussion so far is that the patches only
> improve performance with very many concurrent clients - but as Robert points
> out, almost no one is running with 256 active clients, unless they have 128
> cores or so. At least not if they value latency more than throughput.
>

See, I am also not in favor of going with any of these patches, if
they doesn't help in reduction of contention.  However, I think it is
important to understand, under what kind of workload and which
environment it can show the benefit or regression whichever is
applicable.  Just FYI, couple of days back one of EDB's partner who
was doing the performance tests by using HammerDB (which is again OLTP
focussed workload) on 9.5 based code has found that CLogControlLock
has the significantly high contention.  They were using
synchronous_commit=off in their settings.  Now, it is quite possible
that with improvements done in 9.6, the contention they are seeing
will be eliminated, but we have yet to figure that out.  I just shared
this information to you with the intention that this seems to be a
real problem and we should try to work on it unless we are able to
convince ourselves that this is not a problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-16 Thread Amit Kapila
On Sat, Sep 17, 2016 at 9:12 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 09/17/2016 05:23 AM, Amit Kapila wrote:
>>
>> The difference between these and tests performed by Dilip is that he
>> has lesser savepoints.  I think if you want to try it again, then can
>> you once do it with either no savepoint or 1~2 savepoints.  The other
>> thing you could try out is the same test as Dilip has done (with and
>> without 2 savepoints).
>>
>
> I don't follow. My understanding is the patches should make savepoints
> cheaper - so why would using fewer savepoints increase the effect of the
> patches?
>

Oh no, the purpose of the patch is not to make savepoints cheaper (I
know I earlier suggested checking with a few savepoints, but that was
not intended to mean that this patch makes savepoints cheaper, rather
that it might show the difference between the different approaches;
sorry if that was not clearly stated earlier).  The purpose of this
patch (or patches) is to make commits cheaper, in particular updating
the status in CLOG.  Let me try to briefly explain the CLOG contention
and what these patches try to accomplish.  As of HEAD, when we update
the status in CLOG (TransactionIdSetPageStatus), we take
CLOGControlLock in EXCLUSIVE mode to read the appropriate CLOG page
(most of the time it will be in memory, so that is cheap) and then
update the transaction status in it.  We take CLOGControlLock in SHARED
mode (if the required clog page is in memory; otherwise the lock is
upgraded to EXCLUSIVE) while reading the transaction status, which
happens when we access a tuple whose hint bits are not set.

So, we have two different types of contention around CLOGControlLock:
(a) all the transactions that try to commit at the same time have to
update the status almost serially, and (b) readers of transaction
status contend with the writers.

Now, with the patch that went into 9.6 (increasing the clog buffers),
the second type of contention is mostly reduced, as most of the
required pages are in memory, and we are hoping that this patch will
help in reducing the first type (a) of contention as well.

>>
>
> I'm OK with running Dilip's tests, but I'm not sure why there's not much
> point in running the tests I've done. Or perhaps I'd like to understand why
> "my tests" show no improvement whatsoever first - after all, they're not
> that different from Dilip's.
>

The test Dilip is doing ("Select ... For Update") is mainly aimed at
the first type (a) of contention, as it doesn't change the hint bits,
so it mostly should not need to read the transaction status when
accessing the tuple.  The tests you are doing, on the other hand, are
mainly focussed on the second type (b) of contention.

One point we have to keep in mind here is that this contention is
visible on bigger multi-socket machines.  Last time Jesper also tried
these patches but didn't find much difference in his environment, and
on further analysis (IIRC) we found that the reason was that the
contention was simply not visible in his environment.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-09-16 Thread Amit Kapila
On Sat, Sep 17, 2016 at 6:54 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 09/14/2016 06:04 PM, Dilip Kumar wrote:
>>
>> On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas <robertmh...@gmail.com>
>> wrote:
>>>
>>> Sure, but you're testing at *really* high client counts here.  Almost
>>> nobody is going to benefit from a 5% improvement at 256 clients.
>>
>>
>> I agree with your point, but here we need to consider one more thing,
>> that on head we are gaining ~30% with both the approaches.
>>
>> So for comparing these two patches we can consider..
>>
>> A.  Other workloads (one can be as below)
>>-> Load on CLogControlLock at commit (exclusive mode) + Load on
>> CLogControlLock at Transaction status (shared mode).
>>I think we can mix (savepoint + updates)
>>
>> B. Simplicity of the patch (if both are performing almost equal in all
>> practical scenarios).
>>
>> C. Bases on algorithm whichever seems winner.
>>
>> I will try to test these patches with other workloads...
>>
>>>  You
>>> need to test 64 clients and 32 clients and 16 clients and 8 clients
>>> and see what happens there.  Those cases are a lot more likely than
>>> these stratospheric client counts.
>>
>>
>> I tested with 64 clients as well..
>> 1. On head we are gaining ~15% with both the patches.
>> 2. But group lock vs granular lock is almost same.
>>
>
>
> The transaction is using a single unlogged table initialized like this:
>
> create unlogged table t(id int, val int);
> insert into t select i, i from generate_series(1,10) s(i);
> vacuum t;
> create index on t(id);
>
> (I've also ran it with 100M rows, called "large" in the results), and
> pgbench is running this transaction:
>
> \set id random(1, 10)
>
> BEGIN;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s1;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s2;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s3;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s4;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s5;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s6;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s7;
> UPDATE t SET val = val + 1 WHERE id = :id;
> SAVEPOINT s8;
> COMMIT;
>
> So 8 simple UPDATEs interleaved by savepoints.
>

The difference between these and the tests performed by Dilip is that
he has fewer savepoints.  If you want to try it again, can you do it
once with either no savepoints or 1~2 savepoints?  The other thing you
could try out is the same test Dilip has done (with and without 2
savepoints).

> The benchmark was running on
> a machine with 256GB of RAM, 32 cores (4x E5-4620) and a fairly large SSD
> array. I'd done some basic tuning on the system, most importantly:
>
> effective_io_concurrency = 32
> work_mem = 512MB
> maintenance_work_mem = 512MB
> max_connections = 300
> checkpoint_completion_target = 0.9
> checkpoint_timeout = 3600
> max_wal_size = 128GB
> min_wal_size = 16GB
> shared_buffers = 16GB
>
> Although most of the changes probably does not matter much for unlogged
> tables (I planned to see how this affects regular tables, but as I see no
> difference for unlogged ones, I haven't done that yet).
>

You are right.  Unless we see a benefit with unlogged tables, there is
no point in doing it for regular tables.

> So the question is why Dilip sees +30% improvement, while my results are
> almost exactly the same. Looking at Dilip's benchmark, I see he only ran the
> test for 10 seconds, and I'm not sure how many runs he did, warmup etc.
> Dilip, can you provide additional info?
>
> I'll ask someone else to redo the benchmark after the weekend to make sure
> it's not actually some stupid mistake of mine.
>

I think there is not much point in repeating the tests you have done;
rather, it would be better to try the tests done by Dilip in your
environment and see the results.

Thanks for doing the tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel sec scan in plpgsql

2016-09-16 Thread Amit Kapila
On Fri, Sep 16, 2016 at 6:57 PM, Alex Ignatov <a.igna...@postgrespro.ru> wrote:
> No it doesn't.
> Paralleling neither sql function nor plpgsql:
> Here is example :
>
> ipdr=> show max_worker_processes ;
>  max_worker_processes
> --
>  128
> (1 row)
> ipdr=> set max_parallel_workers_per_gather to 128;
> SET
> ipdr=> set force_parallel_mode=on;
> SET
> ipdr=> set min_parallel_relation_size =0;
> SET
> ipdr=> set parallel_tuple_cost=0;
> SET
>

Can you try by setting force_parallel_mode = off?  I think it is
sending the whole function execution to a worker due to
force_parallel_mode.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] README of hash index

2016-09-16 Thread Amit Kapila
Currently, the README of the hash module contains algorithms written in
the form below.

The insertion algorithm is rather similar:

pin meta page and take buffer content lock in shared mode
loop:
compute bucket number for target hash key
release meta page buffer content lock
if (correct bucket page is already locked)
break
release any existing bucket page lock (if a concurrent split happened)
take heavyweight bucket lock in shared mode
retake meta page buffer content lock in shared mode
-- (so far same as reader)
release pin on metapage
..
..

I have mostly updated them in the patches I have proposed to improve
hash indexes.  However, each time I try to update them, I find that it
is easier to follow the code than to read and understand the existing
algorithm written in the above form in the README.

Do others find it useful to maintain the algorithms in this form?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-16 Thread Amit Kapila
On Thu, Sep 15, 2016 at 10:38 PM, Jesper Pedersen
<jesper.peder...@redhat.com> wrote:
> On 09/15/2016 02:03 AM, Amit Kapila wrote:
>>>
>>> Same thing here - where the fields involving the hash index aren't
>>> updated.
>>>
>>
>> Do you mean that for such cases also you see 40-60% gain?
>>
>
> No, UPDATEs are around 10-20% for our cases.
>

Okay, good to know.

>>
>> It might be useful to test with higher number of rows because with so
>> less data contention is not visible,
>
>
> Attached is a run with 1000 rows.
>

I think 1000 is also too few; you probably want to run it with 100,000
or more rows.  I suspect that the reason you are seeing the large
difference between btree and hash index is that the range of values is
narrow and there may be many overflow pages.

>>
>
> I think for CHI is would be Robert's and others feedback. For WAL, there is
> [1].
>

I have addressed your feedback for WAL and posted the patch.  I think
the remaining thing to handle for the Concurrent Hash Index patch is to
remove the usage of hashscan.c from the code, if no one objects to it;
do let me know if I am missing something here.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-16 Thread Amit Kapila
On Thu, Sep 15, 2016 at 7:53 PM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> On 2016-05-10 17:39:22 +0530, Amit Kapila wrote:
>> For making hash indexes usable in production systems, we need to improve
>> its concurrency and make them crash-safe by WAL logging them.
>
> One earlier question about this is whether that is actually a worthwhile
> goal.  Are the speed and space benefits big enough in the general case?
>

I think there will surely be speed benefits w.r.t. reads for larger
indexes.  All our measurements till now have shown a benefit varying
from 30~60% (for reads) with hash indexes as compared to btree, and I
think it could be even more if we further increase the size of the
index.  On the space front, I have not done any detailed study, so it
is not right to conclude anything, but it appears to me that if the
index is on a char/varchar column where the key size is 10 or 20 bytes,
hash indexes should be beneficial as they store just the hash key.

> Could those benefits not be achieved in a more maintainable manner by
> adding a layer that uses a btree over hash(columns), and adds
> appropriate rechecks after heap scans?
>

I don't think that can be faster for reads than using a real hash
index, but surely one can use it as a workaround.

> Note that I'm not saying that hash indexes are not worthwhile, I'm just
> doubtful that question has been explored sufficiently.
>

I think hash indexes are theoretically faster than btree, considering
their lookup complexity (O(1) vs. O(log n)); also, the results after
the recent optimizations indicate that hash indexes are faster than
btree for equality searches.  I am not saying that after the recent set
of patches proposed for hash indexes they will be better in all kinds
of cases.  They could be beneficial for cases where the indexed columns
are not updated heavily.

I think one can definitely argue that we could make some optimizations
in btree and make it equivalent to or better than hash indexes, but I
am not sure that is possible for all kinds of use cases.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-15 Thread Amit Kapila
On Thu, Sep 15, 2016 at 7:25 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Sep 15, 2016 at 2:13 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> One other point, I would like to discuss is that currently, we have a
>> concept for tracking active hash scans (hashscan.c) which I think is
>> mainly to protect splits when the backend which is trying to split has
>> some scan open. You can read "Other Notes" section of
>> access/hash/README for further details.  I think after this patch we
>> don't need that mechanism for splits because we always retain a pin on
>> bucket buffer till all the tuples are fetched or scan is finished
>> which will defend against a split by our own backend which tries to
>> ensure cleanup lock on bucket.
>
> Hmm, yeah.  It seems like we can remove it.
>
>> However, we might need it for vacuum
>> (hashbulkdelete), if we want to get rid of cleanup lock in vacuum,
>> once we have a page-at-a-time scan mode implemented for hash indexes.
>> If you agree with above analysis, then we can remove the checks for
>> _hash_has_active_scan from both vacuum and split path and also remove
>> corresponding code from hashbegin/end scan, but retain that hashscan.c
>> for future improvements.
>
> Do you have a plan for that?  I'd be inclined to just blow away
> hashscan.c if we don't need it any more, unless you're pretty sure
> it's going to get reused.  It's not like we can't pull it back out of
> git if we decide we want it back after all.
>

I do want to work on it, but it is always possible that due to some
other work this might get delayed.  Also, I think there is always a
chance that while doing that work we face some problem due to which we
might not be able to use that optimization.  So I will go with your
suggestion of removing hashscan.c and its usage for now, and then if
required we will pull it back.  If nobody thinks otherwise, I will
update this in the next patch version.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-15 Thread Amit Kapila
One other point, I would like to discuss is that currently, we have a
concept for tracking active hash scans (hashscan.c) which I think is
mainly to protect splits when the backend which is trying to split has
some scan open. You can read "Other Notes" section of
access/hash/README for further details.  I think after this patch we
don't need that mechanism for splits because we always retain a pin on
bucket buffer till all the tuples are fetched or scan is finished
which will defend against a split by our own backend which tries to
ensure cleanup lock on bucket.  However, we might need it for vacuum
(hashbulkdelete), if we want to get rid of cleanup lock in vacuum,
once we have a page-at-a-time scan mode implemented for hash indexes.
If you agree with above analysis, then we can remove the checks for
_hash_has_active_scan from both vacuum and split path and also remove
corresponding code from hashbegin/end scan, but retain that hashscan.c
for future improvements.

I am posting this as a separate mail to avoid it getting lost as one
of the points in long list of review points discussed.

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-15 Thread Amit Kapila
On Thu, Sep 15, 2016 at 12:43 AM, Jesper Pedersen
<jesper.peder...@redhat.com> wrote:
> Hi,
>
> On 09/14/2016 07:24 AM, Amit Kapila wrote:

>
>>> UPDATE also sees an improvement.
>>>
>>
>> Can you explain this more?  Is it more compare to HEAD or more as
>> compare to Btree?  Isn't this contradictory to what the test in below
>> mail shows?
>>
>
> Same thing here - where the fields involving the hash index aren't updated.
>

Do you mean that for such cases also you see 40-60% gain?

>
> I have done a run to look at the concurrency / TPS aspect of the
> implementation - to try something different than Mark's work on testing the
> pgbench setup.
>
> With definitions as above, with SELECT as
>
> -- select.sql --
> \set id random(1,10)
> BEGIN;
> SELECT * FROM test WHERE id = :id;
> COMMIT;
>
> and UPDATE/Indexed with an index on 'val', and finally UPDATE/Nonindexed w/o
> one.
>
> [1] [2] [3] is new_hash - old_hash is the existing hash implementation on
> master. btree is master too.
>
> Machine is a 28C/56T with 256Gb RAM with 2 x RAID10 SSD for data + wal.
> Clients ran with -M prepared.
>
> [1]
> https://www.postgresql.org/message-id/caa4ek1+erbp+7mdkkahjzwq_dtdkocbpt7lswfwcqvuhbxz...@mail.gmail.com
> [2]
> https://www.postgresql.org/message-id/CAD__OujvYghFX_XVkgRcJH4VcEbfJNSxySd9x=1wp5vylvk...@mail.gmail.com
> [3]
> https://www.postgresql.org/message-id/caa4ek1juyr_ab7bxfnsg5+jqhiwgklkgacfk9bfd4mlffk6...@mail.gmail.com
>
> Don't know if you find this useful due to the small number of rows, but let
> me know if there are other tests I can run, f.ex. bump the number of rows.
>

It might be useful to test with a higher number of rows because with so
little data the contention is not visible, but I think in general, with
your, Jeff's and my own tests, it is clear that there is a significant
win for read-only cases and for read-write cases where the index column
is not updated.  Also, we don't find any regression as compared to
HEAD, which is sufficient to prove the worth of the patch.  I think we
should not forget that one of the other main reasons for this patch is
to allow WAL logging for hash indexes.  I think for now we have done
sufficient tests for this patch to establish its benefit; if any
committer wants to see something more, we can surely do it.  I think
the important thing at this stage is to find out what more (if
anything) is left to make this patch "ready for committer".


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-14 Thread Amit Kapila
On Thu, Sep 15, 2016 at 4:44 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Tue, May 10, 2016 at 5:09 AM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>>
>>
>>
>> Although, I don't think it is a very good idea to take any performance
>> data with WIP patch, still I couldn't resist myself from doing so and below
>> are the performance numbers.  To get the performance data, I have dropped
>> the primary key constraint on pgbench_accounts and created a hash index on
>> aid column as below.
>>
>> alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
>> create index pgbench_accounts_pkey on pgbench_accounts using hash(aid);
>
>
>
> To be rigorously fair, you should probably replace the btree primary key
> with a non-unique btree index and use that in the btree comparison case.  I
> don't know how much difference that would make, probably none at all for a
> read-only case.
>
>>
>>
>>
>> Below data is for read-only pgbench test and is a median of 3 5-min runs.
>> The performance tests are executed on a power-8 m/c.
>
>
> With pgbench -S where everything fits in shared_buffers and the number of
> cores I have at my disposal, I am mostly benchmarking interprocess
> communication between pgbench and the backend.  I am impressed that you can
> detect any difference at all.
>
> For this type of thing, I like to create a server side function for use in
> benchmarking:
>
> create or replace function pgbench_query(scale integer,size integer)
> RETURNS integer AS $$
> DECLARE sum integer default 0;
> amount integer;
> account_id integer;
> BEGIN FOR i IN 1..size LOOP
>account_id=1+floor(random()*scale);
>SELECT abalance into strict amount FROM pgbench_accounts
>   WHERE aid = account_id;
>sum := sum + amount;
> END LOOP;
> return sum;
> END $$ LANGUAGE plpgsql;
>
> And then run using a command like this:
>
> pgbench -f <(echo 'select pgbench_query(40,1000)')  -c$j -j$j -T 300
>
> Where the first argument ('40', here) must be manually set to the same value
> as the scale-factor.
>
> With 8 cores and 8 clients, the values I get are, for btree, hash-head,
> hash-concurrent, hash-concurrent-cache, respectively:
>
> 598.2
> 577.4
> 668.7
> 664.6
>
> (each transaction involves 1000 select statements)
>
> So I do see that the concurrency patch is quite an improvement.  The cache
> patch does not produce a further improvement, which was somewhat surprising
> to me (I thought that that patch would really shine in a read-write
> workload, but I expected at least improvement in read only)
>

To see the benefit from the cache-the-meta-page patch, you might want
to test with more clients than the number of cores; at least that is
what the data from Mithun [1] indicates.  It might also need a somewhat
larger machine.


[1] - 
https://www.postgresql.org/message-id/CAD__OugX0aOa7qopz3d-nbBAoVmvSmdFJOX4mv5tFRpijqH47A%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-14 Thread Amit Kapila
On Thu, Sep 15, 2016 at 4:04 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Tue, Sep 13, 2016 at 9:31 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>>
>> ===
>>
>> +Vacuum acquires cleanup lock on bucket to remove the dead tuples and or
>> tuples
>> +that are moved due to split.  The need for cleanup lock to remove dead
>> tuples
>> +is to ensure that scans' returns correct results.  Scan that returns
>> multiple
>> +tuples from the same bucket page always restart the scan from the
>> previous
>> +offset number from which it has returned last tuple.
>>
>> Perhaps it would be better to teach scans to restart anywhere on the page,
>> than to force more cleanup locks to be taken?
>
>
> Commenting on one of my own questions:
>
> This won't work when the vacuum removes the tuple which an existing scan is
> currently examining and thus will be used to re-find it's position when it
> realizes it is not visible and so takes up the scan again.
>
> The index tuples in a page are stored sorted just by hash value, not by the
> combination of (hash value, tid).  If they were sorted by both, we could
> re-find our position even if the tuple had been removed, because we would
> know to start at the slot adjacent to where the missing tuple would be were
> it not removed. But unless we are willing to break pg_upgrade, there is no
> feasible way to change that now.
>

I think it is possible without breaking pg_upgrade if we match all the
items of a page at once (and save them in a local copy), rather than
matching item-by-item as we do now.  We already do something similar
for btree; refer to the explanation of BTScanPosItem and BTScanPosData
in nbtree.h.
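
Just to illustrate, a hypothetical sketch of what that could look like
for hash indexes, modeled on BTScanPosItem/BTScanPosData from nbtree.h
(the Hash* names below don't exist today):

typedef struct HashScanPosItem
{
    ItemPointerData heapTid;        /* TID of referenced heap item */
    OffsetNumber    indexOffset;    /* index item's location within page */
} HashScanPosItem;

typedef struct HashScanPosData
{
    Buffer          buf;            /* currently pinned bucket/overflow page */
    int             firstItem;      /* first valid entry in items[] */
    int             lastItem;       /* last valid entry in items[] */
    int             itemIndex;      /* current entry out of items[] */

    /*
     * All matching items from the current page, collected in one pass
     * while the page is locked; a concurrent vacuum removing one of
     * them later then cannot break our ability to continue the scan.
     */
    HashScanPosItem items[MaxIndexTuplesPerPage];
} HashScanPosData;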


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-14 Thread Amit Kapila
On Wed, Sep 14, 2016 at 4:36 PM, Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
> Hi All,
>
> Below is the backtrace for the issue reported in my earlier mail [1].
> From the callstack it looks like we are trying to release  lock on a
> meta page twice in _hash_expandtable().
>

Thanks for the call stack.  I think the code below from the patch is
the culprit.  Here we have already released the meta page lock, and
then on failure we try to release it again.

_hash_expandtable()
{
..

/* Release the metapage lock, before completing the split. */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
..

if (!buf_nblkno)
{
_hash_relbuf(rel, buf_oblkno);
goto fail;
}
..
fail:
/* We didn't write the metapage, so just drop lock */
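/* XXX but the lock was in fact already dropped above, so the
 * !buf_nblkno path ends up releasing it a second time */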
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
}

This is a problem in the concurrent hash index patch.  I will send the
fix in the next version of the patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-14 Thread Amit Kapila
On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen
<jesper.peder...@redhat.com> wrote:
> On 09/13/2016 07:26 AM, Amit Kapila wrote:
>>
>> Attached, new version of patch which contains the fix for problem
>> reported on write-ahead-log of hash index thread [1].
>>
>
> I have been testing patch in various scenarios, and it has a positive
> performance impact in some cases.
>
> This is especially seen in cases where the values of the indexed column are
> unique - SELECTs can see a 40-60% benefit over a similar query using b-tree.
>

Here, I think it would be better to also have data comparing hash
indexes against HEAD.  What I mean to say is that you are claiming that
after the hash index improvements the SELECT workload is 40-60% better,
but where do we stand as of HEAD?

> UPDATE also sees an improvement.
>

Can you explain this more?  Is it more compared to HEAD or more
compared to btree?  Isn't this contradictory to what the test in the
mail below shows?

> In cases where the indexed column value isn't unique, it takes a long time
> to build the index due to the overflow page creation.
>
> Also in cases where the index column is updated with a high number of
> clients, ala
>
> -- ddl.sql --
> CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
> CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
> CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
> ANALYZE;
>
> -- test.sql --
> \set id random(1,10)
> \set val random(0,10)
> BEGIN;
> UPDATE test SET val = :val WHERE id = :id;
> COMMIT;
>
> w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for
> hash, and 10062 tps for b-tree).
>

Thanks for doing the tests.  Have you applied both the concurrent index
and the cache-the-meta-page patches for these tests?  From the above
tests we can say that, with this set of patches, read-only workloads
will be significantly improved, even better than btree in quite a few
useful cases.  However, when the indexed column is updated there is
still a large gap as compared to btree (what about the case where the
indexed column is not updated in a read-write transaction, as in our
pgbench read-write transactions; did you by any chance run such a
test?).  I think we need to focus on improving cases where index
columns are updated, but it is better to do that work as a separate
patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-13 Thread Amit Kapila
On Tue, Sep 13, 2016 at 5:46 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Sep 8, 2016 at 12:32 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Hmm.  I think page or block is a concept of database systems and
>> buckets is a general concept used in hashing technology.  I think the
>> difference is that there are primary buckets and overflow buckets. I
>> have checked how they are referred in one of the wiki pages [1],
>> search for overflow on that wiki page. Now, I think we shouldn't be
>> inconsistent in using them. I will change to make it same if I find
>> any inconsistency based on what you or other people think is the
>> better way to refer overflow space.
>
> In the existing source code, the terminology 'overflow page' is
> clearly preferred to 'overflow bucket'.
>

Okay, point taken.  Will update it in the next version of the patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] _hash_alloc_buckets() safety

2016-09-13 Thread Amit Kapila
On Tue, Sep 13, 2016 at 6:58 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>> While working on write-ahead-logging of hash indexes, I noticed that
>> this function allocates buckets in batches and the mechanism it uses
>> is that it initialize the last page of batch with zeros and expect
>> that the filesystem will ensure the intervening pages read as zeroes
>> too.
>
> Yes.  AFAIK that filesystem behavior is required by POSIX.
>
>> I think to make it WAL enabled, we need to initialize the page header
>> (using PageInit() or equivalent) instead of initializing it with
>> zeroes as some part of our WAL replay machinery expects that the page
>> should not be new as indicated by me in other thread [1].
>
> I don't really see why that's a problem.  The only way one of the fill
> pages would get to be not-zero is if there is a WAL action later in the
> stream that overwrites it.  So how would things become inconsistent?
>

It could lead to an error/log message indicating an "unexpected zero
page".  Refer to the below code in XLogReadBufferExtended():

if (mode == RBM_NORMAL)
{

/* check that page has been initialized */
Page page = (Page) BufferGetPage(buffer);
..
if (PageIsNew(page))
{
..
log_invalid_page(rnode, forknum, blkno, true);
}

Now, during replay of XLOG_FPI (the WAL record for a new page written
by log_newpage), it won't hit that condition because the caller ensures
that if the block has an image, we use the RBM_ZERO_AND_CLEANUP_LOCK or
RBM_ZERO_AND_LOCK buffer mode.  However, a generic reader of blocks
like pgstattuple
(pgstattuple->pgstat_hash_page->_hash_getbuf_with_strategy->_hash_checkpage())
or the new consistency-check tool [1] being developed by Kuntal would
lead to an error/log message of the form ".. contains unexpected zero
page at block ..".  I think the consistency-check tool might use some
checks to avoid it, but I am not sure.


>> Offhand, I don't see any problem with just
>> initializing the last page and write the WAL for same with
>> log_newpage(), however if we try to initialize all pages, there could
>> be some performance penalty on split operation.
>
> "Some" seems like rather an understatement.  And it's not just the
> added I/O, it's the fact that you'd need to lock each bucket as you
> went through them to avoid clobbering concurrently-inserted data.
>

The question of concurrent data insertion can only come into the
picture when we update the metapage information (maxbucket, lowmask,
highmask, ..).  Without that, I don't think concurrent users would ever
be aware of the new buckets.  However, as this work is done with the
metapage locked, it is not advisable to do more work here unless it is
required.

> If you weren't talking about such an enormous penalty, I might be okay
> with zeroing the intervening pages explicitly rather than depending on
> the filesystem to do it.  But since you are, I think you need a clearer
> explanation of why this is necessary.
>

I am also not in favor of doing the extra work unless it is required.
The point of this mail is to check if anybody sees that it is
required, because I can't find a reason to initialize all pages.


[1] - 
https://www.postgresql.org/message-id/CAGz5QCLC=0jbmgag-barpyaavu9cn0b-scoepecgbe6rhh8...@mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] _hash_alloc_buckets() safety

2016-09-13 Thread Amit Kapila
While working on write-ahead logging of hash indexes, I noticed that
this function allocates buckets in batches, and the mechanism it uses
is to initialize the last page of the batch with zeros and expect that
the filesystem will ensure the intervening pages read back as zeroes
too.

I think that to make it WAL-enabled, we need to initialize the page
header (using PageInit() or equivalent) instead of initializing it with
zeroes, as some parts of our WAL replay machinery expect that the page
should not be new, as indicated by me in another thread [1].  I think
the WAL consistency check tool [2] also uses the same replay functions
and will show this as a problem if we don't initialize the page header.

The point which is not clear to me is whether it is okay as-is, or
whether we should try to initialize each page of the batch during
_hash_alloc_buckets(), considering that we are now trying to make hash
indexes WAL-enabled.  Offhand, I don't see any problem with just
initializing the last page and writing the WAL for it with
log_newpage(); however, if we try to initialize all pages, there could
be some performance penalty on the split operation.
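
For what it's worth, a hedged sketch of the "initialize just the last
page" variant (the function below is made up for illustration;
PageInit(), log_newpage() and smgrextend() are the existing primitives
referred to):

static void
hash_alloc_last_page_sketch(Relation rel, BlockNumber lastblock)
{
    char        pagebuf[BLCKSZ];
    Page        page = (Page) pagebuf;

    /* give the page a valid header instead of leaving it all-zero */
    PageInit(page, BLCKSZ, 0);

    /* WAL-log it, so that replay recreates an initialized (non-new) page */
    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        recptr = log_newpage(&rel->rd_node, MAIN_FORKNUM, lastblock,
                             page, true);
        PageSetLSN(page, recptr);
    }

    /* extend the relation; the intervening pages still read back as zeroes */
    RelationOpenSmgr(rel);
    smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, pagebuf, false);
}

That keeps the split path doing a single page write plus one WAL
record, rather than touching every page of the batch.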

Thoughts?


[1] - 
https://www.postgresql.org/message-id/CAA4eK1JS%2BSiRSQBzEFpnsSmxZKingrRH7WNyWULJeEJSj1-%3D0w%40mail.gmail.com
[2] - https://commitfest.postgresql.org/10/741/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-13 Thread Amit Kapila
Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JuKt%3D-%3DY0FheiFL-i8Z5_5660%3D3n8JUA8s3zG53t_ArQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


concurrent_hash_index_v6.patch
Description: Binary data



Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-13 Thread Amit Kapila
On Mon, Sep 12, 2016 at 11:29 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>
> My test program (as posted) injects crashes and then checks the
> post-crash-recovery system for consistency, so it cannot be run as-is
> without the WAL patch.  I also ran the test with crashing turned off (just
> change the JJ* variables at the stop of the do.sh to all be set to the empty
> string), and in that case I didn't see either problem, but it it could just
> be that I that I didn't run it long enough.
>
> It should have been long enough to detect the rather common "buffer  is
> not owned by resource owner Portal" problem, so that one I think is specific
> to the WAL patch (probably the part which tries to complete bucket splits
> when it detects one was started but not completed?)
>

Both problems seem to be due to the same reason, and it is in the
concurrent hash index patch.  What is happening here is that we are
trying to unpin the old bucket buffer twice.  We need to scan the old
bucket when there is an incomplete split; the issue is that the system
crashed during a split, and after restart, during the old bucket scan,
it tries to unpin the old primary bucket buffer twice when the new
bucket has additional overflow pages.  I will post the updated patch on
the concurrent hash index thread.  Once I post the patch, it would be
good if you could try to reproduce the issue once more.

Thanks to Ashutosh Sharma, who shared the call stack offlist after
reproducing the problem.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-12 Thread Amit Kapila
On Tue, Sep 13, 2016 at 3:58 AM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:
> On 13/09/16 01:20, Jesper Pedersen wrote:
>>
>> On 09/01/2016 11:55 PM, Amit Kapila wrote:
>>>
>>> I have fixed all other issues you have raised.  Updated patch is
>>> attached with this mail.
>>>
>>
>> The following script hangs on idx_val creation - just with v5, WAL patch
>> not applied.
>
>
> Are you sure it is actually hanging? I see 100% cpu for a few minutes but
> the index eventually completes ok for me (v5 patch applied to today's
> master).
>

It completed for me as well.  The second index creation takes more time
and CPU because it is inserting only duplicate values, which need a lot
of overflow pages.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-11 Thread Amit Kapila
On Sun, Sep 11, 2016 at 3:01 PM, Mark Kirkwood
<mark.kirkw...@catalyst.net.nz> wrote:
> On 11/09/16 19:16, Mark Kirkwood wrote:
>
>>
>>
>> On 11/09/16 17:01, Amit Kapila wrote:
>>>
>>> ...Do you think we can do some read-only
>>> workload benchmarking using this server?  If yes, then probably you
>>> can use concurrent hash index patch [1] and cache the metapage patch
>>> [2] (I think Mithun needs to rebase his patch) to do so.
>>>
>>>
>>>
>>> [1] -
>>> https://www.postgresql.org/message-id/caa4ek1j6b8o4pcepqrxnyblvbftonmjeem+qn0jzx31-obx...@mail.gmail.com
>>> [2] -
>>> https://www.postgresql.org/message-id/cad__ouhj29cebif_flge4t9vj_-cfxbwcxhjo+d_16txbem...@mail.gmail.com
>>>
>>>
>>
>> I can do - are we checking checking for hangs/assertions or comparing
>> patched vs unpatched performance (for the metapage patch)?
>>
>>

The first step is to find any bugs, and then do performance testing.
What I wanted for performance testing was to compare HEAD against all
three patches (all three together) for hash indexes (concurrent hash
index, cache the meta page, WAL for hash index).  For HEAD, we want two
sets of numbers, one with hash indexes and another with btree indexes.
As Jeff has found a few problems, I think it is better to fix those
first before going for performance tests.

>
> So, assuming the latter - testing performance with and without the metapage
> patch:
>
> For my 1st runs:
>
> - cpus 16, ran 16G
> - size 100, clients 32
>
> I'm seeing no difference in performance for read only (-S) pgbench workload
> (with everybody using has indexes). I guess not that surprising as the db
> fites in ram (1.6G and we have 16G). So I'll retry with a bigger dataset
> (suspect size 2000 is needed).
>

I think you should try with -S -M prepared and with various client
counts (8,16,32,64,100...).


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-11 Thread Amit Kapila
On Mon, Sep 12, 2016 at 7:00 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>>
>> I plan to do testing using my own testing harness after changing it to
>> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
>> column, which are never queried by the core part of the harness) and
>> deleting them at random intervals.  I think that none of pgbench's built in
>> tests are likely to give the bucket splitting and squeezing code very much
>> exercise.
>
>
>
> I've implemented this, by adding lines 197 through 202 to the count.pl
> script.  (I'm reattaching the test case)
>
> Within a few minutes of testing, I start getting Errors like these:
>
> 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR:  buffer 2762 is not
> owned by resource owner Portal
> 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT:  update foo set
> count=count+1 where index=$1
>
>
> In one test, I also got an error from my test harness itself indicating
> tuples are transiently missing from the index, starting an hour into a test:
>
> child abnormal exit update did not update 1 row: key 9555 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
> child abnormal exit update did not update 1 row: key 8870 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
> child abnormal exit update did not update 1 row: key 8453 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
>
> Those key values should always find exactly one row to update.
>
> If the tuples were permanently missing from the index, I would keep getting
> errors on the same key values very frequently.  But I don't get that, the
> errors remain infrequent and are on different value each time, so I think
> the tuples are in the index but the scan somehow misses them, either while
> the bucket is being split or while it is being squeezed.
>
> This on a build without enable-asserts.
>
> Any ideas on how best to go about investigating this?
>

I think these symptoms indicate a bug in the concurrent hash index
patch, but it could be that the problem can only be revealed with the
WAL patch.  Is it possible to try this with just the concurrent hash
index patch?  In any case, thanks for testing it; I will look into
these issues.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



