Re: [HACKERS] UPDATE of partition key

2017-05-18 Thread Amit Kapila
On Wed, May 17, 2017 at 4:05 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Wed, May 17, 2017 at 3:15 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> Earlier I thought that option1 was better, but later I think that this
>>> can complicate the situation, as we are firing first the BR update and
>>> then the BR delete trigger and can change the row multiple times, and
>>> defining such behaviour can be complicated.
>>>
>>
>> If we have to go by this theory, then the option you have preferred
>> will still execute BR triggers for both delete and insert, so the input
>> row can still be changed twice.
>
> Yeah, right; as per my theory above, Option3 has the same problem.
>
> But after putting some more thought into it, I realised that the row
> can be changed only by a "Before Update" or "Before Insert" trigger.
>

Before Row Delete triggers can suppress the delete operation itself,
which is unintended in this case.  I don't think it is advisable to
execute multiple BR triggers without the user being aware of it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-05-17 Thread Amit Kapila
 On Wed, May 17, 2017 at 5:17 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, May 17, 2017 at 6:29 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> I think we can do this even without using an additional infomask bit.
>> As suggested by Greg up thread, we can set InvalidBlockId in ctid to
>> indicate such an update.
>
> Hmm.  How would that work?
>

We can pass a flag, say row_moved (or require_row_movement), to
heap_delete, which will in turn set InvalidBlockId in the ctid instead
of setting it to self.  Then ExecUpdate needs to check for that and
return an error when heap_update is not successful (result !=
HeapTupleMayBeUpdated).  Can you explain what difficulty you are
envisioning?
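
To make the idea a bit more concrete, a rough sketch (only an
illustration, not a worked-out patch; the row_moved flag, the exact
call sites, and the error text are assumptions for the example's sake)
could look like this:

/* In heap_delete(), when the caller indicates the row is being moved: */
if (row_moved)
    ItemPointerSetBlockNumber(&tp.t_data->t_ctid, InvalidBlockNumber);
else
    tp.t_data->t_ctid = tp.t_self;  /* existing behaviour */

/*
 * A concurrent UPDATE chasing the ctid chain (the HeapTupleUpdated path
 * in ExecUpdate) could then distinguish "row moved to another partition"
 * from a plain delete and raise an error instead of silently doing
 * nothing:
 */
if (!BlockNumberIsValid(ItemPointerGetBlockNumber(&hufd.ctid)))
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("tuple to be updated was moved to another partition")));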

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-05-17 Thread Amit Kapila
On Tue, May 16, 2017 at 9:39 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sun, May 7, 2017 at 11:54 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> I'm having second thoughts based on some more experimentation I did
>> this morning.  I'll update again once I've had a bit more time to poke
>> at it.
>
> So the issue that I noticed here is that this problem really isn't
> confined to abort processing.  If we ROLLBACK TO SAVEPOINT or ABORT
> TRANSACTION on an open connection, we really do not know what the
> state of that connection is until we get an acknowledgement that the
> command completed successfully.  The patch handles that.  However, the
> same is true if we're sending a SAVEPOINT or RELEASE SAVEPOINT
> command, and the patch that I posted doesn't do anything about those
> cases.  I think it would be best to fix all transaction control
> commands in a symmetric manner.
>

+1.  Why not have similar behavior for any other statements executed in
this module via do_sql_command?
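
To make that concrete, here is a minimal sketch at the libpq level of
"send a transaction-control command, but give up after a deadline and
treat the connection state as unknown" (this is not the actual
postgres_fdw code; the function name, the one-second poll tick, and the
deadline handling are purely illustrative):

#include <stdbool.h>
#include <time.h>
#include <sys/select.h>
#include <libpq-fe.h>

static bool
exec_with_deadline(PGconn *conn, const char *sql, int deadline_secs)
{
    time_t      deadline = time(NULL) + deadline_secs;
    PGresult   *res;
    bool        ok;

    if (!PQsendQuery(conn, sql))
        return false;

    /* Wait until the first result is ready, or give up at the deadline. */
    while (PQisBusy(conn))
    {
        fd_set          rfds;
        struct timeval  tv = {1, 0};

        if (time(NULL) >= deadline)
            return false;   /* state unknown; caller should drop the connection */

        FD_ZERO(&rfds);
        FD_SET(PQsocket(conn), &rfds);
        (void) select(PQsocket(conn) + 1, &rfds, NULL, NULL, &tv);

        if (!PQconsumeInput(conn))
            return false;
    }

    res = PQgetResult(conn);
    ok = (res != NULL && PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);

    /* Drain the trailing NULL result (normally already buffered by now). */
    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);

    return ok;
}

On failure the caller would discard the connection, precisely so that
the next transaction starts from a known state.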

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-05-17 Thread Amit Kapila
On Mon, May 15, 2017 at 5:28 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, May 12, 2017 at 3:07 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> I agree with you that it might not be straightforward to make it work,
>> but now that the earliest it can go in is v11, do we want to try doing
>> something other than just documenting it?  What I could read from this
>> e-mail thread is that you intend to just document it for the first cut
>> of this feature.  However, both Greg and Simon are of the opinion that
>> we should do something about this, and even the patch author
>> (Amit Khandekar) has shown some inclination to do something about this
>> point (return an error to the user in some way), so I think we can't
>> ignore it.
>>
>> I think now that we have some more time, it is better to try something
>> based on a couple of ideas floating in this thread to address this
>> point and see if we can come up with something doable without a big
>> architecture change.
>>
>> What is your take on this point now?
>
> I still don't think it's worth spending a bit on this, especially not
> with WARM probably gobbling up multiple bits.  Reclaiming the bits
> seems like a good idea, but spending one on this still seems to me
> like it's probably not the best use of our increasingly-limited supply
> of infomask bits.
>

I think we can do this even without using an additional infomask bit.
As suggested by Greg up thread, we can set InvalidBlockId in ctid to
indicate such an update.

>  Now, Simon and Greg may still feel otherwise, of
> course.
>
> I could get behind providing an option to turn this behavior on and
> off at the level of the partitioned table.
>

Sure, that sounds like a viable option, and we can set the default
value to false.  However, it might be better if we can detect this
internally without big changes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-05-17 Thread Amit Kapila
On Wed, May 17, 2017 at 12:06 PM, Dilip Kumar <dilipbal...@gmail.com> wrote:
> On Fri, May 12, 2017 at 4:17 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> Option 3
>> 
>>
>> BR, AR delete triggers on source partition
>> BR, AR insert triggers on destination partition.
>>
>> Rationale :
>> Since the update is converted to delete+insert, just skip the update
>> triggers completely.
>
> +1 to option3
>
..
> Earlier I thought that option1 was better, but later I think that this
> can complicate the situation, as we are firing first the BR update and
> then the BR delete trigger and can change the row multiple times, and
> defining such behaviour can be complicated.
>

If we have to go by this theory, then the option you have preferred
will still execute BR triggers for both delete and insert, so the input
row can still be changed twice.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Increasing parallel workers at runtime

2017-05-17 Thread Amit Kapila
On Tue, May 16, 2017 at 2:14 PM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> On Mon, May 15, 2017 at 9:23 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
> Also, looking at the patch, it doesn't look like it takes enough care
> to build the execution state of a new worker so that it can participate
> in a running query. I may be wrong, but aren't the execution state
> initialization routines written with the assumption that all the
> workers start simultaneously?
>

No such assumption; workers started later can also join the execution
of the query.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Increasing parallel workers at runtime

2017-05-17 Thread Amit Kapila
On Wed, May 17, 2017 at 5:45 AM, Haribabu Kommi
<kommi.harib...@gmail.com> wrote:
>
>
> On Wed, May 17, 2017 at 2:35 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>
>
> Ok. In this approach, we need to share some of the gatherstate structure
> members in shared memory so that they can be accessed by the other
> background process, or have some better logic to update these details
> whenever a new worker gets allotted.
>

I don't think you need to share that stuff, as we initialize the
dsm/queues based on the planned number of workers.  One thing you might
want to consider is whether this new background worker can detect that
the query has already finished before launching workers.  Yet another
way to look at this problem is to have alternative non-parallel plans
corresponding to parallel plans.  As of now, we only have one plan
during execution of a parallel plan, and if we don't have enough
workers, we have no choice but to proceed with that plan; however, if
we had some alternative plan which is non-parallel, it might be better
to use it when we can't allocate enough workers for the query.  IIRC,
this has been discussed previously during the development of the
Parallel Sequential Scan patch, and I think some other databases use a
similar technique for this problem.




-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] bumping HASH_VERSION to 3

2017-05-16 Thread Amit Kapila
On Tue, May 16, 2017 at 5:14 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, May 16, 2017 at 7:31 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
>> I will send an updated patch once we agree on above points.
>
> Sounds good.
>

Attached patch addresses all the comments as discussed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


pg_upgrade_invalidate_old_hash_index_v2.patch
Description: Binary data



Re: [HACKERS] NOT NULL constraints on range partition key columns

2017-05-16 Thread Amit Kapila
On Tue, May 16, 2017 at 3:26 PM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/05/16 4:29, Robert Haas wrote:
>> On Mon, May 15, 2017 at 9:12 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> Can't we allow NULL to get inserted into the partition (leaf
>>> partition) if the user uses the partition name in Insert statement?
>>
>> That would be terrible behavior - the behavior of tuple routing should
>> match the enforced constraints.
>>

Right and that makes sense.

>>> For root partitions, I think for now giving an error is okay, but once
>>> we have default partitions (Rahila's patch), we can route NULLS to
>>> default partition.
>>
>> Yeah, that's exactly why I think we should make the change Amit is
>> proposing here.  If we don't, then we won't be able to accept NULL
>> values even after we have the default partitioning stuff.
>
> Attached is a patch for consideration.  There are 2 actually, but maybe
> they should be committed together if we decide do go with this.
>

After your changes in get_qual_for_range(), the below comment in that
function needs an update, and I feel it is better to update the
function header as well.

/*
* A range-partitioned table does not allow partition keys to be null. For
* simple columns, their NOT NULL constraint suffices for the enforcement
* of non-nullability.  But for the expression keys, which are still
* nullable, we must emit a IS NOT NULL expression.  Collect them in
* result first.
*/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] bumping HASH_VERSION to 3

2017-05-16 Thread Amit Kapila
On Mon, May 15, 2017 at 8:57 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, May 15, 2017 at 12:08 AM, Noah Misch <n...@leadboat.com> wrote:
>> The above-described topic is currently a PostgreSQL 10 open item.  Robert,
>> since you committed the patch believed to have created it, you own this open
>> item.  If some other commit is more relevant or if this does not belong as a
>> v10 open item, please let us know.  Otherwise, please observe the policy on
>> open item ownership[1] and send a status update within three calendar days of
>> this message.  Include a date for your subsequent status update.  Testers may
>> discover new open items at any time, and I want to plan to get them all fixed
>> well in advance of shipping v10.  Consequently, I will appreciate your 
>> efforts
>> toward speedy resolution.  Thanks.
>
> I don't believe this patch can be committed to beta1 which wraps in
> just a few hours; it seems to need a bit of work.  I'll update again
> by Friday based on how review unfolds this week.  I anticipate
> committing something on Wednesday or Thursday unless Bruce picks this
> one up.
>
> +/* find hash and gin indexes */
> +res = executeQueryOrDie(conn,
> +"SELECT n.nspname, c.relname "
> +"FROM   pg_catalog.pg_class c, "
> +"   pg_catalog.pg_index i, "
> +"   pg_catalog.pg_am a, "
> +"   pg_catalog.pg_namespace n "
> +"WHERE  i.indexrelid = c.oid AND "
> +"   c.relam = a.oid AND "
> +"   c.relnamespace = n.oid AND "
> +"   a.amname = 'hash'"
>
> Comment doesn't match code.
>
> +if (!check_mode && db_used)
> +/* mark hash indexes as invalid */
> +PQclear(executeQueryOrDie(conn,
>
> Given that we have a comment here, I'd put curly braces around this block.
>

Okay, will change.

> +snprintf(output_path, sizeof(output_path), "reindex_hash.sql");
>
> This looks suspiciously pointless.  The contents of output_path will
> always be precisely "reindex_hash.sql"; you could just char
> *output_path = "reindex_hash.sql" and get the same effect.
>

Sure, but the same code pattern is used in all other similar functions
in version.c; refer to new_9_0_populate_pg_largeobject_metadata().  Both
ways will serve the purpose; do you think it makes sense to write this
differently?

>  Compare
> this to the logic in create_script_for_cluster_analyze, which prepends
> SCRIPT_PREFIX.

That is required for .sh or .bat files, not for .sql files.  I think we
need to compare it with the logic in
new_9_0_populate_pg_largeobject_metadata().

>  (I am not sure why it's necessary to prepend "./" on
> Windows, but if it's needed in that case, maybe it's needed in this
> case, too.)
>

Hmm, "./" is required for non-windows, but as mentioned above, this is
not required for our case.

> +start_postmaster(&old_cluster, true);
> +old_9_6_invalidate_hash_indexes(&old_cluster, false);
> +stop_postmaster(false);
>
> Won't this cause the hash indexes to be invalidated in the old cluster
> rather than the new one?
>

Oops, a copy-paste mistake.  It passed in my testing because I had not
used any different options (like port number) for the old and new
servers.

> This might need a visit from pgindent in one or two places, too.
>

I had run pgindent before sending the previous version, but I will
verify it again.

I will send an updated patch once we agree on the above points.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] NOT NULL constraints on range partition key columns

2017-05-15 Thread Amit Kapila
On Mon, May 15, 2017 at 11:46 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> Starting a new thread to discuss the proposal I put forward in [1] to stop
> creating explicit NOT NULL constraint on range partition keys that are
> simple columns.  I said the following:
>
> On 2017/05/12 11:20, Robert Haas wrote:
>> On Thu, May 11, 2017 at 10:15 PM, Amit Langote
>> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> On 2017/05/12 10:42, Robert Haas wrote:
>>>> Actually, I think that not supporting nulls for range partitioning may
>>>> have been a fairly bad decision.
>>>
>>> I think the relevant discussion concluded [1] that way, because we
>>> couldn't decide which interface to provide for specifying where NULLs are
>>> placed or because we decided to think about it later.
>>
>> Yeah, but I have a feeling that marking the columns NOT NULL is going
>> to make it really hard to support that in the future when we get the
>> syntax hammered out.  If it had only affected the partition
>> constraints that'd be different.
>
> So, adding keycol IS NOT NULL (like we currently do for expressions) in
> the implicit partition constraint would be more future-proof than
> generating an actual catalogued NOT NULL constraint on the keycol?  I now
> tend to think it would be better.  Directly inserting into a range
> partition with a NULL value for a column currently generates a "null value
> in column \"%s\" violates not-null constraint" instead of perhaps more
> relevant "new row for relation \"%s\" violates partition constraint".
> That said, we *do* document the fact that a NOT NULL constraint is added
> on range key columns, but we might as well document instead that we don't
> currently support routing tuples with NULL values in the partition key
> through a range-partitioned table and so NULL values cause error.
>

Can't we allow a NULL to get inserted into the partition (leaf
partition) if the user uses the partition name in the INSERT statement?
For root partitions, I think giving an error is okay for now, but once
we have default partitions (Rahila's patch), we can route NULLs to the
default partition.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [PATCH v2] Progress command to monitor progression of long running SQL queries

2017-05-13 Thread Amit Kapila
On Wed, May 10, 2017 at 10:10 PM, Remi Colinet <remi.coli...@gmail.com> wrote:
>
> Parallel queries can also be monitored. The same mechanism is used to
> monitor child workers, with a slight difference: the main worker requests
> the child workers' progress directly in order to dump the whole progress
> tree in shared memory.
>

What if there is an error in the worker (like "out of memory") while
gathering the statistics?  It seems that both the workers and the main
backend will just error out.  I am not sure it is a good idea to error
out the backend or a parallel worker, as it will just end the query
execution.  Also, even if that is okay, there doesn't seem to be a way
for a parallel worker to communicate the error back to the master
backend; rather, it will just exit silently, which is not right.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2017-05-13 Thread Amit Kapila
On Fri, May 12, 2017 at 9:14 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote:
>>> ... I'd like to propose to change relation
>>> extension lock management so that it works using LWLock instead.
>
>> That's not a good idea because it'll make the code that executes while
>> holding that lock noninterruptible.
>
> Is that really a problem?  We typically only hold it over one kernel call,
> which ought to be noninterruptible anyway.
>

During parallel bulk load operations, I think we hold it over multiple
kernel calls.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2017-05-13 Thread Amit Kapila
On Thu, May 11, 2017 at 6:09 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> This work would be helpful not only for existing workload but also
> future works like some parallel utility commands, which is discussed
> on other threads[1]. At least for parallel vacuum, this feature helps
> to solve issue that the implementation of parallel vacuum has.
>
> I ran pgbench for 10 min three times(scale factor is 5000), here is a
> performance measurement result.
>
> clients   TPS(HEAD)   TPS(Patched)
> 4   2092.612   2031.277
> 8   3153.732   3046.789
> 16 4562.072   4625.419
> 32 6439.391   6479.526
> 64 7767.364   7779.636
> 100   7917.173   7906.567
>
> * 16 core Xeon E5620 2.4GHz
> * 32 GB RAM
> * ioDrive
>
> In current implementation, it seems there is no performance degradation so 
> far.
>

I think it is good to check pgbench, but we should also test bulk
loads, as this lock is stressed during such a workload.  Some of the
tests we did when we improved the performance of bulk load can be
found in an e-mail [1].

[1] -
https://www.postgresql.org/message-id/CAFiTN-tkX6gs-jL8VrPxg6OG9VUAKnObUq7r7pWQqASzdF5OwA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Functions

2017-05-12 Thread Amit Kapila
On Sat, May 13, 2017 at 1:08 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, May 12, 2017 at 2:45 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> Maybe a shorter argument for hash partitioning is that not one but two
> different people proposed patches for it within months of the initial
> partitioning patch going in.  When multiple people are thinking about
> implementing the same feature almost immediately after the
> prerequisite patches land, that's a good clue that it's a desirable
> feature.  So I think we should try to solve the problems, rather than
> giving up.
>

Can we think of defining separate portable hash functions which can be
used for the purpose of hash partitioning?
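
To be clear about what "portable" would mean: the function's result
must not depend on the platform's endianness, word size, compiler, or
build options, so that a given value always maps to the same partition
everywhere.  Purely to illustrate that property (this is an example,
not a proposal for the actual functions), a byte-oriented FNV-1a hash
behaves that way:

#include <stdint.h>
#include <stddef.h>

/* Same result on any architecture, since it consumes bytes in order. */
static uint64_t
portable_hash_bytes(const unsigned char *data, size_t len)
{
    uint64_t    h = UINT64_C(0xcbf29ce484222325);   /* FNV offset basis */
    size_t      i;

    for (i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= UINT64_C(0x100000001b3);               /* FNV prime */
    }
    return h;
}

The harder part, of course, is making the input bytes themselves
portable (integers in a fixed byte order, text in a canonical form),
which is where the datatype-dependent problems come in.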

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-05-12 Thread Amit Kapila
On Fri, May 12, 2017 at 10:49 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 12 May 2017 at 08:30, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan...@gmail.com> 
>> wrote:
>
>> If we try to compare it with the non-partitioned update,
>> there also it is internally a delete and insert operation, but we
>> don't fire triggers for those.
>
> For a non-partitioned table, the delete+insert is internal, whereas
> for partitioned table, it is completely visible to the user.
>

If the user has executed the update on the root table, then it is
transparent.  I think we can consider it user-visible only if there is
some user-visible syntax like "Update ... Move Row If Constraint Not
Satisfied".

>>
>>>> (b) It seems inconsistent to consider behavior for row and statement
>>>> triggers differently
>>>
>>> I am not sure whether we should compare row and statement triggers.
>>> Statement triggers are anyway fired only per-statement, depending upon
>>> whether it is update or insert or delete. It has nothing to do with
>>> how the rows are modified.
>>>
>>
>> Okay.  The broader point I was trying to convey was that the way this
>> patch defines the behavior of triggers doesn't sound good to me.  It
>> appears to me that in this thread multiple people have raised points
>> around trigger behavior and you should try to consider those.
>
> I understand that there is no single solution which will provide
> completely intuitive trigger behaviour. Skipping BR delete trigger
> should be fine. But then for consistency, we should skip BR insert
> trigger as well, the theory being that the delete+insert are not fired
> by the user so we should not fire them. But I feel both should be
> fired to avoid any consequences unexpected to the user who has
> installed those triggers.
>
> The only specific concern of yours is that of firing *both* update as
> well as insert triggers on the same table, right ? My explanation for
> this was : we have done this before for UPSERT, and we had documented
> the same. We can do it here also.
>
>> Apart from the options Robert has suggested, another option could be that
>> we allow firing BR-AR update triggers for the original partition and BR-AR
>> insert triggers for the new partition.  In this case, one can argue
>> that we have not actually updated the row in the original partition,
>> so there is no need to fire AR update triggers,
>
> Yes that's what I think. If there is no update happened, then AR
> update trigger should not be executed. AR triggers are only for
> scenarios where it is guaranteed that the DML operation has happened
> when the trigger is being executed.
>
>> but I feel that is what we do for non-partitioned table update and it should 
>> be okay here
>> as well.
>
> I don't think so. For e.g. if a BR trigger returns NULL, the update
> does not happen, and then the AR trigger does not fire as well. Do you
> see any other scenarios for non-partitioned tables, where AR triggers
> do fire when the update does not happen ?
>

No, but here also it can be considered as an update to the original partition.

>
> Overall, I am also open to skipping both insert+delete BR trigger,
>

I think it might be better to summarize all the options discussed,
including what the patch currently does, and see what most people
consider sensible.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-05-12 Thread Amit Kapila
On Fri, Feb 24, 2017 at 3:50 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Feb 24, 2017 at 3:24 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>
>> It is of course very good that we have something ready for this
>> release and can make a choice of what to do.
>>
>> Thoughts
>>
>> 1. Reuse the tuple state HEAP_MOVED_OFF, which IIRC represents
>> almost exactly the same thing. An UPDATE which gets to a
>> HEAP_MOVED_OFF tuple will know to re-find the tuple via the partition
>> metadata, or I might be persuaded that in-this-release it is
>> acceptable to fail when this occurs with an ERROR and a retryable
>> SQLCODE, since the UPDATE will succeed on next execution.
>
> I've got my doubts about whether we can make that bit work that way,
> considering that we still support pg_upgrade (possibly in multiple
> steps) from old releases that had VACUUM FULL.  We really ought to put
> some work into reclaiming those old bits, but there's probably no time
> for that in v10.
>

I agree with you that it might not be straightforward to make it work,
but now that the earliest it can go in is v11, do we want to try doing
something other than just documenting it?  What I could read from this
e-mail thread is that you intend to just document it for the first cut
of this feature.  However, both Greg and Simon are of the opinion that
we should do something about this, and even the patch author
(Amit Khandekar) has shown some inclination to do something about this
point (return an error to the user in some way), so I think we can't
ignore it.

I think now that we have some more time, it is better to try something
based on a couple of ideas floating in this thread to address this
point and see if we can come up with something doable without a big
architecture change.

What is your take on this point now?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Kapila
On Fri, May 12, 2017 at 9:27 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan...@gmail.com> 
> wrote:
>> On 11 May 2017 at 17:24, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> Few comments:
>>> 1.
>>> Operating directly on partition doesn't allow update to move row.
>>> Refer below example:
>>> create table t1(c1 int) partition by range(c1);
>>> create table t1_part_1 partition of t1 for values from (1) to (100);
>>> create table t1_part_2 partition of t1 for values from (100) to (200);
>>> insert into t1 values(generate_series(1,11));
>>> insert into t1 values(generate_series(110,120));
>>>
>>> postgres=# update t1_part_1 set c1=122 where c1=11;
>>> ERROR:  new row for relation "t1_part_1" violates partition constraint
>>> DETAIL:  Failing row contains (122).
>>
>> Yes, as Robert said, this is expected behaviour. We move the row only
>> within the partition subtree that has the update table as its root. In
>> this case, it's the leaf partition.
>>
>
> Okay, but what is the technical reason behind it?  Is it because the
> current design doesn't support it or is it because of something very
> fundamental to partitions?
>

One plausible theory is that since a SELECT on a partition returns just
the rows of that partition, the update should also behave in the same
way.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Kapila
On Thu, May 11, 2017 at 5:45 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 11 May 2017 at 17:24, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Few comments:
>> 1.
>> Operating directly on partition doesn't allow update to move row.
>> Refer below example:
>> create table t1(c1 int) partition by range(c1);
>> create table t1_part_1 partition of t1 for values from (1) to (100);
>> create table t1_part_2 partition of t1 for values from (100) to (200);
>> insert into t1 values(generate_series(1,11));
>> insert into t1 values(generate_series(110,120));
>>
>> postgres=# update t1_part_1 set c1=122 where c1=11;
>> ERROR:  new row for relation "t1_part_1" violates partition constraint
>> DETAIL:  Failing row contains (122).
>
> Yes, as Robert said, this is expected behaviour. We move the row only
> within the partition subtree that has the update table as its root. In
> this case, it's the leaf partition.
>

Okay, but what is the technical reason behind it?  Is it because the
current design doesn't support it, or is it because of something very
fundamental to partitions?  Is it because we can't find the root
partition from a leaf partition?

+ is_partitioned_table =
+ root_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
+
+ if (is_partitioned_table)
+ ExecSetupPartitionTupleRouting(
+ root_rel,
+ /* Build WITH CHECK OPTION constraints for leaf partitions */
+ ExecInitPartitionWithCheckOptions(mtstate, root_rel);
+ /* Build a projection for each leaf partition rel. */
+ ExecInitPartitionReturningProjection(mtstate, root_rel);
..
+ /* It's not a partitioned table after all; error out. */
+ ExecPartitionCheckEmitError(resultRelInfo, slot, estate);

If we are anyway going to give an error when the table is not a
partitioned table, then isn't it better to give it early, when we first
identify that?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Kapila
On Thu, May 11, 2017 at 5:41 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 11 May 2017 at 17:23, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan...@gmail.com> 
>> wrote:
>>> On 4 March 2017 at 12:49, Robert Haas <robertmh...@gmail.com> wrote:
>>>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan...@gmail.com> 
>>>> wrote:
>>>>> I think it does not make sense running after row triggers in case of
>>>>> row-movement. There is no update happened on that leaf partition. This
>>>>> reasoning can also apply to BR update triggers. But the reasons for
>>>>> having a BR trigger and AR triggers are quite different. Generally, a
>>>>> user needs to do some modifications to the row before getting the
>>>>> final NEW row into the database, and hence [s]he defines a BR trigger
>>>>> for that. And we can't just silently skip this step only because the
>>>>> final row went into some other partition; in fact the row-movement
>>>>> itself might depend on what the BR trigger did with the row. Whereas,
>>>>> AR triggers are typically written for doing some other operation once
>>>>> it is made sure the row is actually updated. In case of row-movement,
>>>>> it is not actually updated.
>>>>
>>>> How about running the BR update triggers for the old partition and the
>>>> AR update triggers for the new partition?  It seems weird to run BR
>>>> update triggers but not AR update triggers.  Another option would be
>>>> to run BR and AR delete triggers and then BR and AR insert triggers,
>>>> emphasizing the choice to treat this update as a delete + insert, but
>>>> (as Amit Kh. pointed out to me when we were in a room together this
>>>> week) that precludes using the BEFORE trigger to modify the row.
>>>>
>>
>> I also find the current behavior with respect to triggers quite odd.
>> The two points that appear odd are (a) executing both before-row
>> update and delete triggers on the original partition sounds quite odd.
> Note that *before* trigger gets fired *before* the update happens. The
> actual update may not even happen, depending upon what the trigger
> does. And then in our case, the update does not happen; not just that,
> it is transformed into delete-insert. So then we should fire
> before-delete trigger.
>

Sure, I am aware of that point, but it doesn't seem obvious that both
update and delete BR triggers get fired for the original partition.  As
a developer, it might be obvious to you that since you have used the
delete and insert interfaces, it is okay that the corresponding BR/AR
triggers get fired; however, it is not so obvious to others, and it
appears quite odd.  If we compare it with a non-partitioned update,
there also it is internally a delete and insert operation, but we don't
fire triggers for those.

>> (b) It seems inconsistent to consider behavior for row and statement
>> triggers differently
>
> I am not sure whether we should compare row and statement triggers.
> Statement triggers are anyway fired only per-statement, depending upon
> whether it is update or insert or delete. It has nothing to do with
> how the rows are modified.
>

Okay.  The broader point I was trying to convey was that the way this
patch defines the behavior of triggers doesn't sound good to me.  It
appears to me that multiple people in this thread have raised points
around trigger behavior, and you should try to consider those.  Apart
from the options Robert has suggested, another option could be that we
allow firing BR-AR update triggers for the original partition and BR-AR
insert triggers for the new partition.  In this case, one can argue
that we have not actually updated the row in the original partition, so
there is no need to fire AR update triggers, but I feel that is what we
do for a non-partitioned table update and it should be okay here as
well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Kapila
On Wed, May 3, 2017 at 11:22 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 2 May 2017 at 18:17, Robert Haas <robertmh...@gmail.com> wrote:
>> On Tue, Apr 4, 2017 at 7:11 AM, Amit Khandekar <amitdkhan...@gmail.com> 
>> wrote:
>>> Attached updated patch v7 has the above changes.
>>
>
> Attached is the rebased patch, which resolves the above conflicts.
>

Few comments:
1.
Operating directly on partition doesn't allow update to move row.
Refer below example:
create table t1(c1 int) partition by range(c1);
create table t1_part_1 partition of t1 for values from (1) to (100);
create table t1_part_2 partition of t1 for values from (100) to (200);
insert into t1 values(generate_series(1,11));
insert into t1 values(generate_series(110,120));

postgres=# update t1_part_1 set c1=122 where c1=11;
ERROR:  new row for relation "t1_part_1" violates partition constraint
DETAIL:  Failing row contains (122).

2.
-
+static void ExecInitPartitionWithCheckOptions(ModifyTableState *mtstate,
+  Relation root_rel);

Spurious line delete.

3.
+   longer satisfy the partition constraint of the containing partition. In that
+   case, if there is some other partition in the partition tree for which this
+   row satisfies its partition constraint, then the row is moved to that
+   partition. If there isn't such a partition, an error will occur.

Doesn't this error case indicate that this needs to be integrated with
the default partition patch from Rahila, or that that patch needs to
take care of this error case?  Basically, if there is no matching
partition, then move the row to the default partition.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-05-11 Thread Amit Kapila
On Fri, Mar 17, 2017 at 4:07 PM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> On 4 March 2017 at 12:49, Robert Haas <robertmh...@gmail.com> wrote:
>> On Thu, Mar 2, 2017 at 11:53 AM, Amit Khandekar <amitdkhan...@gmail.com> 
>> wrote:
>>> I think it does not make sense running after row triggers in case of
>>> row-movement. There is no update happened on that leaf partition. This
>>> reasoning can also apply to BR update triggers. But the reasons for
>>> having a BR trigger and AR triggers are quite different. Generally, a
>>> user needs to do some modifications to the row before getting the
>>> final NEW row into the database, and hence [s]he defines a BR trigger
>>> for that. And we can't just silently skip this step only because the
>>> final row went into some other partition; in fact the row-movement
>>> itself might depend on what the BR trigger did with the row. Whereas,
>>> AR triggers are typically written for doing some other operation once
>>> it is made sure the row is actually updated. In case of row-movement,
>>> it is not actually updated.
>>
>> How about running the BR update triggers for the old partition and the
>> AR update triggers for the new partition?  It seems weird to run BR
>> update triggers but not AR update triggers.  Another option would be
>> to run BR and AR delete triggers and then BR and AR insert triggers,
>> emphasizing the choice to treat this update as a delete + insert, but
>> (as Amit Kh. pointed out to me when we were in a room together this
>> week) that precludes using the BEFORE trigger to modify the row.
>>

I also find the current behavior with respect to triggers quite odd.
The two points that appear odd are: (a) executing both before-row
update and delete triggers on the original partition sounds quite odd;
(b) it seems inconsistent to consider the behavior for row and
statement triggers differently.

>
> I checked the trigger behaviour in case of UPSERT. Here, when there is
> conflict found, ExecOnConflictUpdate() is called, and then the
> function returns immediately, which means AR INSERT trigger will not
> fire. And ExecOnConflictUpdate() calls ExecUpdate(), which means BR
> and AR UPDATE triggers will be fired. So in short, when an INSERT
> becomes an UPDATE, BR INSERT triggers do fire, but then the BR UPDATE
> and AR UPDATE also get fired. On the same lines, it makes sense in
> case of UPDATE=>DELETE+INSERT operation to fire a BR UPDATE trigger on
> the original table, and then the BR and AR DELETE/INSERT triggers on
> the respective tables.
>

I am not sure if it is a good idea to compare it with "Insert On
Conflict Do Update", but even if we go that way, I think Insert On
Conflict is consistent for statement-level triggers, which means it
will fire both Insert and Update statement-level triggers (as per the
note in the docs below), whereas the documentation in the patch
indicates that this patch will only fire Update statement-level
triggers, which is odd.

Note in the docs about Insert On Conflict:
"Note that with an INSERT with an ON CONFLICT DO UPDATE clause, both
INSERT and UPDATE statement level trigger will be fired."



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Remove pre-10 compatibility code in hash index

2017-05-10 Thread Amit Kapila
On Wed, May 10, 2017 at 9:17 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, May 9, 2017 at 1:41 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Commit ea69a0dead5128c421140dc53fac165ba4af8520 has bumped the hash
>> index version and obviates the need for backward compatibility code
>> added by commit 293e24e507838733aba4748b514536af2d39d7f2.  The same
>> has been mentioned in the commit message, please find attached patch
>> to remove the pre-10 compatibility code in hash index.
>
> Committed.
>

Thanks!

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] bumping HASH_VERSION to 3

2017-05-10 Thread Amit Kapila
On Tue, Apr 11, 2017 at 11:53 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Fri, Mar 31, 2017 at 02:48:05PM -0400, Tom Lane wrote:
>> +1, as long as we're clear on what will happen when pg_upgrade'ing
>> an installation containing hash indexes.  I think a minimum requirement is
>> that it succeed and be able to start up, and allow the user to manually
>> REINDEX such indexes afterwards.  Bonus points for:
>>
>> 1. teaching pg_upgrade to create a script containing the required REINDEX
>> commands.  (I think it's produced scripts for similar requirements in the
>> past.)
>>
>> 2. marking the index invalid so that the system would silently ignore it
>> until it's been reindexed.  I think there might be adequate infrastructure
>> for that already thanks to REINDEX CONCURRENTLY, and it'd just be a matter
>> of getting pg_upgrade to hack the indexes' catalog state.  (If not, it's
>> probably not worth the trouble.)
>
> We already have code to do all of that, but it was removed from
> pg_upgrade in 9.5.  You can still see it in 9.4:
>
> 
> contrib/pg_upgrade/version_old_8_3.c::old_8_3_invalidate_hash_gin_indexes()
>

Thanks for the pointer.

> I would be happy to restore that code and make it work for PG 10.
>

The attached patch implements the two points suggested by Tom.  I will
add this to the PG 10 open issues list so that we don't forget about
this work; let me know if you think otherwise.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


pg_upgrade_invalidate_old_hash_index_v1.patch
Description: Binary data



[HACKERS] Remove pre-10 compatibility code in hash index

2017-05-08 Thread Amit Kapila
Commit ea69a0dead5128c421140dc53fac165ba4af8520 has bumped the hash
index version and obviates the need for the backward compatibility code
added by commit 293e24e507838733aba4748b514536af2d39d7f2.  The same has
been mentioned in the commit message; please find attached a patch to
remove the pre-10 compatibility code in hash indexes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


remove_pre10_compat_hash_index_v1.patch
Description: Binary data



Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-08 Thread Amit Kapila
On Mon, May 8, 2017 at 6:30 PM, Dmitriy Sarafannikov
<dsarafanni...@yandex.ru> wrote:
>
> I think we can use RecentGlobalDataXmin for non-catalog relations and
> RecentGlobalXmin for catalog relations (probably a check similar to
> what we have in heap_page_prune_opt).
>
>
> I took the check from heap_page_prune_opt (maybe this check should be
> made a separate function?)
> But it requires initializing the snapshot for a specific relation.
>

+ else \
+ (snapshotdata).xmin = \
+ TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin, \
+ relation); \

I think we don't need to use TransactionIdLimitedForOldSnapshots() as
that is required to override xmin for table vacuum/pruning purposes.

> Maybe we need
> to use GetOldestXmin()?

It is a costly call as it needs ProcArrayLock, so I don't think it
makes sense to use it here.
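
Putting both points together, a rough sketch of the macro (just an
illustration; the helper names are the ones heap_page_prune_opt() uses,
and passing the relation into the macro is only one possible shape)
could be:

#define InitNonVacuumableSnapshot(snapshotdata, relation)  \
    do { \
        (snapshotdata).satisfies = HeapTupleSatisfiesNonVacuumable; \
        if (IsCatalogRelation(relation) || \
            RelationIsAccessibleInLogicalDecoding(relation)) \
            (snapshotdata).xmin = RecentGlobalXmin; \
        else \
            (snapshotdata).xmin = RecentGlobalDataXmin; \
    } while (0)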


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Atomics for heap_parallelscan_nextpage()

2017-05-08 Thread Amit Kapila
On Sat, May 6, 2017 at 7:27 PM, David Rowley
<david.row...@2ndquadrant.com> wrote:
>
> I've also noticed that both the atomics patch and unpatched master do
> something that looks a bit weird with synchronous seq-scans. If the
> parallel seq-scan piggybacked on another scan, then subsequent
> parallel scans will start at the same non-zero block location, even
> when no other concurrent scans exist.
>

That is what we expect.  The same is true for non-parallel scans as
well; refer to the code below for non-parallel scans.

heapgettup_pagemode()
{
..

finished = (page == scan->rs_startblock) ||
(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);


/*
* Report our new scan position for synchronization purposes. We
* don't do that when moving backwards, however. That would just
* mess up any other forward-moving scanners.
*
* Note: we do this before checking for end of scan so that the
* final state of the position hint is back at the start of the
* rel.  That's not strictly necessary, but otherwise when you run
* the same query multiple times the starting position would shift
* a little bit backwards on every invocation, which is confusing.
* We don't guarantee any specific ordering in general, though.
*/
if (scan->rs_syncscan)
ss_report_location(scan->rs_rd, page);
..
}


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-05-06 Thread Amit Kapila
On Sat, May 6, 2017 at 10:41 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, May 6, 2017 at 12:55 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> Oh!  Good catch.  Given that the behavior in question is intentional
>> there and intended to provide backward compatibility, changing it in
>> back-branches doesn't seem like a good plan.  I'll adjust the patch so
>> that it continues to ignore errors in that case.
>
> Updated patch attached.
>

Patch looks good to me.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-06 Thread Amit Kapila
On Fri, May 5, 2017 at 1:28 PM, Dmitriy Sarafannikov
<dsarafanni...@yandex.ru> wrote:
> Amit, thanks for comments!
>
>> 1.
>> +#define InitNonVacuumableSnapshot(snapshotdata)  \
>> + do { \
>> + (snapshotdata).satisfies = HeapTupleSatisfiesNonVacuumable; \
>> + (snapshotdata).xmin = RecentGlobalDataXmin; \
>> + } while(0)
>> +
>>
>> Can you explain and add comments why you think RecentGlobalDataXmin is
>> the right to use it here?  As of now, it seems to be only used for
>> pruning non-catalog tables.
>
> Can you explain to me what value for xmin should be used there?
>

I think we can use RecentGlobalDataXmin for non-catalog relations and
RecentGlobalXmin for catalog relations (probably a check similar to
what we have in heap_page_prune_opt).


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-05-05 Thread Amit Kapila
On Sat, May 6, 2017 at 4:41 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, May 5, 2017 at 9:32 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Fri, May 5, 2017 at 11:15 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> I am not saying to rip those changes.  Your most of the mail stresses
>>> about the 30-second timeout which you have added in the patch to
>>> detect timeouts during failures (rollback processing).  I have no
>>> objection to that part of the patch except that still there is a race
>>> condition around PQsendQuery and you indicate that it is better to
>>> deal it as a separate patch to which I agree.
>>
>> OK.
>>
>>> The point of which I am
>>> not in total agreement is about the failure other than the time out.
>>>
>>> +pgfdw_exec_cleanup_query(PGconn *conn, const char *query)
>>> {
>>> ..
>>> + /* Get the result of the query. */
>>> + if (pgfdw_get_cleanup_result(conn, endtime, &result))
>>> + return false;
>>> +
>>> + /* Issue a warning if not successful. */
>>> + if (PQresultStatus(result) != PGRES_COMMAND_OK)
>>> + {
>>> + pgfdw_report_error(WARNING, result, conn, true, query);
>>> + return false;
>>> + }
>>> ..
>>> }
>>>
>>> In the above code, if there is a failure due to timeout then it will
>>> return from first statement (pgfdw_get_cleanup_result), however, if
>>> there is any other failure, then it will issue a warning and return
>>> false.  I am not sure if there is a need to return false in the second
>>> case, because even without that it will work fine (and there is a
>>> chance that it won't drop the connection), but maybe your point is that it
>>> is better to be consistent for all kinds of error handling in this path.
>>
>> There are three possible queries that can be executed by
>> pgfdw_exec_cleanup_query; let's consider them individually.
>>
>> 1. ABORT TRANSACTION.  If ABORT TRANSACTION fails, I think that we
>> have no alternative but to close the connection.  If we don't, then
>> the next local connection that tries to use this connection will
>> probably also fail, because it will find the connection inside an
>> aborted transaction rather than not in a transaction.  If we do close
>> the connection, the next transaction will try to reconnect and
>> everything will be fine.
>>
>> 2. DEALLOCATE ALL.  If DEALLOCATE ALL fails, we do not have to close
>> the connection, but the reason why we're running that statement in the
>> first place is because we've potentially lost track of which
>> statements are prepared on the remote side.  If we don't drop the
>> connection, we'll still be confused about that.  Reconnecting will fix
>> it.
>>
>
> There is a comment in the code which explains why currently we ignore
> errors from DEALLOCATE ALL.
>
> * DEALLOCATE ALL only exists in 8.3 and later, so this
> * constrains how old a server postgres_fdw can
> * communicate with.  We intentionally ignore errors in
> * the DEALLOCATE, so that we can hobble along to some
> * extent with older servers (leaking prepared statements
> * as we go; but we don't really support update operations
> * pre-8.3 anyway).
>
> It is not entirely clear to me whether it is only for Commit case or
> in general.  From the code, it appears that the comment holds true for
> both commit and abort.  If it is for both, then after this patch, the
> above comment will not be valid and you might want to update it in
> case we want to proceed with your proposed changes for DEALLOCATE ALL
> statement.
>

Apart from this, I don't see any other problem with the patch.
Additionally, I have run the regression tests with the patch and
everything passed.

As a side note, why do we want the 30-second timeout only for
rollback-related commands and not for Commit and Deallocate All as
well?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-05-05 Thread Amit Kapila
On Fri, May 5, 2017 at 9:32 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, May 5, 2017 at 11:15 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> I am not saying to rip those changes.  Your most of the mail stresses
>> about the 30-second timeout which you have added in the patch to
>> detect timeouts during failures (rollback processing).  I have no
>> objection to that part of the patch except that still there is a race
>> condition around PQsendQuery and you indicate that it is better to
>> deal it as a separate patch to which I agree.
>
> OK.
>
>> The point of which I am
>> not in total agreement is about the failure other than the time out.
>>
>> +pgfdw_exec_cleanup_query(PGconn *conn, const char *query)
>> {
>> ..
>> + /* Get the result of the query. */
>> + if (pgfdw_get_cleanup_result(conn, endtime, &result))
>> + return false;
>> +
>> + /* Issue a warning if not successful. */
>> + if (PQresultStatus(result) != PGRES_COMMAND_OK)
>> + {
>> + pgfdw_report_error(WARNING, result, conn, true, query);
>> + return false;
>> + }
>> ..
>> }
>>
>> In the above code, if there is a failure due to timeout then it will
>> return from first statement (pgfdw_get_cleanup_result), however, if
>> there is any other failure, then it will issue a warning and return
>> false.  I am not sure if there is a need to return false in the second
>> case, because even without that it will work fine (and there is a
>> chance that it won't drop the connection), but maybe your point is that it
>> is better to be consistent for all kinds of error handling in this path.
>
> There are three possible queries that can be executed by
> pgfdw_exec_cleanup_query; let's consider them individually.
>
> 1. ABORT TRANSACTION.  If ABORT TRANSACTION fails, I think that we
> have no alternative but to close the connection.  If we don't, then
> the next local connection that tries to use this connection will
> probably also fail, because it will find the connection inside an
> aborted transaction rather than not in a transaction.  If we do close
> the connection, the next transaction will try to reconnect and
> everything will be fine.
>
> 2. DEALLOCATE ALL.  If DEALLOCATE ALL fails, we do not have to close
> the connection, but the reason why we're running that statement in the
> first place is because we've potentially lost track of which
> statements are prepared on the remote side.  If we don't drop the
> connection, we'll still be confused about that.  Reconnecting will fix
> it.
>

There is a comment in the code which explains why currently we ignore
errors from DEALLOCATE ALL.

* DEALLOCATE ALL only exists in 8.3 and later, so this
* constrains how old a server postgres_fdw can
* communicate with.  We intentionally ignore errors in
* the DEALLOCATE, so that we can hobble along to some
* extent with older servers (leaking prepared statements
* as we go; but we don't really support update operations
* pre-8.3 anyway).

It is not entirely clear to me whether it applies only to the Commit
case or in general.  From the code, it appears that the comment holds
true for both commit and abort.  If it is for both, then after this
patch the above comment will no longer be valid, and you might want to
update it if we proceed with your proposed changes for the DEALLOCATE
ALL statement.

> 3. ROLLBACK TO SAVEPOINT %d; RELEASE SAVEPOINT %d.  If this fails, the
> local toplevel transaction is doomed.  It would be wrong to try to
> commit on the remote side, because there could be remote changes made
> by the aborted subtransaction which must not survive, and the ROLLBACK
> TO SAVEPOINT %d command is essential in order for us to get rid of
> them, and that has already failed.  So we must eventually abort the
> remote transaction, which we can do either by closing the connection
> or by trying to execute ABORT TRANSACTION.  The former, however, is
> quick, and the latter is very possibly going to involve a long hang,
> so closing the connection seems better.
>
> In all of these cases, the failure of one of these statements seems to
> me to *probably* mean that the remote side is just dead, in which case
> dropping the connection is the only option.  But even if the remote
> side is still alive and the statement failed for some other reason --
> say somebody changed kwlist.h so that ABORT now has to be spelled
> ABURT -- returning true rather than false just causes us to end up
> with a remote connection whose state is out of step with the local
> server state, and such a connection is not going to be usable.  I
> think part of your point is that in case 3, we might still get back to
> a usable connection if ABORT TRANSACTION is successf

Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-05-05 Thread Amit Kapila
On Fri, May 5, 2017 at 7:44 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, May 5, 2017 at 1:01 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> I have tried to verify above point and it seems to me in such a
>> situation the transaction status will be either PQTRANS_INTRANS or
>> PQTRANS_INERROR.  I have tried to mimic "out of memory" failure by
>> making PQmakeEmptyPGresult return NULL (malloc failure).  AFAIU, it
>> will make the state as PQTRANS_IDLE only when the statement is
>> executed successfully.
>
> Hrm.  OK, you might be right, but still I don't see why having the 30
> second timeout is bad.  People don't want a transaction rollback to
> hang for a long period of time waiting for ABORT TRANSACTION to run if
> we could've just dropped the connection to get the same effect.  Sure,
> the next transaction will then have to reconnect, but that takes a few
> tens of milliseconds; if you wait for one extra second to avoid that
> outcome, you come out behind even if you do manage to avoid the
> reconnect.  Now for a subtransaction you could argue that it's worth
> being more patient, because forcing a toplevel abort sucks.  But what
> are the chances that, after failing to respond to a ROLLBACK; RELEASE
> SAVEPOINT within 30 seconds, the remote side is going to wake up again
> and start behaving normally?  I'd argue that it's not terribly likely
> and most people aren't going to want to wait for multiple minutes to
> see whether it does, but I suppose we could impose the 30 second
> timeout only at the toplevel and let subtransactions wait however long
> the operating system takes to time out.
>
>>>> Also, does it matter if pgfdw_subxact_callback fails
>>>> during execution of "ROLLBACK TO SAVEPOINT", because anyway user needs
>>>> to execute rollback to savepoint command before proceeding and that
>>>> should be good enough?
>>>
>>> I don't understand this.  If pgfdw_subxact_callback doesn't roll the
>>> remote side back to the savepoint, how is it going to get done later?
>>> There's no code in postgres_fdw other than that code to issue the
>>> rollback.
>>>
>> It is done via toplevel transaction ABORT.  I think how it works is
>> after "ROLLBACK TO SAVEPOINT" failure, any next execution is going to
>> fail.  Let us say if we issue Commit after failure, it will try to
>> execute RELEASE SAVEPOINT which will fail and lead to toplevel
>> transaction ABORT.  Now if the toplevel transaction ABORT also fails,
>> it will drop the connection.
>
> Mmmph.  I guess that would work, but I think it's better to track it
> explicitly.  I tried this (bar is a foreign table):
>
> begin;
> select * from bar;
> savepoint s1;
> select * from bar;
> savepoint s2;
> select * from bar;
> savepoint s3;
> select * from bar;
> commit;
>
> On the remote side, the commit statement produces this:
>
> 2017-05-05 09:28:52.597 EDT [80263] LOG:  statement: RELEASE SAVEPOINT s4
> 2017-05-05 09:28:52.597 EDT [80263] LOG:  statement: RELEASE SAVEPOINT s3
> 2017-05-05 09:28:52.597 EDT [80263] LOG:  statement: RELEASE SAVEPOINT s2
> 2017-05-05 09:28:52.597 EDT [80263] LOG:  statement: COMMIT TRANSACTION
>
> But that's obviously silly.  Somebody could easily come along and
> optimize that by getting rid of the RELEASE SAVEPOINT commands which
> are completely unnecessary if we're going to commit the toplevel
> transaction anyway, and then the argument that you're making wouldn't
> be true any more.  It seems smart to me to explicitly track whether
> the remote side is known to be in a bad state rather than relying on
> the future sequence of commands to be such that we'll get the right
> result anyway.
>

Sure, tracking it explicitly might be a better way to deal with it,
but always dropping the connection in case of error might not be the
best thing to do.

>> Agreed, in such a situation toplevel transaction abort is the only way
>> out.  However, it seems to me that will anyway happen in current code
>> even if we don't try to force it via abort_cleanup_failure.  I
>> understand that it might be better to force it as you have done in the
>> patch, but not sure if it is good to drop the connection where
>> previously it could have done without it (for ex. if the first
>> statement Rollback To Savepoint fails, but ABORT succeeds).  I feel it
>> is worth considering whether such a behavior difference is okay as you
>> have proposed to backpatch it.
>
> Well, I think the whole point of this patch is that if one command on
> a connection fails, the chances of future commands succeeding 

Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-05 Thread Amit Kapila
On Thu, May 4, 2017 at 9:42 PM, Dmitriy Sarafannikov
<dsarafanni...@yandex.ru> wrote:
>
>> Maybe we need another type of snapshot that would accept any
>> non-vacuumable tuple.  I really don't want SnapshotAny semantics here,
>> but a tuple that was live more recently than the xmin horizon seems
>> like it's acceptable enough.  HeapTupleSatisfiesVacuum already
>> implements the right behavior, but we don't have a Snapshot-style
>> interface for it.
>
>
> I have tried to implement this new type of snapshot that accepts any
> non-vacuumable tuples.
> We have tried this patch in our load environment. And it has smoothed out
> and reduced magnitude of the cpu usage peaks.
> But this snapshot does not solve the problem completely.
>
> Patch is attached.

1.
+#define InitNonVacuumableSnapshot(snapshotdata)  \
+ do { \
+ (snapshotdata).satisfies = HeapTupleSatisfiesNonVacuumable; \
+ (snapshotdata).xmin = RecentGlobalDataXmin; \
+ } while(0)
+

Can you explain, and add comments on, why you think
RecentGlobalDataXmin is the right one to use here?  As of now, it
seems to be used only for pruning non-catalog tables.

2.
+bool
+HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer)
+{
+ return HeapTupleSatisfiesVacuum(htup, snapshot->xmin, buffer)
+ != HEAPTUPLE_DEAD;
+}
+

Add comments on top of this function and, for the sake of consistency,
update the file header as well (the "Summary of visibility functions:"
list).  A rough sketch of possible comment wording for both points
follows.
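
Something along these lines, perhaps (the wording below is only a
sketch from my side, not text from the patch; the code itself is as
posted above):

/*
 * InitNonVacuumableSnapshot
 *    Initialize snapshot fields so that "satisfies" means "not yet
 *    removable by VACUUM".  RecentGlobalDataXmin is used as the cutoff
 *    on the assumption that the snapshot is applied only to user
 *    (non-catalog) relations, matching the horizon used when pruning
 *    such tables.
 */
#define InitNonVacuumableSnapshot(snapshotdata)  \
    do { \
        (snapshotdata).satisfies = HeapTupleSatisfiesNonVacuumable; \
        (snapshotdata).xmin = RecentGlobalDataXmin; \
    } while(0)

/*
 * HeapTupleSatisfiesNonVacuumable
 *
 *    True if the tuple might still be visible to somebody, i.e. it is
 *    not surely dead and removable.  Piggybacks on
 *    HeapTupleSatisfiesVacuum, using snapshot->xmin as the OldestXmin
 *    cutoff.
 */
bool
HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
                                Buffer buffer)
{
    return HeapTupleSatisfiesVacuum(htup, snapshot->xmin, buffer)
        != HEAPTUPLE_DEAD;
}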

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change GetLastImportantRecPtr's definition? (wasSkip checkpoints, archiving on idle systems.)

2017-05-05 Thread Amit Kapila
On Fri, May 5, 2017 at 11:57 AM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> On 2017-05-05 11:50:12 +0530, Amit Kapila wrote:
>> I see that EndPos can be updated in one of the cases after releasing
>> the lock (refer below code). Won't that matter?
>
> I can't see how it'd in the cases that'd matter for
> GetLastImportantRecPtr() - but it'd probably good to note it in the
> comment.
>

I think it should matter for any record which is not tagged as
XLOG_MARK_UNIMPORTANT, but maybe I am missing something, in which case
the comment should be fine.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change GetLastImportantRecPtr's definition? (wasSkip checkpoints, archiving on idle systems.)

2017-05-05 Thread Amit Kapila
On Fri, May 5, 2017 at 11:43 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2017-05-05 11:04:14 +0530, Amit Kapila wrote:
>> On Fri, May 5, 2017 at 6:54 AM, Andres Freund <and...@anarazel.de> wrote:
>> > Hi,
>> >
>> > On 2016-12-22 19:33:30 +, Andres Freund wrote:
>> >> Skip checkpoints, archiving on idle systems.
>> >
>> > As part of an independent bugfix I noticed that Michael & I appear to
>> > have introduced an off-by-one here. A few locations do comparisons like:
>> > /*
>> >  * Only log if enough time has passed and interesting records 
>> > have
>> >  * been inserted since the last snapshot.
>> >  */
>> > if (now >= timeout &&
>> > last_snapshot_lsn < GetLastImportantRecPtr())
>> > {
>> > last_snapshot_lsn = LogStandbySnapshot();
>> > ...
>> >
>> > which looks reasonable on its face.  But LogStandbySnapshot (via 
>> > XLogInsert())
>> >  * Returns XLOG pointer to end of record (beginning of next record).
>> >  * This can be used as LSN for data pages affected by the logged action.
>> >  * (LSN is the XLOG point up to which the XLOG must be flushed to disk
>> >  * before the data page can be written out.  This implements the basic
>> >  * WAL rule "write the log before the data".)
>> >
>> > and GetLastImportantRecPtr
>> >  * GetLastImportantRecPtr -- Returns the LSN of the last important record
>> >  * inserted. All records not explicitly marked as unimportant are 
>> > considered
>> >  * important.
>> >
>> > which means that we'll e.g. not notice if there's exactly a *single* WAL
>> > record since the last logged snapshot (and likely similar in the other
>> > users of GetLastImportantRecPtr()), because XLogInsert() will return
>> > where the next record will most of the time be inserted, and
>> > GetLastImportantRecPtr() returns the beginning of said record.
>> >
>> > This is trivially fixable by replacing < with <=.  But I wonder if the
>> > better fix would be to redefine GetLastImportantRecPtr() to point to the
>> > end of the record, too?
>> >
>>
>> If you think it is straightforward to note the end of the record, then
>> that sounds sensible thing to do.  However, note that we remember the
>> position based on lockno and lock is released before we can determine
>> the EndPos.
>
> I'm not sure I'm following:
>
> XLogRecPtr
> XLogInsertRecord(XLogRecData *rdata,
>  XLogRecPtr fpw_lsn,
>  uint8 flags)
> {
> ...
> /*
>  * Unless record is flagged as not important, update LSN of 
> last
>  * important record in the current slot. When holding all 
> locks, just
>  * update the first one.
>  */
> if ((flags & XLOG_MARK_UNIMPORTANT) == 0)
> {
> int lockno = holdingAllLocks ? 0 : MyLockNo;
>
> WALInsertLocks[lockno].l.lastImportantAt = StartPos;
> }
>
> is the relevant bit - what prevents us from just using EndPos instead?
>

I see that EndPos can be updated in one of the cases after releasing
the lock (refer to the code below). Won't that matter?

/*
 * Even though we reserved the rest of the segment for us, which is
 * reflected in EndPos, we return a pointer to just the end of the
 * xlog-switch record.
 */

if (inserted)
{
    EndPos = StartPos + SizeOfXLogRecord;
    if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
    {
        if (EndPos % XLOG_SEG_SIZE == EndPos % XLOG_BLCKSZ)
            EndPos += SizeOfXLogLongPHD;
        else
            EndPos += SizeOfXLogShortPHD;
    }
}




-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)

2017-05-05 Thread Amit Kapila
On Fri, May 5, 2017 at 7:07 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2017-05-02 15:13:58 -0400, Robert Haas wrote:
>> On Tue, Apr 18, 2017 at 2:48 AM, Amit Khandekar <amitdkhan...@gmail.com> 
>> wrote:
>> The main things that keeps this from being a crippling issue right now
>> is the fact that we tend not to use that many parallel workers in the
>> first place.  We're trying to scale a query that would otherwise use 1
>> process out to 3 or 5 processes, and so the contention effects, in
>> many cases, are not too bad.  Multiple people (including David Rowley
>> as well as folks here at EnterpriseDB) have demonstrated that for
>> certain queries, we can actually use a lot more workers and everything
>> works great.  The problem is that for other queries, using a lot of
>> workers works terribly.  The planner doesn't know how to figure out
>> which it'll be - and honestly, I don't either.
>
> Have those benchmarks, even in a very informal form, been shared /
> collected / referenced centrally?
>

The numbers have been posted on the parallel seq. scan thread and more
formally shared in a PGCon presentation ([1], refer to slide 15).

> I'd be very interested to know where
> the different contention points are. Possibilities:
>
> - in non-resident workloads: too much concurrent IOs submitted, leading
>   to overly much random IO
> - internal contention in the the parallel node, e.g. parallel seqscan
>

I think one of the points of scaling/contention is tuple
communication.  This is what shows up in perf profiles, and we (one of
my colleagues is working on it) are already looking at some ways to
improve it, but I don't think we can get anywhere near linear scaling
by improving that alone.


[1] - https://www.pgcon.org/2015/schedule/events/785.en.html

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change GetLastImportantRecPtr's definition? (wasSkip checkpoints, archiving on idle systems.)

2017-05-04 Thread Amit Kapila
On Fri, May 5, 2017 at 6:54 AM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> On 2016-12-22 19:33:30 +, Andres Freund wrote:
>> Skip checkpoints, archiving on idle systems.
>
> As part of an independent bugfix I noticed that Michael & I appear to
> have introduced an off-by-one here. A few locations do comparisons like:
> /*
>  * Only log if enough time has passed and interesting records have
>  * been inserted since the last snapshot.
>  */
> if (now >= timeout &&
> last_snapshot_lsn < GetLastImportantRecPtr())
> {
> last_snapshot_lsn = LogStandbySnapshot();
> ...
>
> which looks reasonable on its face.  But LogStandbySnapshot (via XLogInsert())
>  * Returns XLOG pointer to end of record (beginning of next record).
>  * This can be used as LSN for data pages affected by the logged action.
>  * (LSN is the XLOG point up to which the XLOG must be flushed to disk
>  * before the data page can be written out.  This implements the basic
>  * WAL rule "write the log before the data".)
>
> and GetLastImportantRecPtr
>  * GetLastImportantRecPtr -- Returns the LSN of the last important record
>  * inserted. All records not explicitly marked as unimportant are considered
>  * important.
>
> which means that we'll e.g. not notice if there's exactly a *single* WAL
> record since the last logged snapshot (and likely similar in the other
> users of GetLastImportantRecPtr()), because XLogInsert() will return
> where the next record will most of the time be inserted, and
> GetLastImportantRecPtr() returns the beginning of said record.
>
> This is trivially fixable by replacing < with <=.  But I wonder if the
> better fix would be to redefine GetLastImportantRecPtr() to point to the
> end of the record, too?
>

If you think it is straightforward to note the end of the record, then
that sounds like a sensible thing to do.  However, note that we
remember the position based on lockno, and the lock is released before
we can determine the EndPos.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-05-04 Thread Amit Kapila
On Thu, May 4, 2017 at 10:40 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, May 4, 2017 at 1:04 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Thu, May 4, 2017 at 10:18 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Thu, May 4, 2017 at 12:18 PM, Amit Kapila <amit.kapil...@gmail.com> 
>>> wrote:
>>>> As soon as the first command fails due to timeout, we will set
>>>> 'abort_cleanup_failure' which will make a toplevel transaction to
>>>> abort and also won't allow other statements to execute.  The patch is
>>>> trying to enforce a 30-second timeout after statement execution, so it
>>>> has to anyway wait till the command execution completes (irrespective
>>>> of whether the command succeeds or fail).
>>>
>>> I don't understand what you mean by this.  If the command doesn't
>>> finish within 30 seconds, we *don't* wait for it to finish.
>>>
>>
>> + /*
>> + * Submit a query.  Since we don't use non-blocking mode, this also can
>> + * block.  But its risk is relatively small, so we ignore that for now.
>> + */
>> + if (!PQsendQuery(conn, query))
>> + {
>> + pgfdw_report_error(WARNING, NULL, conn, false, query);
>> + return false;
>> + }
>> +
>> + /* Get the result of the query. */
>> + if (pgfdw_get_cleanup_result(conn, endtime, &result))
>> + return false;
>>
>> The 30 seconds logic is in function pgfdw_get_cleanup_result, can't
>> the command hang in PQsendQuery?
>
> Sure.  I thought about trying to fix that too, by using
> PQsetnonblocking(), but I thought the patch was doing enough already.
>

Okay, it seems better to deal with it in a separate patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-05-04 Thread Amit Kapila
On Thu, May 4, 2017 at 8:08 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, May 4, 2017 at 7:13 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> In pgfdw_xact_callback, if the execution of ABORT TRANSACTION fails
>> due to any reason then I think it will close the connection.  The
>> relavant code is:
>> if (PQstatus(entry->conn) != CONNECTION_OK ||
>> PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
>>
>> Basically, if abort transaction fails then transaction status won't be
>> PQTRANS_IDLE.
>
> I don't think that's necessarily true.  Look at this code:
>
>  /* Rollback all remote subtransactions during abort */
>  snprintf(sql, sizeof(sql),
>   "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d",
>   curlevel, curlevel);
>  res = PQexec(entry->conn, sql);
>  if (PQresultStatus(res) != PGRES_COMMAND_OK)
>  pgfdw_report_error(WARNING, res, entry->conn, true, sql);
>  else
>  PQclear(res);
>
> If that sql command somehow returns a result status other than
> PGRES_COMMAND_OK, like say due to a local OOM failure or whatever, the
> connection is PQTRANS_IDLE but the rollback wasn't actually done.
>

I have tried to verify the above point, and it seems to me that in
such a situation the transaction status will be either PQTRANS_INTRANS
or PQTRANS_INERROR.  I have tried to mimic an "out of memory" failure
by making PQmakeEmptyPGresult return NULL (malloc failure).  AFAIU,
the state is set to PQTRANS_IDLE only when the statement is executed
successfully.
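
For illustration, a minimal standalone libpq sketch of the check I
mean (check_conn_state() is a made-up name, the savepoint name is
arbitrary, and connection setup is omitted; this is not postgres_fdw
code):

#include <stdio.h>
#include <libpq-fe.h>

static void
check_conn_state(PGconn *conn)
{
    PGresult   *res = PQexec(conn, "ROLLBACK TO SAVEPOINT s1");

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "rollback failed: %s", PQerrorMessage(conn));
    PQclear(res);

    /*
     * Same test as in pgfdw_xact_callback(): anything other than an
     * idle, healthy connection means it should not be reused.
     */
    if (PQstatus(conn) != CONNECTION_OK ||
        PQtransactionStatus(conn) != PQTRANS_IDLE)
        fprintf(stderr, "connection not reusable; it would be dropped\n");
}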

>> Also, does it matter if pgfdw_subxact_callback fails
>> during execution of "ROLLBACK TO SAVEPOINT", because anyway user needs
>> to execute rollback to savepoint command before proceeding and that
>> should be good enough?
>
> I don't understand this.  If pgfdw_subxact_callback doesn't roll the
> remote side back to the savepoint, how is it going to get done later?
> There's no code in postgres_fdw other than that code to issue the
> rollback.
>

It is done via the toplevel transaction ABORT.  The way I think it
works is that after a "ROLLBACK TO SAVEPOINT" failure, any subsequent
execution is going to fail.  Say we issue Commit after the failure: it
will try to execute RELEASE SAVEPOINT, which will fail and lead to a
toplevel transaction ABORT.  Now if the toplevel transaction ABORT
also fails, it will drop the connection.

>> About patch:
>>
>> + /* Rollback all remote subtransactions during abort */
>> + snprintf(sql, sizeof(sql),
>> + "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d",
>> + curlevel, curlevel);
>> + if (!pgfdw_exec_cleanup_query(entry->conn, sql))
>> + abort_cleanup_failure = true;
>>
>> If the above execution fails for any reason, then it won't allow
>> even committing the main transaction, which I am not sure is the right
>> thing to do.
>
> Well, as I said in my reply to Tushar, I think it's the only correct
> thing to do.  Suppose you do the following:
>
> (1) make a change to the foreign table
> (2) enter a subtransaction
> (3) make another change to the foreign table
> (4) roll back the subtransaction
> (5) commit the transaction
>
> If the remote rollback gets skipped in step 4, it is dead wrong for
> step 5 to commit the remote transaction anyway.  Both the change made
> in step (1) and the change made in step (3) would get made on the
> remote side, which is 100% wrong.  Given the failure of the rollback
> in step 4, there is no way to achieve the desired outcome of
> committing the step (1) change and rolling back the step (3) change.
> The only way forward is to roll back everything - i.e. force a
> toplevel abort.
>

Agreed, in such a situation a toplevel transaction abort is the only
way out.  However, it seems to me that will happen anyway in the
current code even if we don't try to force it via
abort_cleanup_failure.  I understand that it might be better to force
it as you have done in the patch, but I am not sure it is good to drop
the connection where previously we could have done without it (for
example, if the first statement, Rollback To Savepoint, fails but
ABORT succeeds).  I feel it is worth considering whether such a
behavior difference is okay, given that you have proposed to backpatch
it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-05-04 Thread Amit Kapila
On Thu, May 4, 2017 at 10:18 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, May 4, 2017 at 12:18 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> As soon as the first command fails due to timeout, we will set
>> 'abort_cleanup_failure' which will make a toplevel transaction to
>> abort and also won't allow other statements to execute.  The patch is
>> trying to enforce a 30-second timeout after statement execution, so it
>> has to anyway wait till the command execution completes (irrespective
>> of whether the command succeeds or fail).
>
> I don't understand what you mean by this.  If the command doesn't
> finish within 30 seconds, we *don't* wait for it to finish.
>

+ /*
+ * Submit a query.  Since we don't use non-blocking mode, this also can
+ * block.  But its risk is relatively small, so we ignore that for now.
+ */
+ if (!PQsendQuery(conn, query))
+ {
+ pgfdw_report_error(WARNING, NULL, conn, false, query);
+ return false;
+ }
+
+ /* Get the result of the query. */
+ if (pgfdw_get_cleanup_result(conn, endtime, &result))
+ return false;

The 30-second logic is in the function pgfdw_get_cleanup_result; can't
the command hang in PQsendQuery?
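
For reference, one possible shape of a non-blocking submit that also
bounds the time we wait for the socket (a sketch only, not part of the
patch; the 30-second cap just mirrors the cleanup timeout being
discussed):

#include <stdbool.h>
#include <sys/select.h>
#include <libpq-fe.h>

static bool
send_query_nonblocking(PGconn *conn, const char *query)
{
    if (PQsetnonblocking(conn, 1) != 0)
        return false;           /* couldn't switch modes */

    if (!PQsendQuery(conn, query))
        return false;           /* query couldn't even be queued */

    /* Keep flushing until libpq has handed all data to the kernel. */
    for (;;)
    {
        int         rc = PQflush(conn);

        if (rc == 0)
            return true;        /* everything sent */
        if (rc < 0)
            return false;       /* hard failure */

        /* rc == 1: wait (up to 30 seconds) for the socket to be writable. */
        {
            int             sock = PQsocket(conn);
            fd_set          wfds;
            struct timeval  tv = {30, 0};

            FD_ZERO(&wfds);
            FD_SET(sock, &wfds);
            if (select(sock + 1, NULL, &wfds, NULL, &tv) <= 0)
                return false;   /* timeout or error */
        }
    }
}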



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-05-04 Thread Amit Kapila
On Thu, May 4, 2017 at 8:08 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, May 4, 2017 at 7:13 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
>>> - For bonus points, give pgfdw_exec_query() an optional timeout
>>> argument, and set it to 30 seconds or so when we're doing abort
>>> cleanup.  If the timeout succeeds, it errors out (which again
>>> re-enters the abort path, dropping the connection and forcing outer
>>> transaction levels to abort as well).
>>
>> Why do we need this 30 seconds timeout for abort cleanup failures?
>> Isn't it sufficient if we propagate the error only on failures?
>
> Well, the problem is that right now we're waiting for vast amounts of
> time when the remote connection dies.  First, the user notices that
> the command is running for too long and hits ^C.  At that point, the
> connection is probably already dead, and the user's been waiting for a
> while already.  Then, we wait for PQcancel() to time out.  Then, we
> wait for PQexec() to time out.  The result of that, as Suraj mentioned
> in the original email, is that it takes about 16 minutes for the
> session to recover when the connection dies, even with keepalive
> settings and connection timeout set all the way to the minimum.  The
> goal here is to make it take a more reasonable time to recover.
> Without modifying libpq, we can't do anything about the need to wait
> for PQcancel() to time out, but we can improve the rest of it.  If we
> removed that 30-second timeout, we would win in the case where either
> ABORT TRANSACTION or ROLLBACK; RELEASE eventually succeeds but takes
> more than 30 seconds.  In that case, the change you are proposing
> would allow us to reuse the connection instead of dropping it, and
> would prevent a forced toplevel abort when subtransactions are in use.
> However, the cost would be that when one of those commands never
> succeeds, we would wait a lot longer before finishing a transaction
> abort.
>

As soon as the first command fails due to a timeout, we will set
'abort_cleanup_failure', which will make the toplevel transaction
abort and also won't allow other statements to execute.  The patch is
trying to enforce a 30-second timeout after statement execution, so it
has to wait anyway until the command execution completes (irrespective
of whether the command succeeds or fails).  It is quite possible I am
missing some point; could you explain in somewhat more detail in which
exact case the 30-second timeout will allow us to abort the toplevel
transaction, given that we already do that when a statement fails?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] statement_timeout is not working as expected with postgres_fdw

2017-05-04 Thread Amit Kapila
About patch:

+ /* Rollback all remote subtransactions during abort */
+ snprintf(sql, sizeof(sql),
+ "ROLLBACK TO SAVEPOINT s%d; RELEASE SAVEPOINT s%d",
+ curlevel, curlevel);
+ if (!pgfdw_exec_cleanup_query(entry->conn, sql))
+ abort_cleanup_failure = true;

If the above execution fails for any reason, then it won't allow even
committing the main transaction, which I am not sure is the right
thing to do.  I have tried it by manually editing the above command in
the debugger, and the result is as below:

Create a foreign table and try below scenario.

postgres=# begin;
BEGIN
postgres=# savepoint s1;
SAVEPOINT
postgres=# insert into foreign_test values(8);
INSERT 0 1
postgres=# savepoint s2;
SAVEPOINT
postgres=# insert into foreign_test values(generate_series(1,100));
Cancel request sent
WARNING:  syntax error at or near "OOLLBACK"
ERROR:  canceling statement due to user request

-- In the debugger, I have changed ROLLBACK to mimic the failure.

postgres=# Rollback to savepoint s2;
ROLLBACK
postgres=# commit;
ERROR:  connection to server "foreign_server" was lost

Considering the above, I am not sure we should treat all failures
during a subtransaction abort as indicating a pending-cleanup state
and then disallow further commits.

 #include "utils/memutils.h"
-
+#include "utils/syscache.h"

Looks like spurious line removal

> 6. postgres_fdw doesn't use PQsetnonblocking() anywhere, so even if we
> converted pgfdw_xact_callback() and pgfdw_subxact_callback() to use
> pgfdw_exec_query() or something like it rather than PQexec() to submit
> queries, they might still block if we fail to send the query, as the
> comments for pgfdw_exec_query() explain.  This is not possibly but not
> particularly likely to happen for queries being sent out of error
> handling paths, because odds are good that the connection was sent
> just before.  However, if the user is pressing ^C because the remote
> server isn't responding, it's quite probable that we'll run into this
> exact issue.
>
> 7. postgres_fdw never considers slamming the connection shut as a
> response to trouble.  It seems pretty clear that this is only a safe
> response if we're aborting the toplevel transaction.  If we did it for
> a subtransaction, we'd end up reconnecting if the server were accessed
> again, which would at the very least change the snapshot (note that we
> use at least REPEATABLE READ on the remote side regardless of the
> local isolation level) and might discard changes made on the remote
> side at outer transaction levels.  Even for a top-level transaction,
> it might not always be the right way to proceed, as it incurs some
> overhead to reconnect, but it seems worth considering at least in the
> case where we've already got some indication that the connection is
> unhealthy (e.g. a previous attempt to clean up the connection state
> failed, as in #5, or PQcancel failed, as in Ashutosh's proposed
> patch).
>




> 8. Before 9.6, PQexec() is used in many more places, so many more
> queries are entirely non-interruptible.
>
> It seems pretty clear to me that we are not going to fix all of these
> issues in one patch.  Here's a sketch of an idea for how to start
> making things better:
>
> - Add an in_abort_cleanup flag to ConnCacheEntry.
>
> - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin abort
> cleanup, check whether the flag is set.  If so, slam the connection
> shut unless that's already been done; furthermore, if the flag is set
> and we're in pgfdw_xact_callback (i.e. this is a toplevel abort),
> forget about the connection entry entirely.  On the other hand, if the
> flag is not set, set it flag and attempt abort cleanup.  If we
> succeed, clear the flag.
>
> - Before pgfdw_xact_callback() or pgfdw_subxact_callback() begin
> pre-commit processing, check whether the flag is set.  If so, throw an
> ERROR, so that we switch over to abort processing.
>
> - Change uses of PQexec() in the abort path to use pgfdw_exec_query()
> instead.  If we exit pgfdw_exec_query() with an error, we'll re-enter
> the abort path, but now in_abort_cleanup will be set, so we'll just
> drop the connection (and force any outer transaction levels to abort
> as well).
>
> - For bonus points, give pgfdw_exec_query() an optional timeout
> argument, and set it to 30 seconds or so when we're doing abort
> cleanup.  If the timeout succeeds, it errors out (which again
> re-enters the abort path, dropping the connection and forcing outer
> transaction levels to abort as well).
>

Why do we need this 30-second timeout for abort cleanup failures?
Isn't it sufficient if we propagate the error only on failures?
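
Independently of that, the in_abort_cleanup idea sketched above would
amount to roughly the following shape (a sketch only; the fields other
than in_abort_cleanup follow the existing ConnCacheEntry in
contrib/postgres_fdw/connection.c and are abbreviated here):

typedef struct ConnCacheEntry
{
    ConnCacheKey key;           /* hash key (must be first) */
    PGconn     *conn;           /* connection to foreign server, or NULL */
    int         xact_depth;     /* 0 = no xact open, 1 = main xact, ... */
    bool        have_prep_stmt; /* have we prepared any stmts in this xact? */
    bool        have_error;     /* have any subxacts aborted in this xact? */
    bool        in_abort_cleanup;   /* proposed: abort cleanup started but
                                     * not yet known to have succeeded */
} ConnCacheEntry;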

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-01 Thread Amit Kapila
On Fri, Apr 28, 2017 at 10:02 PM, Dmitriy Sarafannikov
<dsarafanni...@yandex.ru> wrote:
>
> What I'm thinking of is the regular indexscan that's done internally
> by get_actual_variable_range, not whatever ends up getting chosen as
> the plan for the user query.  I had supposed that that would kill
> dead index entries as it went, but maybe that's not happening for
> some reason.
>
>
> Really, this happens as you said. Index entries are marked as dead.
> But after this, backends spends cpu time on skip this killed entries
> in _bt_checkkeys :
>

If that is the case, then how would using SnapshotAny solve this
problem?  We get the value from the index first and then check its
visibility in the heap, so if the time is spent in _bt_checkkeys, why
would using a different kind of snapshot solve the problem?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: ALTER SYSTEM [...] COMMENT

2017-05-01 Thread Amit Kapila
On Wed, Apr 26, 2017 at 10:33 PM, Joshua D. Drake <j...@commandprompt.com> 
wrote:
> I propose:
>
> Add a column to pg_settings comment(text)
> Change the grammar to allow:
>
> ALTER SYSTEM SET configuration_parameter { TO | = } { value | 'value' |
> DEFAULT } COMMENT 'comment'
>
> Example:
>
> ALTER SYSTEM SET maintenance_work_mem TO '1GB' COMMENT IS 'Increased to
> allow autovacuum to be more efficient';
>
> Potential issues:
>
> Does not use existing comment functionality. Alternate solution which would
> decrease functionality is:
>
> COMMENT ON SETTING setting IS 'comment';
>

I think syntax-wise it makes sense to have both forms in some way, as
you propose.  Providing COMMENT along with ALTER SYSTEM has the
advantage that the user can specify it at the same time as setting the
value, while providing a separate command makes it possible to add or
change the comment for a particular setting on its own.

Robert [1] and I [2] made some comments when this topic was previously
discussed which are worth looking at if you or someone else is
planning to write a patch for this feature.


[1] - 
https://www.postgresql.org/message-id/CA%2BTgmoZBDLhDexHyTJ%3DH0xZt7-6M%3Dh%2Bv2Xi0Myj7Q37QGZQb4g%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAA4eK1%2B%3DovYQqYGHcX2OBU3mk%2BhSHjFDpvmrHCos1Vgj8_b6vg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OK, so culicidae is *still* broken

2017-04-30 Thread Amit Kapila
On Mon, May 1, 2017 at 10:23 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>> Yeah, that's right.  Today, I have spent some time to analyze how and
>> where retry logic is required.  I think there are two places where we
>> need this retry logic, one is if we fail to reserve the memory
>> (pgwin32_ReserveSharedMemoryRegion) and second is if we fail to
>> reattach (PGSharedMemoryReAttach).
>
> I'm not following.  The point of the reserve operation is to ensure
> that the reattach will work.  What makes you think we need to add more
> code at that end?
>

We use VirtualAllocEx to allocate memory at a pre-defined address (the
reserve operation), which can fail due to ASLR.  One recent example of
such a symptom is seen on pgsql-bugs [1]; that failure is during the
reserve operation and looks like something related to ASLR.  Another
point is that commit 7f3e17b4827b61ad84e0774e3e43da4c57c4487f ("It
doesn't fail every time (which is explained by the Random part in
ASLR), but can fail with errors about failing to reserve shared memory
region") also indicates that we can fail to reserve memory due to ASLR.


[1] - https://www.postgresql.org/message-id/14121.1485360296%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OK, so culicidae is *still* broken

2017-04-30 Thread Amit Kapila
On Tue, Apr 25, 2017 at 7:37 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Craig Ringer <cr...@2ndquadrant.com> writes:
>> On 25 Apr. 2017 13:37, "Heikki Linnakangas" <hlinn...@iki.fi> wrote:
>>> For some data shared memory structures, that store no pointers, we wouldn't
>>> need to insist that they are mapped to the same address in every backend,
>>> though. In particular, shared_buffers. It wouldn't eliminate the problem,
>>> though, only make it less likely, so we'd still need to retry when it does
>>> happen.
>
>> Good point. Simply splitting out shared_buffers into a moveable segment
>> would make a massive difference. Much less chance of losing the dice roll
>> for mapping the fixed segment.
>
>> Should look at what else could be made cheaply relocatable too.
>
> I don't think it's worth spending any effort on.  We need the retry
> code anyway, and making it near impossible to hit that would only mean
> that it's very poorly tested.  The results upthread say that it isn't
> going to be hit often enough to be a performance problem, so why worry?
>

Yeah, that's right.  Today I have spent some time analyzing how and
where retry logic is required.  I think there are two places where we
need it: one is if we fail to reserve the memory
(pgwin32_ReserveSharedMemoryRegion), and the second is if we fail to
reattach (PGSharedMemoryReAttach).  If we fail during the reserve
operation, then we can simply terminate the process and recreate it,
basically adding some kind of loop in internal_forkexec().  OTOH, if
we fail during reattach, I think we need to exit the process, which
can lead to a restart of all the processes, unless we want to add some
special logic for handling that process exit, or alternatively just
retry the reattach operation in a loop before giving up.  Do we want
to keep this retry logic for a certain number of attempts, say 10, or
do we want to retry indefinitely?  A rough sketch of the bounded-retry
shape is below.
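
To make the question concrete, the bounded-retry variant I have in
mind looks roughly like this (all of the *_wrapper names are
hypothetical placeholders for the existing reserve/reattach/terminate
steps, not real functions):

#include <stdbool.h>

#define SHMEM_ATTACH_MAX_RETRIES 10     /* or loop indefinitely? */

static bool
start_backend_with_retry(void)
{
    int         attempt;

    for (attempt = 0; attempt < SHMEM_ATTACH_MAX_RETRIES; attempt++)
    {
        if (reserve_shared_memory_wrapper() &&  /* hypothetical */
            reattach_shared_memory_wrapper())   /* hypothetical */
            return true;

        /*
         * Reservation or reattach failed (e.g. an ASLR collision):
         * discard this child and create a fresh one from scratch.
         */
        terminate_child_wrapper();              /* hypothetical */
    }

    return false;       /* give up after a fixed number of attempts */
}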


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-27 Thread Amit Kapila
On Thu, Apr 27, 2017 at 7:53 PM, Marko Tiikkaja <ma...@joh.to> wrote:
> On Thu, Apr 27, 2017 at 4:13 PM, Bruce Momjian <br...@momjian.us> wrote:
>>
>> On Thu, Apr 27, 2017 at 08:00:28AM +0530, Amit Kapila wrote:
>> > > Oh, so non-correlated subqueries can be run in parallel.  Yes, that is
>> > > something we should have in the release notes.  How is this?
>> > >
>> > > Author: Robert Haas <rh...@postgresql.org>
>> > > 2017-02-14 [5e6d8d2bb] Allow parallel workers to execute
>> > > subplans.
>> > >
>> > > Allow non-correlated subqueries to be run in parallel (Amit
>> > > Kapila)
>> > >
>> >
>> > Looks good to me.
>
>
> It's not clear from this item what the previous behavior was.  How about
> adding something like "Correlated subqueries can not yet be parallelized."?
> (I'm guessing that's the change, anyway, please correct me if I'm mistaken.)
>

Prior to this change, queries containing a reference to subplans
(correlated or uncorrelated) did not get parallelized.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-26 Thread Amit Kapila
On Tue, Apr 25, 2017 at 8:59 AM, Bruce Momjian <br...@momjian.us> wrote:
> On Mon, Apr 24, 2017 at 11:05:41PM -0400, Bruce Momjian wrote:
>> > I think the above commit needs a separate mention, as this is a really
>> > huge step forward to control the size of hash indexes.
>>
>> Yes, it is unfotunate that the item is in the incompatibility item.  I
>> wonder if I should split out the need to rebuild the hash indexes and
>> keep it there and move this item into the "Index" section.
>
> Done, items split.
>


    

Improve efficiency of hash index growth (Amit Kapila, Mithun Cy)

The first two commits, b0f18cb77 and 30df93f69, were done as preliminary
work for "Add write-ahead logging support to hash indexes", so it seems
inappropriate to add them here.  We can add them along with the below item:

Add write-ahead logging support to hash indexes (Amit Kapila)

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-26 Thread Amit Kapila
On Wed, Apr 26, 2017 at 8:46 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Wed, Apr 26, 2017 at 07:38:05PM +0530, Amit Kapila wrote:
>> >> I have already mentioned the commit id (5e6d8d2b).  Text can be "Allow
>> >> queries containing subplans to execute in parallel".  We should also
>> >> mention in some way that this applies only when the query contains
>> >> uncorrelated subplan.
>> >
>> > Sorry but I don't know what that means, and if I don't know, others
>> > might not either.
>> >
>>
>> regression=# explain (costs off) select count(*) from tenk1 where
>> (two, four) not in (select hundred, thousand from tenk2 where
>> thousand > 100);
>>   QUERY PLAN
>> --
>>  Finalize Aggregate
>>->  Gather
>>  Workers Planned: 2
>>  ->  Partial Aggregate
>>->  Parallel Seq Scan on tenk1
>>  Filter: (NOT (hashed SubPlan 1))
>>  SubPlan 1
>>->  Seq Scan on tenk2
>>  Filter: (thousand > 100)
>> (9 rows)
>>
>
> Oh, so non-correlated subqueries can be run in parallel.  Yes, that is
> something we should have in the release notes.  How is this?
>
> Author: Robert Haas <rh...@postgresql.org>
> 2017-02-14 [5e6d8d2bb] Allow parallel workers to execute subplans.
>
> Allow non-correlated subqueries to be run in parallel (Amit Kapila)
>

Looks good to me.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-26 Thread Amit Kapila
On Tue, Apr 25, 2017 at 7:19 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Tue, Apr 25, 2017 at 09:00:45AM +0530, Amit Kapila wrote:
>> On Tue, Apr 25, 2017 at 8:35 AM, Bruce Momjian <br...@momjian.us> wrote:
>> > On Tue, Apr 25, 2017 at 08:30:50AM +0530, Amit Kapila wrote:
>> >> On Tue, Apr 25, 2017 at 7:01 AM, Bruce Momjian <br...@momjian.us> wrote:
>> >> > I have committed the first draft of the Postgres 10 release notes.  They
>> >> > are current as of two days ago, and I will keep them current.  Please
>> >> > give me any feedback you have.
>> >> >
>> >>
>> >> Some of the items which I feel could be added:
>> >>
>> >> 5e6d8d2bbbcace304450b309e79366c0da4063e4
>> >> Allow parallel workers to execute subplans.
>> >
>> > Uh, can you show me the commit on that and give some text ideas?
>> >
>>
>> I have already mentioned the commit id (5e6d8d2b).  Text can be "Allow
>> queries containing subplans to execute in parallel".  We should also
>> mention in some way that this applies only when the query contains
>> uncorrelated subplan.
>
> Sorry but I don't know what that means, and if I don't know, others
> might not either.
>

Let me try to explain by example:

Without this feature, queries that refer to subplans are executed
serially, like below:

regression=# explain (costs off) select count(*) from tenk1 where
(two, four) not in (select hundred, thousand from tenk2 where thousand
> 100);
QUERY PLAN
--
 Aggregate
   ->  Seq Scan on tenk1
 Filter: (NOT (hashed SubPlan 1))
 SubPlan 1
   ->  Seq Scan on tenk2
 Filter: (thousand > 100)
(6 rows)

After this feature, queries that refer to subplans can use parallel
plans, like below:

regression=# explain (costs off) select count(*) from tenk1 where
(two, four) not in (select hundred, thousand from tenk2 where
thousand > 100);
  QUERY PLAN
--
 Finalize Aggregate
   ->  Gather
 Workers Planned: 2
 ->  Partial Aggregate
   ->  Parallel Seq Scan on tenk1
 Filter: (NOT (hashed SubPlan 1))
 SubPlan 1
   ->  Seq Scan on tenk2
 Filter: (thousand > 100)
(9 rows)


Now, it won't use parallelism if there is a correlated subplan, like below:

Seq Scan on t1
   Filter: (SubPlan 1)
   SubPlan 1
   ->  Result
 One-Time Filter: (t1.k = 0)
 ->  Parallel Seq Scan on t2

The difference in this plan is that the SubPlan refers to the outer
relation t1.

Do the above examples help in understanding the feature?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-25 Thread Amit Kapila
On Tue, Apr 25, 2017 at 7:45 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Tue, Apr 25, 2017 at 09:38:55AM +0530, Rafia Sabih wrote:
>> On Tue, Apr 25, 2017 at 9:15 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> > Andres Freund <and...@anarazel.de> writes:
>> >> On 2017-04-24 23:37:42 -0400, Bruce Momjian wrote:
>> >>> I remember seeing those and those are normally details I do not put in
>> >>> the release notes as there isn't a clear user experience change except
>> >>> "Postgres is faster".  Yeah, a bummer, and I can change my filter, but
>> >>> it would require discussion.
>> >
>> >> I think "postgres is faster" is one of the bigger user demands, so I
>> >> don't think that policy makes much sense.  A large number of the changes
>> >> over the next few releases will focus solely on that.  Nor do I think
>> >> past release notes particularly filtered such changes out.
>> >
>> > I think it has been pretty common to accumulate a lot of such changes
>> > into generic entries like, say, "speedups for hash joins".  More detail
>> > than that simply isn't useful to end users; and as a rule, our release
>> > notes are too long anyway.
>> >
>> > regards, tom lane
>> >
>> >
>> Just wondering if the mention of commit
>> 0414b26bac09379a4cbf1fbd847d1cee2293c5e4 is missed? Not sure if this
>> requires a separate entry or could be merged with -- Support parallel
>> btree index scans.
>
> This item was merged into the general parallel index scan item:
>

Okay, but then shouldn't we add Rafia's name as well to that release
notes entry, as she is the author of one of the merged items (commit
0414b26bac09379a4cbf1fbd847d1cee2293c5e4), which has independent
value?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2017-04-25 Thread Amit Kapila
On Tue, Apr 25, 2017 at 2:09 PM, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
>
> I'm not good at composition, so I cannot insist on my
> proposal. For the convenience of others, here is the proposal
> from Fujii-san.
>

Do you see any problem with the below proposal?  To me, this sounds reasonable.

> + A quorum-based synchronous replication is basically more efficient than
> + a priority-based one when you specify multiple standbys in
> + synchronous_standby_names and want to replicate
> + the transactions to some of them synchronously. In this case,
> + the transactions in a priority-based synchronous replication must wait 
> for
> + reply from the slowest standby in synchronous standbys chosen based on
> + their priorities, and which may increase the transaction latencies.
> + On the other hand, using a quorum-based synchronous replication may
> + improve those latencies because it makes the transactions wait only for
> + replies from the requested number of faster standbys in all the listed
> + standbys, i.e., such slow standby doesn't block the transactions.
>

Can we make a few modifications, like:
improve those latencies --> reduce those latencies
such slow standby --> a slow standby

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-24 Thread Amit Kapila
On Tue, Apr 25, 2017 at 8:35 AM, Bruce Momjian <br...@momjian.us> wrote:
> On Tue, Apr 25, 2017 at 08:30:50AM +0530, Amit Kapila wrote:
>> On Tue, Apr 25, 2017 at 7:01 AM, Bruce Momjian <br...@momjian.us> wrote:
>> > I have committed the first draft of the Postgres 10 release notes.  They
>> > are current as of two days ago, and I will keep them current.  Please
>> > give me any feedback you have.
>> >
>>
>> Some of the items which I feel could be added:
>>
>> 5e6d8d2bbbcace304450b309e79366c0da4063e4
>> Allow parallel workers to execute subplans.
>
> Uh, can you show me the commit on that and give some text ideas?
>

I have already mentioned the commit id (5e6d8d2b).  The text can be
"Allow queries containing subplans to execute in parallel".  We should
also mention in some way that this applies only when the query
contains an uncorrelated subplan.

>> 61c2e1a95f94bb904953a6281ce17a18ac38ee6d
>> Improve access to parallel query from procedural languages.
>
> I think I have that:
>
> Increase parallel query usage in procedural language functions (Robert
> Haas)
>
>> In Parallel Queries section, we can add above two items as they
>> increase the usage of the parallel query in many cases.
>>
>> ea69a0dead5128c421140dc53fac165ba4af8520
>> Expand hash indexes more gradually.
>
> That is in this item:
>
> Improve hash bucket split performance by reducing locking requirements
> (Amit Kapila, Mithun Cy)
>
> Also cache hash index meta-information for faster lookups. Additional
> hash performance improvements have also been made. pg_upgrade'd hash
> indexes from previous major Postgres versions must be rebuilt.
>
> Can you suggest additional wording?
>

Allow hash indexes to expand slowly

This will help in controlling the size of hash indexes after the split.

>  I did merge many of the hash items
> into this so it would be understandable.  You can see the commits in the
> SGML source.
>
>> I think the above commit needs a separate mention, as this is a really
>> huge step forward to control the size of hash indexes.
>
> Yes, it is unfotunate that the item is in the incompatibility item.  I
> wonder if I should split out the need to rebuild the hash indexes and
> keep it there and move this item into the "Index" section.
>

That sounds sensible.




-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-24 Thread Amit Kapila
On Tue, Apr 25, 2017 at 8:30 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Tue, Apr 25, 2017 at 7:01 AM, Bruce Momjian <br...@momjian.us> wrote:
>> I have committed the first draft of the Postgres 10 release notes.  They
>> are current as of two days ago, and I will keep them current.  Please
>> give me any feedback you have.
>>
>
> Some of the items which I feel could be added:
>
> 5e6d8d2bbbcace304450b309e79366c0da4063e4
> Allow parallel workers to execute subplans.
>
> 61c2e1a95f94bb904953a6281ce17a18ac38ee6d
> Improve access to parallel query from procedural languages.
>
> In Parallel Queries section, we can add above two items as they
> increase the usage of the parallel query in many cases.
>

I see the second one in the release notes, but maybe we should credit
Rafia Sabih as well for that feature, as she was the original author
of the patch and helped fix the bugs that arose from that feature.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG 10 release notes

2017-04-24 Thread Amit Kapila
On Tue, Apr 25, 2017 at 7:01 AM, Bruce Momjian <br...@momjian.us> wrote:
> I have committed the first draft of the Postgres 10 release notes.  They
> are current as of two days ago, and I will keep them current.  Please
> give me any feedback you have.
>

Some of the items which I feel could be added:

5e6d8d2bbbcace304450b309e79366c0da4063e4
Allow parallel workers to execute subplans.

61c2e1a95f94bb904953a6281ce17a18ac38ee6d
Improve access to parallel query from procedural languages.

In Parallel Queries section, we can add above two items as they
increase the usage of the parallel query in many cases.

ea69a0dead5128c421140dc53fac165ba4af8520
Expand hash indexes more gradually.

I think the above commit needs a separate mention, as this is a really
big step forward in controlling the size of hash indexes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OK, so culicidae is *still* broken

2017-04-24 Thread Amit Kapila
On Fri, Apr 21, 2017 at 12:55 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2017-04-20 16:57:03 +0530, Amit Kapila wrote:
>> On Wed, Apr 19, 2017 at 9:01 PM, Andres Freund <and...@anarazel.de> wrote:
>> > On 2017-04-19 10:15:31 -0400, Tom Lane wrote:
>> >> Amit Kapila <amit.kapil...@gmail.com> writes:
>> >> > On Sun, Apr 16, 2017 at 3:04 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> >> >> Obviously, any such fix would be a lot more likely to be reliable in
>> >> >> 64-bit machines.  There's probably not enough daylight to be sure of
>> >> >> making it work in 32-bit Windows, so I suspect we'd need some retry
>> >> >> logic anyway for that case.
>> >>
>> >> > Yeah, that kind of thing can work assuming we don't get conflicts too
>> >> > often, but it could be possible that conflicts are not reported from
>> >> > ASLR enabled environments because of commit 7f3e17b4.
>> >>
>> >> Right, but Andres' point is that we should make an effort to undo that
>> >> hack and instead allow ASLR to happen.  Not just because it's allegedly
>> >> more secure, but because we may have no choice in future Windows versions.
>> >
>> > FWIW, I think it *also* might make us more secure, because addresses in
>> > the postgres binary won't be predictable anymore.
>> >
>>
>> Agreed.  I have done some further study by using VMMap tool in Windows
>> and it seems to me that all 64-bit processes use address range
>> (0001 ~ 07FE).  I have attached two screen
>> shots to show the address range (lower range and upper range).  You
>> need to refer the lower half of the window in attached screenshots.
>> At this stage, I am not completely sure whether we can assume some
>> address out of this range to use in MapViewOfFileEx.  Let me know your
>> thoughts.
>
> That seems like a fairly restricted range (good!), and squares with
> memories of reading about this (can't find the reference anymore).  Just
> using something like 0x0F00 as the address might work
> sufficiently well.  What happens if you just hardcode that in the first
> MapViewOfFileEx call, and change RandomizedBaseAddress="FALSE" in
> src/tools/msvc/VCBuildProject.pm to true?
>

Starting the server generates an error:
2017-04-24 12:02:05.771 IST [8940] FATAL:  could not create shared
memory segment: error code 87
2017-04-24 12:02:05.771 IST [8940] DETAIL:  Failed system call was
MapViewOfFileEx.

Error code 87 means "invalid parameter".  Some googling [1] indicates
that such an error occurs if we pass an out-of-range address to
MapViewOfFileEx.  Another possible theory is that we must pass an
address that is a multiple of the system's memory allocation
granularity, as the MapViewOfFileEx specification requires.  I can try
doing that, but I am not confident that this is the right approach
because of the following text in the MSDN documentation for
MapViewOfFileEx:
"While it is possible to specify an address that is safe now (not used
by the operating system), there is no guarantee that the address will
remain safe over time. Therefore, it is better to let the operating
system choose the address.".
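
If we do end up probing a fixed address, what I have in mind is
roughly the untested sketch below (the mapping handle name and the
candidate address are only placeholders, and the address is rounded up
to the allocation granularity reported by GetSystemInfo before use):

    SYSTEM_INFO sysinfo;
    uintptr_t   granularity;
    uintptr_t   candidate = (uintptr_t) 0x0F0000000000; /* placeholder */
    void       *addr;

    GetSystemInfo(&sysinfo);
    granularity = sysinfo.dwAllocationGranularity;

    /* round the candidate address up to an allocation-granularity boundary */
    candidate = (candidate + granularity - 1) & ~(granularity - 1);

    addr = MapViewOfFileEx(hFileMap, FILE_MAP_ALL_ACCESS, 0, 0, 0,
                           (LPVOID) candidate);
    if (addr == NULL)
    {
        /* e.g. error code 87: fall back to letting the OS pick the address */
        elog(DEBUG1, "MapViewOfFileEx at fixed address failed: error code %lu",
             GetLastError());
        addr = MapViewOfFileEx(hFileMap, FILE_MAP_ALL_ACCESS, 0, 0, 0, NULL);
    }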

I have printed the addresses that the system automatically picks for
MapViewOfFileEx (3a2, 377, 352, 372, 373), and they
seem to be in the range I have posted upthread for 64-bit processes.

Another thing I have tried is to just start the server with
RandomizedBaseAddress="TRUE".  I have tried about 15-20 times but
could not reproduce the problem related to shared memory attach.  We
have tried the same on a colleague's (Ashutosh Sharma's) machine as
well; there we could see that error once or twice out of many tries,
but couldn't get it consistently.  I think if the chances of this
problem occurring are that low, then the suggestion made by Tom to
retry if we get a collision doesn't sound bad.

[1] - 
https://support.microsoft.com/en-us/help/125713/common-file-mapping-problems-and-platform-differences

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dtrace probes

2017-04-20 Thread Amit Kapila
On Tue, Apr 18, 2017 at 9:38 PM, Jesper Pedersen
<jesper.peder...@redhat.com> wrote:
> Hi,
>
> The lwlock dtrace probes define LWLockMode as int, and the
> TRACE_POSTGRESQL_LWLOCK methods are called using both a variable and
> constant definition.
>
> This leads to a mix of argument definitions depending on the call site, as
> seen in probes.txt file.
>
> A fix is to explicit cast 'mode' to int such that all call sites will use
> the
>
>  argument #2 4 signed   bytes
>
> definition. Attached patch does this.
>

I think this fix is harmless and has some value in terms of
consistency.  One minor suggestion is that you should leave a space
after the typecast.

- TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
+ TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), (int)mode);

There should be a space like "(int) mode".


> I have verified all dtraces probes for their type, and only the lock__
> methods doesn't aligned with its actual types.
>

Do you see any problem with that?

>
> Depending on the feedback I can add this patch to the open item list in
> order to fix it for PostgreSQL 10.
>

Is there any commit in PG 10 that has caused this behavior?  If not,
then I don't think it should be added to the PG 10 open items.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] OK, so culicidae is *still* broken

2017-04-19 Thread Amit Kapila
On Sun, Apr 16, 2017 at 3:04 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andres Freund <and...@anarazel.de> writes:
>> On 2017-04-15 17:24:54 -0400, Tom Lane wrote:
>>> I wonder whether we could work around that by just destroying the created
>>> process and trying again if we get a collision.  It'd be a tad
>>> inefficient, but hopefully collisions wouldn't happen often enough to be a
>>> big problem.
>
>> That might work, although it's obviously not pretty.  We could also just
>> default to some out-of-the-way address for MapViewOfFileEx, that might
>> also work.
>
> Could be.  Does Microsoft publish any documentation about the range of
> addresses their ASLR uses?
>

I have looked around for information on whether there is any such
address range that could be used for our purpose, but I am not able
to find any predictable range.  You might want to read the article
[1], especially the text around "What is the memory address
space range in virtual memory map where system DLLs and user DLLs
could load?"   It seems to indicate that there is no such address
range, unless I have misunderstood it.  I don't deny the possibility
of having such an address range, but I could not find any info on the
same.

> Obviously, any such fix would be a lot more likely to be reliable in
> 64-bit machines.  There's probably not enough daylight to be sure of
> making it work in 32-bit Windows, so I suspect we'd need some retry
> logic anyway for that case.
>

Yeah, that kind of thing can work assuming we don't get conflicts too
often, but it could be possible that conflicts are not reported from
ASLR enabled environments because of commit 7f3e17b4.

[1] - 
https://blogs.msdn.microsoft.com/winsdk/2009/11/30/how-to-disable-address-space-layout-randomization-aslr/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate parallel-safety check for SubPlans

2017-04-18 Thread Amit Kapila
On Wed, Apr 19, 2017 at 1:27 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>> I have ended up doing something along the lines suggested by you (or
>> at least what I have understood from your e-mail).  Basically, pass
>> the safe-param-ids list to parallel safety function and decide based
>> on that if Param node reference in input expression is safe.
>
> I did not like changing the signature of is_parallel_safe() for this:
> that's getting a lot of call sites involved in something they don't care
> about, and I'm not even sure that it's formally correct.  The key thing
> here is that a Param could only be considered parallel-safe in context.
> That is, if you looked at the testexpr alone and asked if it could be
> pushed down to a worker, the answer would have to be "no".  Only when
> you look at the whole SubPlan tree and ask if it can be pushed down
> should the answer be "yes".  Therefore, I feel pretty strongly that
> what we want is for the safe_param_ids whitelist to be mutated
> internally in is_parallel_safe() as it descends the tree.  That way
> it can give the right answer when considering a fragment of a plan.
> I've committed a patch that does it like that.
>

Thanks.

>> ... Also, I think we don't need to check
>> parallel-safety for splan->args as that also is related to correlated
>> queries for which parallelism is not supported as of now.
>
> I think leaving that sort of thing out is just creating a latent bug
> that is certain to bite you later.  It's true that as long as the args
> list contains only Vars, it would never be parallel-unsafe --- but
> primnodes.h is pretty clear that one shouldn't assume that it will
> stay that way.

Sure, but the point I was trying to make is that whenever a subplan
has args, it won't be parallel-safe, as those args are used to pass
params.

>  So it's better to recurse over the whole tree rather
> than ignoring parts of it, especially if you're not going to document
> the assumption you're making anywhere.
>

No problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate parallel-safety check for SubPlans

2017-04-18 Thread Amit Kapila
On Mon, Apr 17, 2017 at 10:54 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>> On Sun, Apr 16, 2017 at 9:30 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> FYI, I have this on my to-look-at list, and expect to fix it before Robert
>>> returns from vacation.
>
>> Let me know if an initial patch by someone else can be helpful?
>
> Sure, have a go at it.  I won't get to this for a day or so ...
>

I have ended up doing something along the lines you suggested (or
at least what I have understood from your e-mail).  Basically, pass
the safe-param-ids list to the parallel-safety function and decide
based on that whether a Param node reference in the input expression
is safe.  Note that I have added a new paramids list parameter to
build_subplan to make the AlternativeSubPlan case consistent with the
SubPlan case, even though we should be able to do without that, as it
seems to be exercised only for the correlated EXISTS case.  Also, I
think we don't need to check parallel-safety for splan->args, as that
too is related to correlated queries, for which parallelism is not
supported as of now.

I have also explored checking the parallel-safety of testexpr before
PARAM_EXEC params are replaced in it; however, that won't work
because we add PARAM_SUBLINK params at parse-analysis time in
transformSubLink().  I am not sure it is a good idea to test the
parallel-safety of testexpr during the parse-analysis phase, so I
didn't pursue that path.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_parallel_safety_info_subplans_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate parallel-safety check for SubPlans

2017-04-16 Thread Amit Kapila
On Sun, Apr 16, 2017 at 9:30 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Noah Misch <n...@leadboat.com> writes:
>> On Wed, Apr 12, 2017 at 02:41:09PM -0400, Tom Lane wrote:
>>> This is 100% wrong.  It's failing to recurse into the subexpressions of
>>> the SubPlan, which could very easily include parallel-unsafe function
>>> calls.  Example:
>
>> The above-described topic is currently a PostgreSQL 10 open item.  Robert,
>> since you committed the patch believed to have created it, you own this open
>> item.  If some other commit is more relevant or if this does not belong as a
>> v10 open item, please let us know.
>
> FYI, I have this on my to-look-at list, and expect to fix it before Robert
> returns from vacation.
>

Let me know if an initial patch by someone else would be helpful.  I
was not completely sure whether we need to pass some sort of whitelist
of params for parallel safety, but if you think that is the better way
to go, I can give it a try.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Typo in htup_details.h

2017-04-15 Thread Amit Kapila
Attached patch to fix $SUBJECT.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


typo_htup_details.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] ERROR: badly formatted node string "RESTRICTINFO...

2017-04-12 Thread Amit Kapila
On Wed, Apr 12, 2017 at 10:30 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>> On Wed, Apr 12, 2017 at 12:40 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Anyone want to draft a patch for this?
>
>> Please find patch attached based on above discussion.
>
> This patch seems fairly incomplete: you can't just whack around major data
> structures like PlannedStmt and PlannerGlobal without doing the janitorial
> work of cleaning up support logic such as outfuncs/readfuncs.
>

Oops, missed it.

> Also, when you think about what will happen when doing a copyObject()
> on a completed plan, it seems like a pretty bad idea to link subplans
> into two independent lists.  We'll end up with two separate copies of
> those subtrees in places like the plan cache.
>
> I'm inclined to think the other approach of adding a parallel_safe
> flag to struct Plan is a better answer in the long run.  It's basically
> free to have it there because of alignment considerations, and it's
> hard to believe that we're not going to need that labeling at some
> point in the future anyway.
>
> I had been a bit concerned about having to have some magic in outfuncs
> and/or readfuncs to avoid transferring the unsafe subplan(s), but I see
> from your patch there's a better way: we can have ExecSerializePlan modify
> the subplan list as it transfers it into its private PlannedStmt struct.
> But I think iterating over the list and examining each subplan's
> parallel_safe marking is a better way to do that.
>
> Will work on this approach.
>

Thanks, I see that you have committed a patch along those lines.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate parallel-safety check for SubPlans

2017-04-12 Thread Amit Kapila
On Thu, Apr 13, 2017 at 1:05 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Wed, Apr 12, 2017 at 2:41 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> This is 100% wrong.  It's failing to recurse into the subexpressions of
>>> the SubPlan, which could very easily include parallel-unsafe function
>>> calls.
>
>> My understanding (apparently flawed?) is that the parallel_safe flag
>> on the SubPlan is supposed to encapsulate whether than SubPlan is
>> entirely parallel safe, so that no recursion is needed.  If the
>> parallel_safe flag on the SubPlan is being set wrong, doesn't that
>> imply that the Path from which the subplan was created had the wrong
>> value of this flag, too?
>
>> ...
>
>> Oh, I see.  The testexpr is separate from the Path.  Oops.
>
> Right.  Although you're nudging up against an alternate idea that
> I just had: we could define the parallel_safe flag on the SubPlan as
> encapsulating not only whether the child plan is safe, but whether
> the contained testexpr (and args) are safe.  If we were to make
> an is_parallel_safe() check on the testexpr before injecting
> PARAM_EXEC Params into it, and record the results in the SubPlan,
> maybe that would lead to a simpler answer.
>
> OTOH, that might not work out nicely; and I have a feeling that
> we'd end up needing a whitelist anyway later, to handle the
> case of correlated subplans.
>

The problem with supporting correlated subplans is that it is not easy
to decide which params belong in the whitelist, because we need to
ensure that such params are available below the Gather node.  It is
quite clear how a whitelist of params can be useful for the case at
hand, but it is not immediately clear to me how we would use it for
correlated subplans.  However, if you have ideas on how to make it
work, I can try to draft a patch along those lines as well, as it can
certainly help some TPC-H queries.

>> Sounds reasonable.  Do you want to have a go at that?
>
> Yeah.  I'm knocking off for the day a bit early today, but I'll have
> a look at it tomorrow, unless anyone in the Far East beats me to it.
>

Thanks for looking into it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TAP tests take a long time

2017-04-12 Thread Amit Kapila
On Tue, Apr 11, 2017 at 9:38 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Tue, Apr 11, 2017 at 11:32 AM, Andrew Dunstan
>> <andrew.duns...@2ndquadrant.com> wrote:
>>> This buildfarm run as you can see takes 33m32s, and the Tap tests take a
>>> combined 19m52s of that time.
>
>> I don't think it's quite fair to complain about the TAP tests taking a
>> long time to run as a general matter.  Many people here have long
>> wanted a separate test-suite for long running tests that can be run to
>> really check everything possible, and apparently, TAP tests are it.
>
>> What I think would be more useful is to break down where the time is
>> getting spent.  It may be that some of those tests are not adding
>> value proportionate to their runtime.
>
> The other thing that might be useful here is to push on parallelizing
> buildfarm runs.  Admittedly that will do nothing for the oldest and
> slowest buildfarm critters, but for reasonably modern hardware the
> serialization of the tests really handicaps you.  We seem to have
> fixed that for manual application of "make check-world", at least
> if you know the right magic incantations to parallelize it; but
> AFAIK the buildfarm script is entirely serial.
>
> BTW, speaking of "value proportionate to runtime", the hash_index
> regression script is now visibly a bottleneck in the core regression
> tests.  Could we take another look at whether that needs to run
> quite so long?
>

I have looked into the tests and I think we can do some optimization
without losing much code coverage.  First, we are doing both VACUUM
FULL and VACUUM on hash_split_heap in the same test after executing a
few statements; it seems to me that we can avoid the VACUUM FULL.
Also, I think we can try reducing the number of inserts into
hash_split_heap and see whether we still get code coverage of the
bucket-split code path.  Finally, I am not sure whether REINDEX INDEX
hash_split_index is required for code coverage, but we might need it
for the sake of testing that operation.

Mithun, as you are the original author of these tests, can you please
try some of the above optimizations, and any others you can think of,
and see if we can reduce the time for the hash index tests?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] ERROR: badly formatted node string "RESTRICTINFO...

2017-04-12 Thread Amit Kapila
On Wed, Apr 12, 2017 at 12:40 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>
>> However, the worker will
>> never execute such a plan as we don't generate a plan where unsafe
>> sublan/initplan is referenced in the node passed to the worker.  If we
>> want to avoid passing parallel-unsafe subplans to workers, then I
>> think we can maintain a list of parallel safe subplans along with
>> subplans in PlannerGlobal and PlannedStmt or maybe keep a parallel
>> safe flag in Plan so that we can pass only parallel safe plans to
>> workers.
>
> Right, we could, say, leave a hole in the subplan list corresponding
> to any subplan that's not parallel-safe.  That seems like a good idea
> anyway because right now there's clearly no cross-check preventing
> a worker from trying to run such a subplan.
>
> Anyone want to draft a patch for this?
>

Please find patch attached based on above discussion.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


push-parallel-safe-subplans-workers_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-04-11 Thread Amit Kapila
On Tue, Apr 11, 2017 at 10:50 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
>
> On Tue, Apr 11, 2017 at 7:10 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>>
>>
>> Yes, and as Andres says, you don't help with those, and then you're
>> upset when your own patch doesn't get attention.
>
>
> I am not upset, I was obviously a bit disappointed which I think is a very
> natural emotion after spending weeks on it. I am not blaming any one
> individual (excluding myself) for that and neither the community at large
> for the outcome. And I've moved on. I know everyone is busy getting the
> release ready and I see no point discussing this endlessly. We have enough
> on our plates for next few weeks.
>
>>
>>
>>   Amit and others who have started to
>> dig into this patch a little bit found real problems pretty quickly
>> when they started digging.
>
>
> And I fixed them as quickly as humanly possible.
>

Yes, you have responded to them quickly, but I didn't get a chance to
re-verify all of those fixes.  However, I think the main point Robert
wants to make is that somebody needs to dig through the complete patch
to see whether there are any remaining problems with it.

>>
>>   Those problems should be addressed, and
>> review should continue (from whatever source) until no more problems
>> can be found.
>
>
> Absolutely.
>
>>
>>  The patch may
>> or may not have any data-corrupting bugs, but regressions have been
>> found and not addressed.
>
>
> I don't know why you say that regressions are not addressed. Here are a few
> things I did to address the regressions/reviews/concerns, apart from fixing
> all the bugs discovered, but please let me know if there are things I've not
> addressed.
>
> 1. Improved the interesting attrs patch that Alvaro wrote to address the
> regression discovered in fetching more heap attributes. The patch that got
> committed in fact improved certain synthetic workloads over then master.
> 2. Based on Petr and your feedback, disabled WARM on toasted attributes to
> reduce overhead of fetching/decompressing the attributes.
> 3. Added code to avoid doing second index scan when the index does not
> contain any WARM pointers. This should address the situation Amit brought up
> where only one of the indexes receive WARM inserts.
> 4. Added code to kill wrong index pointers to do online cleanup.
> 5. Added code to set a CLEAR pointer to a WARM pointer when we know that the
> entire chain is WARM. This should address the workload Dilip ran and found
> regression (I don't think he got chance to confirm that)
>

Have you by any chance tried to reproduce it at your end?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] ERROR: badly formatted node string "RESTRICTINFO...

2017-04-11 Thread Amit Kapila
On Wed, Apr 12, 2017 at 12:40 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapil...@gmail.com> writes:
>
>> I think there is a possibility of doing ExecInitNode in a parallel
>> worker for a parallel-unsafe subplan, because we pass a list of all
>> the sublans stored in planned statement.
>
> It's more than a possibility.  If you run Andreas' test case against
> HEAD, now that the can't-transmit-RestrictInfo failure is fixed,
> you will find that the parallel worker actually creates and immediately
> destroys an FDW remote connection.  (The easiest way to prove that is
> to turn on log_connections/log_disconnections.  BTW, is there any
> convenient way to attach a debugger to a parallel worker process as it's
> being launched?
>

What I generally do to debug the parallel-worker case is to add a
"while (1) {}" style loop at the beginning of ParallelQueryMain() or
ParallelWorkerMain(), depending on the area I want to debug; in this
case it would be okay to add such a loop in ParallelQueryMain().  Then,
while the worker is waiting there, attach the debugger to that process
and resume debugging.  It is easier to identify the worker process if
we add an infinite loop; otherwise, one can use sleep or some other
form of wait or a debug-break mechanism as well.
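
For instance, a temporary hack like the below at the top of
ParallelQueryMain() (not for commit; the variable name is arbitrary):

    /* Debugging aid: park the worker so that a debugger can be attached. */
    volatile bool hold = true;

    while (hold)
        pg_usleep(1000000L);    /* attach to this PID, then "set var hold = 0" */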

>  I couldn't manage to catch the worker in the act.)
>
> So I think this is clearly a Bad Thing and we'd better do something
> about it.  The consequences for postgres_fdw aren't so awful perhaps,
> but other non-parallel-safe FDWs might have bigger problems with this
> behavior.
>
>> However, the worker will
>> never execute such a plan as we don't generate a plan where unsafe
>> sublan/initplan is referenced in the node passed to the worker.  If we
>> want to avoid passing parallel-unsafe subplans to workers, then I
>> think we can maintain a list of parallel safe subplans along with
>> subplans in PlannerGlobal and PlannedStmt or maybe keep a parallel
>> safe flag in Plan so that we can pass only parallel safe plans to
>> workers.
>
> Right, we could, say, leave a hole in the subplan list corresponding
> to any subplan that's not parallel-safe.  That seems like a good idea
> anyway because right now there's clearly no cross-check preventing
> a worker from trying to run such a subplan.
>
> Anyone want to draft a patch for this?
>

Yes, I will work on it this week, possibly today or tomorrow, and will
either produce a patch or, if I face any problems, report them here.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] ERROR: badly formatted node string "RESTRICTINFO...

2017-04-11 Thread Amit Kapila
On Tue, Apr 11, 2017 at 5:29 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> (BTW, I've not yet looked to see if this needs to be back-ported.)
>
> postgres_fdw will definitely include RestrictInfos in its fdw_private
> list in 9.6.  However, I've been unable to provoke a visible failure.
> After some rooting around, the reason seems to be that:
>
> 1. postgres_fdw doesn't mark its plans as parallel-safe --- it doesn't
> even have a IsForeignScanParallelSafe method.  So you'd think that one
> of its plans could never be transmitted to a parallel worker ... but:
>
> 2. Andreas' test query doesn't have the foreign scan in the main query
> tree, it's in an InitPlan.  9.6 did not transmit any subplans or initplans
> to parallel workers, but HEAD does.  So HEAD sends the illegal structure
> to the worker which spits up on trying to read a RestrictInfo.
>
> I think the fact that we see this problem at all may indicate an
> oversight in commit 5e6d8d2bbbcace304450b309e79366c0da4063e4 ("Allow
> parallel workers to execute subplans").  If the worker were to actually
> run the initplan, bad things would happen (the worker would create its
> own remote connection, which we don't want).  Now, in the problem plan
> the InitPlan is actually attached to the topmost Gather, which I think
> is safe because it'll be run by the master, but I wonder if we're being
> careful enough about non-parallel-safe plans for initplans/subplans.
>

Initplans are never marked parallel-safe; only subplans that are
generated for parallel-safe paths are marked as parallel-safe.


> Also, even if the worker never actually executes the plan node, just
> doing ExecInitNode on it in a parallel worker might be more than a
> non-parallel-safe FDW is prepared to cope with.
>

I think there is a possibility of doing ExecInitNode in a parallel
worker for a parallel-unsafe subplan, because we pass the list of all
the subplans stored in the planned statement.  However, the worker
will never execute such a plan, as we don't generate a plan where an
unsafe subplan/initplan is referenced in the node passed to the
worker.  If we want to avoid passing parallel-unsafe subplans to
workers, then I think we can maintain a list of parallel-safe subplans
alongside subplans in PlannerGlobal and PlannedStmt, or maybe keep a
parallel-safe flag in Plan so that we can pass only parallel-safe
plans to workers.
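
To illustrate the latter idea, the filtering in ExecSerializePlan could
look roughly like the below (untested sketch, assuming a parallel_safe
flag is available on Plan; NULL placeholders keep the subplan numbering
stable for the worker):

    ListCell   *lc;

    pstmt->subplans = NIL;
    foreach(lc, estate->es_plannedstmt->subplans)
    {
        Plan       *subplan = (Plan *) lfirst(lc);

        /* don't transmit parallel-unsafe subplans, but keep list positions */
        if (subplan && !subplan->parallel_safe)
            subplan = NULL;

        pstmt->subplans = lappend(pstmt->subplans, subplan);
    }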


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange parallel query behavior after OOM crashes

2017-04-05 Thread Amit Kapila
On Wed, Apr 5, 2017 at 12:35 PM, Kuntal Ghosh
<kuntalghosh.2...@gmail.com> wrote:
> On Tue, Apr 4, 2017 at 11:22 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> On 04/04/2017 06:52 PM, Robert Haas wrote:
>>>
>>> On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2...@gmail.com>
>>> wrote:
>>>>
>>>> On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmh...@gmail.com>
>>>> wrote:
>>>>>
>>>>> On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
>>>>> <kuntalghosh.2...@gmail.com> wrote:
>>>>>>
>>>>>> 2. the server restarts automatically, initialize
>>>>>> BackgroundWorkerData->parallel_register_count and
>>>>>> BackgroundWorkerData->parallel_terminate_count in the shared memory.
>>>>>> After that, it calls ForgetBackgroundWorker and it increments
>>>>>> parallel_terminate_count.
>>>>>
>>>>>
>>>>> Hmm.  So this seems like the root of the problem.  Presumably those
>>>>> things need to be reset AFTER forgetting any background workers from
>>>>> before the crash.
>>>>>
>>>> IMHO, the fix would be not to increase the terminated parallel worker
>>>> count whenever ForgetBackgroundWorker is called due to a bgworker
>>>> crash. I've attached a patch for the same. PFA.
>>>
>>>
>>> While I'm not opposed to that approach, I don't think this is a good
>>> way to implement it.  If you want to pass an explicit flag to
>>> ForgetBackgroundWorker telling it whether or not it should performing
>>> the increment, fine.  But with what you've got here, you're
>>> essentially relying on "spooky action at a distance".  It would be
>>> easy for future code changes to break this, not realizing that
>>> somebody's got a hard dependency on 0 having a specific meaning.
>>>
>>
>> I'm probably missing something, but I don't quite understand how these
>> values actually survive the crash. I mean, what I observed is OOM followed
>> by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the
>> values back to 0? Or do we call ForgetBackgroundWorker() after the crash for
>> some reason?
> AFAICU, during crash recovery, we wait for all non-syslogger children
> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform
> StartupDataBase. While starting the startup process we check if any
> bgworker is scheduled for a restart.
>

In general, your theory appears right, but can you check how it
behaves on a standby server?  There is a difference in how the startup
process behaves during master and standby startup: on a master it
stops after recovery, whereas on a standby it keeps running to receive
WAL.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2017-04-04 Thread Amit Kapila
On Wed, Apr 5, 2017 at 3:36 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 27 March 2017 at 15:36, Beena Emerson <memissemer...@gmail.com> wrote:
>
>> 02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes
>> the internal representation of max_wal_size and min_wal_size to mb.
>
> Committed first part to allow internal representation change (only).
>
> No commitment yet to increasing wal-segsize in the way this patch has it.
>

Which part of the patch don't you like, and do you have any
suggestions to improve it?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Scan Mode in Hash Index

2017-04-04 Thread Amit Kapila
On Mon, Mar 27, 2017 at 7:04 PM, Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>
> My guess (which could be wrong) is that so->hashso_bucket_buf =
>> InvalidBuffer should be moved back up higher in the function where it
>> was before, just after the first if statement, and that the new
>> condition so->hashso_bucket_buf == so->currPos.buf on the last
>> _hash_dropbuf() should be removed.  If that's not right, then I think
>> I need somebody to explain why not.
>
> Okay, as i mentioned above, in case of page scan mode we only keep pin
> on a bucket buf. There won't be any case where we will be having pin
> on overflow buf at the end of scan.
>

What if we mark the buffer as invalid after releasing the pin?  We
already do it that way in the btree code; refer to
_bt_drop_lock_and_maybe_pin().  I think if we do it that way, then we
can do what Robert is suggesting.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Scan Mode in Hash Index

2017-04-04 Thread Amit Kapila
On Sun, Apr 2, 2017 at 4:14 AM, Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>
> Please note that these patches needs to be applied on top of  [1].
>

Few more review comments:

1.
- page = BufferGetPage(so->hashso_curbuf);
+ blkno = so->currPos.currPage;
+ if (so->hashso_bucket_buf == so->currPos.buf)
+ {
+ buf = so->currPos.buf;
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+ }

Here, you are assuming that only the bucket page can be pinned, but I
think that assumption is not right.  When _hash_kill_items() is called
before moving to the next page, there could be a pin on the overflow
page.  You need some logic to check whether the buffer is pinned, and
in that case just lock it.  I think once you do that, it might not be
convenient to release the pin at the end of this function.

Add some comments on top of _hash_kill_items to explain the new
changes, or say something like "See _bt_killitems for details".

2.
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later.  This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = PageGetLSN(page);
+

The second part of the above comment doesn't make sense, because
vacuum will still be blocked while we hold the pin on the primary
bucket page.

3.
+ {
+ /*
+ * No more matching tuples were found. return FALSE
+ * indicating the same. Also, remember the prev and
+ * next block number so that if fetching tuples using
+ * cursor we remember the page from where to start the
+ * scan.
+ */
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

You can't read opaque without holding the lock, and by this time it
has already been released.  Also, I think that if you want to save the
position for cursor movement, then you need to save the position of
the last bucket page when the scan completes the overflow chain;
however, as you have written it, it will always be an invalid block
number.  I think there is a similar problem with the prev block
number.

4.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ IndexTuple  itup;
+ int itemIndex;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* load items[] in ascending order */
+ itemIndex = 0;
+
+ /* new page, relocate starting position by binary search */
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);

Why do we need to find the offset number when this function already
has it as an input parameter?

5.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)

I think maxoff does not need to be passed as a parameter to this
function, as you need it only for a forward scan.  You can compute it
locally.

6.
+_hash_load_qualified_items(IndexScanDesc scan, Page page, OffsetNumber offnum,
+   OffsetNumber maxoff, ScanDirection dir)
{
..
+ if (ScanDirectionIsForward(dir))
+ {
..
+ while (offnum <= maxoff)
{
..
+ if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
+ _hash_checkqual(scan, itup))
+ {
+ /* tuple is qualified, so remember it */
+ _hash_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+
+ offnum = OffsetNumberNext(offnum);
..

Why are you traversing the whole page even when there is no match?
There is a similar problem in the backward scan.  I think this is a
blunder.

7.
+ if (so->currPos.buf == so->hashso_bucket_buf ||
+ so->currPos.buf == so->hashso_split_bucket_buf)
+ {
+ so->currPos.prevPage = InvalidBlockNumber;
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ so->currPos.prevPage = (opaque)->hasho_prevblkno;
+ _hash_relbuf(rel, so->currPos.buf);
+ }
+
+ so->currPos.nextPage = (opaque)->hasho_nextblkno;

What makes you think it is safe to read opaque after releasing the lock?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint

2017-04-01 Thread Amit Kapila
On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma <ashu.coe...@gmail.com> 
> wrote:
>> Well, That is another option but the main idea was to be inline with
>> the btree code.
>
> That's not a bad goal in principal, but _bt_killitems() doesn't have
> any similar argument.
>
> (Also, speaking of consistency, why did we end up with
> _hash_kill_items, with an underscore between kill and items, and
> _bt_killitems, without one?)
>
>> Moreover, I am not sure if acquiring lwlock inside
>> hashendscan (basically the place where we are trying to close down the
>> things) would look good.
>
> Well, if that's not a good thing to do, hiding it inside some other
> function doesn't make it better.  I think it's fine, though.
>

One thing to note here is that this fix won't be required if we get
the page-scan-mode patch [1] into this CF.  I think if we fix this
with the patch proposed by Ashutosh, then we will anyway need to
change the related code again (essentially reverting the fix) after
the page-scan-mode patch.  Now, if you think we have a negligible
chance of getting that patch in, then it is fine to proceed with this
fix.

[1] - https://commitfest.postgresql.org/13/999/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Scan Mode in Hash Index

2017-04-01 Thread Amit Kapila
On Mon, Mar 27, 2017 at 7:04 PM, Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>>
>> I am suspicious that _hash_kill_items() is going to have problems if
>> the overflow page is freed before it reacquires the lock.
>> _btkillitems() contains safeguards against similar cases.
>
> I have added similar check for hash_kill_items() as well.
>

I think _hash_kill_items has a much bigger problem, which is that we
can't kill the items if the page has been modified since we first read
it.  Consider a case where vacuum starts before the scan on the bucket,
the scan then goes ahead (which is possible after your 0003 patch) and
notes the killed items in the killed-items array, and before it can
kill all the items, vacuum removes them.  Now it is quite possible
that before the scan tries to kill those items, the corresponding
itemids have been reused by different tuples.  I think what we need to
do here is to store the LSN of the page the first time we read it and
then compare it with the current page LSN after re-reading the page in
_hash_kill_items.
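
In other words, something like the below sketch, following what
_bt_killitems does:

    /* when the matching items are first collected from the page */
    so->currPos.lsn = PageGetLSN(page);
    ...

    /* later, in _hash_kill_items(), after re-locking the page */
    page = BufferGetPage(buf);
    if (PageGetLSN(page) != so->currPos.lsn)
    {
        /*
         * The page has changed since we read it, so the saved offsets may
         * now refer to different tuples; don't set any LP_DEAD bits.
         */
        so->numKilled = 0;
        return;
    }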

*
+ HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+} HashScanPosData;
..

HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
  int numKilled; /* number of currently stored items */
+
+ /*
+ * Identify all the matching items on a page and save them
+ * in HashScanPosData
+ */
+ HashScanPosData currPos; /* current position data */
 } HashScanOpaqueData;


After having the array of HashScanPosItems as currPos.items, I think
you don't need an array of HashScanPosItem for killedItems; an integer
array of indexes into currPos.items should be sufficient.
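
That is, something like the below sketch:

    int        *killedItems;    /* indexes into so->currPos.items[] */
    int         numKilled;      /* number of currently stored items */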


*
I think the below para in the README is no longer valid after this patch.

"To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to
re-find
their current scan position after re-acquiring the buffer content lock
on page.  Since deletion is not possible while a reader holds the pin
on bucket, and we assume that heap tuple TIDs are unique, this can be
implemented by searching for the same heap tuple TID previously
returned.  Insertion does not move index entries across pages, so the
previously-returned index entry should always be on the same page, at
the same or higher offset number,
as it was before."

*
- next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
-   LH_OVERFLOW_PAGE,
-   bstrategy);
-

  /*
- * release the lock on previous page after acquiring the lock on next
- * page
+ * As the hash index scan work in page at a time mode,
+ * vacuum can release the lock on previous page before
+ * acquiring lock on the next page.
  */
..
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+   LH_OVERFLOW_PAGE,
+   bstrategy);
+

After this change, you need to modify the comments on top of the
functions hashbucketcleanup() and _hash_squeezebucket().



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-04-01 Thread Amit Kapila
On Fri, Mar 31, 2017 at 11:54 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
> On Fri, Mar 31, 2017 at 11:16 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>
>>   Now, I understand you to be suggesting a flag at
>> table-creation time that would, maybe, be immutable after that, but
>> even then - are we going to run completely unmodified 9.6 code for
>> tables where that's not enabled, and only go through any of the WARM
>> logic when it is enabled?  Doesn't sound likely.  The commits already
>> made from this patch series certainly affect everybody, and I can't
>> see us adding switches that bypass
>> ce96ce60ca2293f75f36c3661e4657a3c79ffd61 for example.
>
>
> I don't think I am going to claim that either. But probably only 5% of the
> new code would then be involved. Which is a lot less and a lot more
> manageable. Having said that, I think if we at all do this, we should only
> do it based on our experiences in the beta cycle, as a last resort. Based on
> my own experiences during HOT development, long running pgbench tests, with
> several concurrent clients, subjected to multiple AV cycles and periodic
> consistency checks, usually brings up issues related to heap corruption. So
> my confidence level is relatively high on that part of the code. That's not
> to suggest that there can't be any bugs.
>
> Obviously then there are other things such as regression to some workload or
> additional work required by vacuum etc. And I think we should address them
> and I'm fairly certain we can do that. It may not happen immediately, but if
> we provide right knobs, may be those who are affected can fall back to the
> old behaviour or not use the new code at all while we improve things for
> them.
>

Okay, but even if we want to provide knobs, there should be some
consensus on them.  I am sure introducing an additional pass over the
index has some impact, so we should either have some way to reduce
that impact or have some additional design to handle it.  Do you think
it makes sense to start a separate thread to discuss this and get
feedback, as I am not seeing much input on the knobs you are proposing
to handle the second pass over the index?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

2017-03-31 Thread Amit Kapila
On Thu, Mar 30, 2017 at 8:24 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Mar 29, 2017 at 8:00 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> What is however strange is that changing max_parallel_workers_per_gather
>> affects row estimates *above* the Gather node. That seems a bit, um,
>> suspicious, no? See the parallel-estimates.log.
>
> Thanks for looking at this!  Comparing the parallel plan vs. the
> non-parallel plan:
>
> part: parallel rows (after Gather) 20202, non-parallel rows 20202
> partsupp: parallel rows 18, non-parallel rows 18
> part-partsupp join: parallel rows 88988, non-parallel rows 355951
> lineitem: parallel rows 59986112, non-parallel rows 59986112
> lineitem after aggregation: parallel rows 5998611, non-parallel rows 5998611
> final join: parallel rows 131, non-parallel rows 524
>
> I agree with you that that looks mighty suspicious.  Both the
> part-partsupp join and the final join have exactly 4x as many
> estimated rows in the non-parallel plan as in the parallel plan, and
> it just so happens that the parallel divisor here will be 4.
>
> Hmm... it looks like the parallel_workers value from the Gather node
> is being erroneously propagated up to the higher levels of the plan
> tree.   Wow.   Somehow, Gather Merge managed to get the logic correct
> here, but Gather is totally wrong.  Argh.   Attached is a draft patch,
> which I haven't really tested beyond checking that it passes 'make
> check'.
>

Your patch looks good to me.  I have verified some join cases as well,
where the behaviour is sane after the patch.  I have also done testing
with force_parallel_mode=regress (ran make check-world) and everything
seems good.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Supporting huge pages on Windows

2017-03-31 Thread Amit Kapila
On Fri, Mar 31, 2017 at 8:05 AM, Tsunakawa, Takayuki
<tsunakawa.ta...@jp.fujitsu.com> wrote:
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
>> The latest patch looks good to me apart from one Debug message, so I have
>> marked it as Ready For Committer.
>
> Thank you so much!
>
>
>> + ereport(DEBUG1,
>> + (errmsg("disabling huge pages")));
>>
>> I think this should be similar to what we display in sysv_shmem.c as below:
>>
>> elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
>> allocsize);
>
> I understood you suggested this to make the reason clear for disabling huge 
> pages.  OK, done as follows.
>
> +   elog(DEBUG1, "CreateFileMapping(%llu) with SEC_LARGE_PAGES 
> failed "
> +"due to insufficient large pages, huge pages disabled",
> +size);
>

You should use %zu instead of %llu to print a variable of type Size.
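
For example (just illustrating the format specifier, assuming size is
declared as Size):

    elog(DEBUG1, "CreateFileMapping(%zu) with SEC_LARGE_PAGES failed "
         "due to insufficient large pages, huge pages disabled",
         size);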

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-30 Thread Amit Kapila
On Thu, Mar 30, 2017 at 5:55 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
> On Thu, Mar 30, 2017 at 5:27 PM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>>
>>
>>
>> How have you verified that?  Have you checked that in
>> heap_prepare_insert it has called toast_insert_or_update() and then
>> returned a tuple different from what the input tup is?  Basically, I
>> am easily able to see it and even the reason why the heap and index
>> tuples will be different.  Let me try to explain,
>> toast_insert_or_update returns a new tuple which contains compressed
>> data and this tuple is inserted in heap where as slot still refers to
>> original tuple (uncompressed one) which is passed to heap_insert.
>> Now, ExecInsertIndexTuples and the calls under it like FormIndexDatum
>> will refer to the tuple in slot which is uncompressed and form the
>> values[] using uncompressed value.
>
>
> Ah, yes. You're right. Not sure why I saw things differently. That doesn't
> anything though because during recheck we'll get compressed value and not do
> anything with it. In the index we already have compressed value and we can
> compare them. Even if we decide to decompress everything and do the
> comparison, that should be possible.
>

I think we should not treat compression and decompression as free at
this point in the code, because we hold a buffer lock during recheck.
Buffer locks are meant to be short-term (it is even mentioned in
storage/buffer/README), so doing all the
compression/decompression/detoast work under these locks doesn't seem
advisable to me.  It can block many concurrent operations.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-30 Thread Amit Kapila
On Thu, Mar 30, 2017 at 4:07 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
> On Wed, Mar 29, 2017 at 4:42 PM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>>
>> On Wed, Mar 29, 2017 at 1:10 PM, Pavan Deolasee
>> <pavan.deola...@gmail.com> wrote:
>> >
>> > On Wed, Mar 29, 2017 at 12:02 PM, Amit Kapila <amit.kapil...@gmail.com>
>> > wrote:
>> >>
>> >> On Wed, Mar 29, 2017 at 11:52 AM, Amit Kapila <amit.kapil...@gmail.com>
>> >> wrote:
>> >
>> > Then during recheck, we pass already compressed values to
>> > index_form_tuple(). But my point is, the following code will ensure that
>> > we
>> > don't compress it again. My reading is that the first check for
>> > !VARATT_IS_EXTENDED will return false if the value is already
>> > compressed.
>> >
>>
>> You are right.  I was confused with previous check of VARATT_IS_EXTERNAL.
>>
>
> Ok, thanks.
>
>>
>> >
>> > TBH I couldn't find why the original index insertion code will always
>> > supply
>> > uncompressed values.
>> >
>>
>> Just try by inserting large value of text column ('aa.bbb')
>> upto 2.5K.  Then have a breakpoint in heap_prepare_insert and
>> index_form_tuple, and debug both the functions, you can find out that
>> even though we compress during insertion in heap, the index will
>> compress the original value again.
>>
>
> Ok, tried that. AFAICS index_form_tuple gets compressed values.
>

How have you verified that?  Have you checked that
heap_prepare_insert has called toast_insert_or_update() and then
returned a tuple different from the input tuple?  Basically, I am
easily able to see it, and even the reason why the heap and index
tuples will be different.  Let me try to explain:
toast_insert_or_update returns a new tuple which contains compressed
data, and this tuple is inserted into the heap, whereas the slot still
refers to the original (uncompressed) tuple which was passed to
heap_insert.  Now, ExecInsertIndexTuples and the calls under it, like
FormIndexDatum, will refer to the tuple in the slot, which is
uncompressed, and will form values[] using the uncompressed value.
Try a simple case like the one below:

Create table t_comp(c1 int, c2 text);
Create index idx_t_comp_c2 on t_comp(c2);
Create index idx_t_comp_c1 on t_comp(c1);

Insert into t_comp values (1, 'aaa...aaa');

Repeat 'a' in the above line 2700 times or so.  You should notice what
I am explaining above.


>>
>>
>> Yeah probably you are right, but I am not sure if it is good idea to
>> compare compressed values.
>>
>
> Again, I don't see a problem there.
>
>>
>> I think with this new changes in btrecheck, it would appear to be much
>> costlier as compare to what you have few versions back.  I am afraid
>> that it can impact performance for cases where there are few WARM
>> updates in chain and many HOT updates as it will run recheck for all
>> such updates.
>
>
>
> INSERT INTO pgbench_accounts SELECT generate_series(:start, :end )::text ||
> <2300 chars string>, (random()::bigint) % :scale, 0;
>
> CREATE UNIQUE INDEX pgb_a_aid ON pgbench_accounts(aid);
> CREATE INDEX pgb_a_filler1 ON pgbench_accounts(filler1);
> CREATE INDEX pgb_a_filler2 ON pgbench_accounts(filler2);
> CREATE INDEX pgb_a_filler3 ON pgbench_accounts(filler3);
> CREATE INDEX pgb_a_filler4 ON pgbench_accounts(filler4);
>
> -- Force a WARM update on one row
> UPDATE pgbench_accounts SET filler1 = 'X' WHERE aid = '100' ||
> repeat('abcdefghij', 2);
>
> Test:
> -- Fetch the row using the fat index. Since the row contains a
> BEGIN;
> SELECT substring(aid, 1, 10) FROM pgbench_accounts WHERE aid = '100' ||
> <2300 chars string> ORDER BY aid;
> UPDATE pgbench_accounts SET abalance = abalance + 100 WHERE aid = '100' ||
> <2300 chars string>;
> END;
>
> I did 4 5-minutes runs with master and WARM and there is probably a 2-3%
> regression.
>

So IIUC, in the above test you have one WARM update during
initialization, and then during the actual test all updates are HOT.
In such a case, won't the WARM chain be converted to HOT by vacuum, so
that all updates from then on are HOT and probably need no rechecks?

> (Results with 5 mins tests, txns is total for 5 mins, idx_scan is number of
> scans on the fat index)
> master:
> txns  idx_scan
> 414117 828233
> 411109 822217
> 411848 823695
> 408424 816847
>
> WARM:
> txns   idx_scan
> 404139 808277
> 398880 797759
> 399949 799897
> 397927 795853
>
> ==
>
> I then also repeated the tests, but this time using compressible values. The
> reg

Re: [HACKERS] Supporting huge pages on Windows

2017-03-30 Thread Amit Kapila
On Thu, Mar 9, 2017 at 7:06 AM, Tsunakawa, Takayuki
<tsunakawa.ta...@jp.fujitsu.com> wrote:
> From: pgsql-hackers-ow...@postgresql.org
>> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Ashutosh Sharma
>> To start with, I ran the regression test-suite and didn't find any failures.
>> But, then I am not sure if huge_pages are getting used or not. However,
>> upon checking the settings for huge_pages and I found it as 'on'. I am
>> assuming, if huge pages is not being used due to shortage of large pages,
>> it should have fallen back to non-huge pages.
>
> You are right, the server falls back to non-huge pages when the large pages 
> run short.
>

The latest patch looks good to me apart from one Debug message, so I
have marked it as Ready For Committer.

+ ereport(DEBUG1,
+ (errmsg("disabling huge pages")));

I think this should be similar to what we display in sysv_shmem.c as below:

elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
allocsize);


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inconsistent page found on STANDBY server

2017-03-30 Thread Amit Kapila
On Thu, Mar 30, 2017 at 1:55 PM, Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>
> RCA:
> 
> After debugging the hash index code for deletion, I could find that
> while registering data for xlog record 'XLOG_HASH_UPDATE_META_PAGE' we
> are not passing the correct length of data being registered and
> therefore, data (xl_hash_update_meta_page) is not completely recorded
> into the wal record.
>
> Fix:
> ===
> Attached patch fixes this issue.
>

The fix looks good to me.  I have scanned the hash index code to see
if there is any other similar problem, but couldn't find any.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-29 Thread Amit Kapila
On Wed, Mar 29, 2017 at 1:10 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
> On Wed, Mar 29, 2017 at 12:02 PM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>>
>> On Wed, Mar 29, 2017 at 11:52 AM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>
> Then during recheck, we pass already compressed values to
> index_form_tuple(). But my point is, the following code will ensure that we
> don't compress it again. My reading is that the first check for
> !VARATT_IS_EXTENDED will return false if the value is already compressed.
>

You are right.  I was confused by the previous check for VARATT_IS_EXTERNAL.

>
> TBH I couldn't find why the original index insertion code will always supply
> uncompressed values.
>

Just try inserting a large text-column value ('aa.bbb') of up
to 2.5K.  Then set breakpoints in heap_prepare_insert and
index_form_tuple and debug both functions; you will find that even
though we compress the value during insertion into the heap, the
index will compress the original value again.
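
The check in question in index_form_tuple() looks roughly like this
(paraphrased from indextuple.c), which is why an uncompressed value
above the target gets compressed again for the index even though the
heap already stored its own compressed copy:

	if (!VARATT_IS_EXTENDED(DatumGetPointer(untoasted_values[i])) &&
		VARSIZE(DatumGetPointer(untoasted_values[i])) > TOAST_INDEX_TARGET &&
		(att->attstorage == 'x' || att->attstorage == 'm'))
	{
		Datum		cvalue = toast_compress_datum(untoasted_values[i]);

		if (DatumGetPointer(cvalue) != NULL)
		{
			/* successful compression */
			untoasted_values[i] = cvalue;
			untoasted_free[i] = true;
		}
	}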

> But even if does, and even if the recheck gets it in
> compressed form, I don't see how we will double-compress that.
>

No, as I agreed above, it won't double-compress, but it still looks
slightly risky to rely on a different set of values being passed to
index_form_tuple and then compare them.

> As far as, comparing two compressed values go, I don't see a problem there.
> Exact same compressed values should decompress to exact same value. So
> comparing two compressed values and two uncompressed values should give us
> the same result.
>

Yeah, probably you are right, but I am not sure it is a good idea to
compare compressed values.

I think with these new changes in btrecheck, it appears to be much
costlier compared to what you had a few versions back.  I am afraid
that it can impact performance for cases where there are few WARM
updates in a chain and many HOT updates, as it will run the recheck
for all such updates.  Did we ever try to measure the performance of
cases like that?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-29 Thread Amit Kapila
On Wed, Mar 29, 2017 at 12:51 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> On Wed, Mar 29, 2017 at 10:12 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Few other comments:
>> +/*
>> + * This is just a trick to save a division operation. If you look into the
>> + * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will 
>> indicate
>> + * which phase of allocation the bucket_num belongs to with in the group. 
>> This
>> + * is because at every splitpoint group we allocate (2 ^ x) buckets and we 
>> have
>> + * divided the allocation process into 4 equal phases. This macro returns 
>> value
>> + * from 0 to 3.
>> + */
>>
>> +#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
>> + (((bucket_num) >> (sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)) & \
>> + SPLITPOINT_PHASE_MASK)
>> This won't work if we change SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE to
>> number other than 3.  I think you should change it so that it can work
>> with any value of  SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE.
>
> Fixed, using SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE was accidental. All
> I need is most significant 3 bits hence should be subtracted by 3
> always.
>

Okay, your current patch looks good to me apart from a few minor
comments, so I have marked it as Ready for Committer.  Please either
merge sort_hash_b_2.patch with the main patch or submit it along with
the next revision for easier reference.

Few minor comments:
1.
+splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
+(which belong to splitpoint group 0's phase 1 and phase 2 respectively) always
+appear at block numbers 1 and 2, just after the meta page.

This explanation doesn't seem right, as with the current patch we
start phased allocation only after splitpoint group 9.

2.
-#define HASH_MAX_SPLITPOINTS 32
 #define HASH_MAX_BITMAPS 128

+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE 10
+#define HASH_MAX_SPLITPOINTS \
+ ((32 - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) * \
+ SPLITPOINT_PHASES_PER_GRP) + \
+ SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE

You have changed the value of HASH_MAX_SPLITPOINTS, but the comments
explaining that value are still unchanged.  Refer to the text below:
"The limitation on the size of spares[] comes from the fact that there's
 * no point in having more than 2^32 buckets with only uint32 hashcodes."
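
Something along these lines would keep that comment in sync with the
new value (the wording is only a suggestion):

/*
 * With SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE = 10 and 4 phases per later
 * group, spares[] needs 10 + (32 - 10) * 4 = 98 slots.  The cap of 32
 * splitpoint groups still comes from the fact that there's no point in
 * having more than 2^32 buckets with only uint32 hashcodes.
 */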



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-29 Thread Amit Kapila
On Tue, Mar 28, 2017 at 10:31 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
> On Tue, Mar 28, 2017 at 4:05 PM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>>
>> As asked previously, can you explain me on what basis are you
>> considering it robust?  The comments on top of datumIsEqual() clearly
>> indicates the danger of using it for toasted values (Also, it will
>> probably not give the answer you want if either datum has been
>> "toasted".).
>
>
> Hmm. I don' see why the new code in recheck is unsafe. The index values
> themselves can't be toasted (IIUC), but they can be compressed.
> index_form_tuple() already untoasts any toasted heap attributes and
> compresses them if needed. So once we pass heap values via
> index_form_tuple() we should have exactly the same index values as they were
> inserted. Or am I missing something obvious here?
>

I don't think relying on datum comparison for compressed values from
the heap and the index is safe (even if you try to form the index
tuple from the heap value again during recheck), and I have mentioned
one of the hazards of doing so upthread.  Do you see anywhere else
where we rely on comparison of compressed values?
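
Just to illustrate the kind of alternative I have in mind (a rough
sketch only, not something from the patch; the helper name is made up
and it applies to varlena columns): detoast both sides before doing
the binary comparison, accepting the extra decompression cost.

static bool
detoasted_datums_equal(Datum a, Datum b)
{
	struct varlena *va = pg_detoast_datum((struct varlena *) DatumGetPointer(a));
	struct varlena *vb = pg_detoast_datum((struct varlena *) DatumGetPointer(b));
	bool		result;

	result = (VARSIZE(va) == VARSIZE(vb) &&
			  memcmp(VARDATA(va), VARDATA(vb),
					 VARSIZE(va) - VARHDRSZ) == 0);

	/* free copies made by detoasting, if any */
	if (va != (struct varlena *) DatumGetPointer(a))
		pfree(va);
	if (vb != (struct varlena *) DatumGetPointer(b))
		pfree(vb);

	return result;
}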

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-29 Thread Amit Kapila
On Wed, Mar 29, 2017 at 11:52 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Tue, Mar 28, 2017 at 10:35 PM, Pavan Deolasee
> <pavan.deola...@gmail.com> wrote:
>>
>> On Tue, Mar 28, 2017 at 7:04 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>>>
>>>   For such an heap insert, we will pass
>>> the actual value of column to index_form_tuple during index insert.
>>> However during recheck when we fetch the value of c2 from heap tuple
>>> and pass it index tuple, the value is already in compressed form and
>>> index_form_tuple might again try to compress it because the size will
>>> still be greater than TOAST_INDEX_TARGET and if it does so, it might
>>> make recheck fail.
>>>
>>
>> Would it? I thought  "if
>> (!VARATT_IS_EXTENDED(DatumGetPointer(untoasted_values[i]))" check should
>> prevent that. But I could be reading those macros wrong. They are probably
>> heavily uncommented and it's not clear what each of those VARATT_* macro do.
>>
>
> That won't handle the case where it is simply compressed.  You need
> check like VARATT_IS_COMPRESSED to take care of compressed heap
> tuples, but then also it won't work because heap_tuple_fetch_attr()
> doesn't handle compressed tuples.  You need to use
> heap_tuple_untoast_attr() to handle the compressed case.  Also, we
> probably need to handle other type of var attrs.  Now, If we want to
> do all of that index_form_tuple()  might not be the right place, we
> probably want to handle it in caller or provide an alternate API.
>

Another related point: index_form_tuple() has a check for
VARATT_IS_EXTERNAL, not VARATT_IS_EXTENDED, so maybe that is the cause
of the confusion; but as I mentioned, even if you change the check,
heap_tuple_fetch_attr won't suffice.
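
For reference, the two macros test different things (per the
definitions in postgres.h):

/*
 * VARATT_IS_EXTERNAL(PTR)  - true only for out-of-line (toast pointer) values
 * VARATT_IS_EXTENDED(PTR)  - true for anything that is not plain in-line
 *                            uncompressed data, i.e. external, compressed,
 *                            or short-header varlenas
 */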




-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-29 Thread Amit Kapila
On Tue, Mar 28, 2017 at 10:35 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
> On Tue, Mar 28, 2017 at 7:04 PM, Amit Kapila <amit.kapil...@gmail.com>
> wrote:
>>
>>
>>
>>   For such an heap insert, we will pass
>> the actual value of column to index_form_tuple during index insert.
>> However during recheck when we fetch the value of c2 from heap tuple
>> and pass it index tuple, the value is already in compressed form and
>> index_form_tuple might again try to compress it because the size will
>> still be greater than TOAST_INDEX_TARGET and if it does so, it might
>> make recheck fail.
>>
>
> Would it? I thought  "if
> (!VARATT_IS_EXTENDED(DatumGetPointer(untoasted_values[i]))" check should
> prevent that. But I could be reading those macros wrong. They are probably
> heavily uncommented and it's not clear what each of those VARATT_* macro do.
>

That won't handle the case where the value is simply compressed.  You
need a check like VARATT_IS_COMPRESSED to take care of compressed heap
tuples, but even then it won't work, because heap_tuple_fetch_attr()
doesn't handle compressed tuples.  You need to use
heap_tuple_untoast_attr() to handle the compressed case.  Also, we
probably need to handle other types of varlena attributes.  Now, if we
want to do all of that, index_form_tuple() might not be the right
place; we probably want to handle it in the caller or provide an
alternate API.
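
In other words, whoever does it would need something along these lines
before forming the index tuple for the recheck (a sketch only;
values[i] is just a placeholder for the caller's attribute array):

	/* fully detoast anything that is external or compressed */
	if (VARATT_IS_EXTENDED(DatumGetPointer(values[i])))
		values[i] = PointerGetDatum(heap_tuple_untoast_attr(
						(struct varlena *) DatumGetPointer(values[i])));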


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-28 Thread Amit Kapila
On Tue, Mar 28, 2017 at 8:56 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Mar 28, 2017 at 5:00 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
>>> This will go wrong for split point group zero.  In general, I feel if
>>> you handle computation for split groups lesser than
>>> SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE in the caller, then all your
>>> macros will become much simpler and less error prone.
>>
>> Fixed, apart from SPLITPOINT_PHASE_TO_SPLITPOINT_GRP rest all macros
>> only handle multi phase group. The SPLITPOINT_PHASE_TO_SPLITPOINT_GRP
>> is used in one more place at expand index so thought kepeping it as it
>> is is better.
>
> I wonder if we should consider increasing
> SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE somewhat.  For example, split
> point 4 is responsible for allocating only 16 new buckets = 128kB;
> doing those in four groups of two (16kB) seems fairly pointless.
> Suppose we start applying this technique beginning around splitpoint 9
> or 10.  Breaking 1024 new buckets * 8kB = 8MB of index growth into 4
> phases might save enough to be worthwhile.
>

10 sounds like a better point to start allocating in phases.
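
To put numbers on it (using the figures from the quoted mail):
splitpoint 10 doubles the bucket count from 1024 to 2048, i.e. it adds
1024 buckets * 8kB = 8MB of index at once, so dividing that allocation
into 4 phases means each phase adds only 256 buckets (2MB) at a time,
while the smaller splitpoints below it stay single-phase.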

> Of course, even if we decide to apply this even for very small
> splitpoints, it probably doesn't cost us anything other than some
> space in the metapage.  But maybe saving space in the metapage isn't
> such a bad idea anyway.
>

Yeah, metapage space is scarce, so let's try to save as much of it as possible.


Few other comments:
+/*
+ * This is just a trick to save a division operation. If you look into the
+ * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will indicate
+ * which phase of allocation the bucket_num belongs to with in the group. This
+ * is because at every splitpoint group we allocate (2 ^ x) buckets and we have
+ * divided the allocation process into 4 equal phases. This macro returns value
+ * from 0 to 3.
+ */

+#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
+ (((bucket_num) >> (sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)) & \
+ SPLITPOINT_PHASE_MASK)

This won't work if we change SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE to
a number other than 3.  I think you should change it so that it works
with any value of SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE.

I think you should name this define SPLITPOINT_PHASE_WITHIN_GROUP, as
it refers to only one particular phase within a group.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-28 Thread Amit Kapila
On Tue, Mar 28, 2017 at 4:05 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Mon, Mar 27, 2017 at 2:19 PM, Pavan Deolasee
> <pavan.deola...@gmail.com> wrote:
>> On Fri, Mar 24, 2017 at 11:49 PM, Pavan Deolasee <pavan.deola...@gmail.com>
>> wrote:
>>>
>>>
>>>
>>> On Fri, Mar 24, 2017 at 6:46 PM, Amit Kapila <amit.kapil...@gmail.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> I was worried for the case if the index is created non-default
>>>> collation, will the datumIsEqual() suffice the need.  Now again
>>>> thinking about it, I think it will because in the index tuple we are
>>>> storing the value as in heap tuple.  However today it occurred to me
>>>> how will this work for toasted index values (index value >
>>>> TOAST_INDEX_TARGET).  It is mentioned on top of datumIsEqual() that it
>>>> probably won't work for toasted values.  Have you considered that
>>>> point?
>>>>
>>>
>>> No, I haven't and thanks for bringing that up. And now that I think more
>>> about it and see the code, I think the naive way of just comparing index
>>> attribute value against heap values is probably wrong. The example of
>>> TOAST_INDEX_TARGET is one such case, but I wonder if there are other varlena
>>> attributes that we might store differently in heap and index. Like
>>> index_form_tuple() ->heap_fill_tuple seem to some churning for varlena. It's
>>> not clear to me if index_get_attr will return the values which are binary
>>> comparable to heap values.. I wonder if calling index_form_tuple on the heap
>>> values, fetching attributes via index_get_attr on both index tuples and then
>>> doing a binary compare is a more robust idea. Or may be that's just
>>> duplicating efforts.
>>>
>>> While looking at this problem, it occurred to me that the assumptions made
>>> for hash indexes are also wrong :-( Hash index has the same problem as
>>> expression indexes have. A change in heap value may not necessarily cause a
>>> change in the hash key. If we don't detect that, we will end up having two
>>> hash identical hash keys with the same TID pointer. This will cause the
>>> duplicate key scans problem since hashrecheck will return true for both the
>>> hash entries. That's a bummer as far as supporting WARM for hash indexes is
>>> concerned, unless we find a way to avoid duplicate index entries.
>>>
>>
>> Revised patches are attached. I've added a few more regression tests which
>> demonstrates the problems with compressed and toasted attributes. I've now
>> implemented the idea of creating index tuple from heap values before doing
>> binary comparison using datumIsEqual. This seems to work ok and I see no
>> reason this should not be robust.
>>
>
> As asked previously, can you explain me on what basis are you
> considering it robust?  The comments on top of datumIsEqual() clearly
> indicates the danger of using it for toasted values (Also, it will
> probably not give the answer you want if either datum has been
> "toasted".).  If you think that because we are using it during
> heap_update to find modified columns, then I think that is not right
> comparison, because there we are comparing compressed value (of old
> tuple) with uncompressed value (of new tuple) which should always give
> result as false.
>


Yet another point to consider about the recheck implementation is
whether it will work correctly when the heap tuple itself is toasted.
Consider a case where a table has an integer and a text column
(t1 (c1 int, c2 text)) and we have indexes on both c1 and c2.  Now,
insert a tuple such that the text column has a value of more than 2 or
3K, which will make it be stored in compressed form in the heap (with
the size of the compressed value still more than TOAST_INDEX_TARGET).
For such a heap insert, we will pass the actual value of the column to
index_form_tuple during the index insert.  However, during recheck,
when we fetch the value of c2 from the heap tuple and pass it to
index_form_tuple, the value is already in compressed form, and
index_form_tuple might try to compress it again because the size will
still be greater than TOAST_INDEX_TARGET; if it does so, it might make
the recheck fail.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-28 Thread Amit Kapila
On Mon, Mar 27, 2017 at 2:19 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
>
> On Fri, Mar 24, 2017 at 11:49 PM, Pavan Deolasee <pavan.deola...@gmail.com>
> wrote:
>>
>>
>>
>> On Fri, Mar 24, 2017 at 6:46 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>>>
>>>
>>>
>>> I was worried for the case if the index is created non-default
>>> collation, will the datumIsEqual() suffice the need.  Now again
>>> thinking about it, I think it will because in the index tuple we are
>>> storing the value as in heap tuple.  However today it occurred to me
>>> how will this work for toasted index values (index value >
>>> TOAST_INDEX_TARGET).  It is mentioned on top of datumIsEqual() that it
>>> probably won't work for toasted values.  Have you considered that
>>> point?
>>>
>>
>> No, I haven't and thanks for bringing that up. And now that I think more
>> about it and see the code, I think the naive way of just comparing index
>> attribute value against heap values is probably wrong. The example of
>> TOAST_INDEX_TARGET is one such case, but I wonder if there are other varlena
>> attributes that we might store differently in heap and index. Like
>> index_form_tuple() ->heap_fill_tuple seem to some churning for varlena. It's
>> not clear to me if index_get_attr will return the values which are binary
>> comparable to heap values.. I wonder if calling index_form_tuple on the heap
>> values, fetching attributes via index_get_attr on both index tuples and then
>> doing a binary compare is a more robust idea. Or may be that's just
>> duplicating efforts.
>>
>> While looking at this problem, it occurred to me that the assumptions made
>> for hash indexes are also wrong :-( Hash index has the same problem as
>> expression indexes have. A change in heap value may not necessarily cause a
>> change in the hash key. If we don't detect that, we will end up having two
>> hash identical hash keys with the same TID pointer. This will cause the
>> duplicate key scans problem since hashrecheck will return true for both the
>> hash entries. That's a bummer as far as supporting WARM for hash indexes is
>> concerned, unless we find a way to avoid duplicate index entries.
>>
>
> Revised patches are attached.
>

Noted a few cosmetic issues in 0005_warm_updates_v21:

1.
pruneheap.c(939): warning C4098: 'heap_get_root_tuples' : 'void'
function returning a value

2.
+ *  HCWC_WARM_UPDATED_TUPLE - a tuple with HEAP_WARM_UPDATED is found somewhere
+ *in the chain. Note that when a tuple is WARM
+ *updated, both old and new versions are marked
+ *with this flag/
+ *
+ *  HCWC_WARM_TUPLE  - a tuple with HEAP_WARM_TUPLE is found somewhere in
+ *  the chain.
+ *
+ *  HCWC_CLEAR_TUPLE - a tuple without HEAP_WARM_TUPLE is found somewhere in
+ *   the chain.

The description of all three flags is the same.

3.
+ *  HCWC_WARM_UPDATED_TUPLE - a tuple with HEAP_WARM_UPDATED is found somewhere
+ *in the chain. Note that when a tuple is WARM
+ *updated, both old and new versions are marked
+ *with this flag/

Spurious '/' at end of line.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-28 Thread Amit Kapila
On Mon, Mar 27, 2017 at 2:19 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
> On Fri, Mar 24, 2017 at 11:49 PM, Pavan Deolasee <pavan.deola...@gmail.com>
> wrote:
>>
>>
>>
>> On Fri, Mar 24, 2017 at 6:46 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>>>
>>>
>>>
>>> I was worried for the case if the index is created non-default
>>> collation, will the datumIsEqual() suffice the need.  Now again
>>> thinking about it, I think it will because in the index tuple we are
>>> storing the value as in heap tuple.  However today it occurred to me
>>> how will this work for toasted index values (index value >
>>> TOAST_INDEX_TARGET).  It is mentioned on top of datumIsEqual() that it
>>> probably won't work for toasted values.  Have you considered that
>>> point?
>>>
>>
>> No, I haven't and thanks for bringing that up. And now that I think more
>> about it and see the code, I think the naive way of just comparing index
>> attribute value against heap values is probably wrong. The example of
>> TOAST_INDEX_TARGET is one such case, but I wonder if there are other varlena
>> attributes that we might store differently in heap and index. Like
>> index_form_tuple() ->heap_fill_tuple seem to some churning for varlena. It's
>> not clear to me if index_get_attr will return the values which are binary
>> comparable to heap values.. I wonder if calling index_form_tuple on the heap
>> values, fetching attributes via index_get_attr on both index tuples and then
>> doing a binary compare is a more robust idea. Or may be that's just
>> duplicating efforts.
>>
>> While looking at this problem, it occurred to me that the assumptions made
>> for hash indexes are also wrong :-( Hash index has the same problem as
>> expression indexes have. A change in heap value may not necessarily cause a
>> change in the hash key. If we don't detect that, we will end up having two
>> hash identical hash keys with the same TID pointer. This will cause the
>> duplicate key scans problem since hashrecheck will return true for both the
>> hash entries. That's a bummer as far as supporting WARM for hash indexes is
>> concerned, unless we find a way to avoid duplicate index entries.
>>
>
> Revised patches are attached. I've added a few more regression tests which
> demonstrates the problems with compressed and toasted attributes. I've now
> implemented the idea of creating index tuple from heap values before doing
> binary comparison using datumIsEqual. This seems to work ok and I see no
> reason this should not be robust.
>

As asked previously, can you explain on what basis you consider it
robust?  The comment on top of datumIsEqual() clearly indicates the
danger of using it for toasted values ("Also, it will probably not
give the answer you want if either datum has been 'toasted'.").  If
you think it is robust because we use datumIsEqual during heap_update
to find modified columns, then I think that is not the right
comparison, because there we are comparing a compressed value (of the
old tuple) with an uncompressed value (of the new tuple), which should
always give false.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-28 Thread Amit Kapila
On Tue, Mar 28, 2017 at 10:43 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> On Mon, Mar 27, 2017 at 11:21 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
>
> As you have said we can solve it if we allocate buckets always in
> power-of-2 when we do hash index meta page init. But on other
> occasions, when we try to double the existing buckets we can do the
> allocation in 4 equal phases.
>
> But I think there are 2 more ways to solve same,
>
> A. Why not pass all 3 parameters high_mask, low_mask, max-buckets to
> tuplesort and let them use _hash_hashkey2bucket to figure out which
> key belong to which bucket. And then sort them. I think this way we
> make both sorting and insertion to hash index both consistent with
> each other.
>
> B. In tuple sort we can use hash function bucket = hash_key %
> num_buckets instead of existing one which does bitwise "and" to
> determine the bucket of hash key. This way we will not wrongly assign
> buckets beyond max_buckets and sorted hash keys will be in sync with
> actual insertion order of _hash_doinsert.
>
>
> I am adding both the patches Patch_A and Patch_B. My preference is
> Patch_A and I am open for suggestion.
>

I think it can work both ways.  I feel there is no hard-pressed need
to make the computation in sorting the same as what you do in
_hash_doinsert.  In patch_B, I don't think the new naming for the
variables is good.

  Assert(!a->isnull1);
- hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
+ bucket1 = DatumGetUInt32(a->datum1) % state->num_buckets;

Can we use hash_mod instead of num_buckets and retain hash1 in the
above code and other similar places?
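
For reference, the insertion-side mapping that option A would reuse is
essentially this (as in hashutil.c):

Bucket
_hash_hashkey2bucket(uint32 hashkey, Bucket maxbucket,
					 uint32 highmask, uint32 lowmask)
{
	Bucket		bucket;

	bucket = hashkey & highmask;
	if (bucket > maxbucket)
		bucket = bucket & lowmask;

	return bucket;
}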

>>+#define SPLITPOINT_PHASES_PER_GRP 4
>>+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
>>+#define Buckets_First_Split_Group 4
> Fixed.
>
>>In the above computation +2 and -2 still bothers me.  I think you need
>>to do this because you have defined split group zero to have four
>>buckets, how about if you don't force that and rather define to have
>>split point phases only from split point which has four or more
>>buckets?
>
> Okay as suggested instead of group zero having 4 phases of 1 bucket
> each, I have recalculated the spare mapping as below.
>

Few comments:
1.
+#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
+ ((sp_g < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
+ (1 << (sp_g - 1)) : \
+ ((nphase) * ((1 << (sp_g - 1)) >> 2)))

This will go wrong for split point group zero.  In general, I feel if
you handle the computation for split groups less than
SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE in the caller, then all your
macros will become much simpler and less error-prone.

2.
+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) (_hash_log2(num_bucket))

What is the use of such a define?  Can't we directly use _hash_log2
in the caller?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallelize queries containing initplans

2017-03-27 Thread Amit Kapila
On Thu, Mar 16, 2017 at 2:34 AM, Kuntal Ghosh
<kuntalghosh.2...@gmail.com> wrote:
> On Tue, Mar 14, 2017 at 3:20 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Based on that idea, I have modified the patch such that it will
>> compute the set of initplans Params that are required below gather
>> node and store them as bitmap of initplan params at gather node.
>> During set_plan_references, we can find the intersection of external
>> parameters that are required at Gather or nodes below it with the
>> initplans that are passed from same or above query level. Once the set
>> of initplan params are established, we evaluate those (if they are not
>> already evaluated) before execution of gather node and then pass the
>> computed value to each of the workers.   To identify whether a
>> particular param is parallel safe or not, we check if the paramid of
>> the param exists in initplans at same or above query level.  We don't
>> allow to generate gather path if there are initplans at some query
>> level below the current query level as those plans could be
>> parallel-unsafe or undirect correlated plans.
>
> I would like to mention different test scenarios with InitPlans that
> we've considered while developing and testing of the patch.
>

Thanks a lot, Kuntal, for sharing the different test scenarios.
Unfortunately, this patch hasn't received any review till now, so
there is no chance of it making it into PostgreSQL 10.  I have moved
it to the next CF.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-27 Thread Amit Kapila
On Sat, Mar 25, 2017 at 1:24 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Mar 24, 2017 at 11:49 PM, Pavan Deolasee
> <pavan.deola...@gmail.com> wrote:
>>
>> On Fri, Mar 24, 2017 at 6:46 PM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>>>
>
>> While looking at this problem, it occurred to me that the assumptions made
>> for hash indexes are also wrong :-( Hash index has the same problem as
>> expression indexes have. A change in heap value may not necessarily cause a
>> change in the hash key. If we don't detect that, we will end up having two
>> hash identical hash keys with the same TID pointer. This will cause the
>> duplicate key scans problem since hashrecheck will return true for both the
>> hash entries.

Isn't it possible to detect duplicate keys in hashrecheck if we
compare both the hashkey and the TID stored in the index tuple with
the corresponding values from the heap tuple?
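
Something as simple as the below is what I have in mind (purely a
sketch; the function and variable names are made up, and it assumes
the index tuple stores the hash value of the indexed column):

static bool
hash_entry_matches(uint32 index_hashkey, ItemPointer index_tid,
				   uint32 heap_hashkey, ItemPointer heap_root_tid)
{
	/* both the stored hash value and the TID must match */
	return index_hashkey == heap_hashkey &&
		ItemPointerEquals(index_tid, heap_root_tid);
}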

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-27 Thread Amit Kapila
On Mon, Mar 27, 2017 at 11:21 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Sun, Mar 26, 2017 at 11:26 AM, Mithun Cy <mithun...@enterprisedb.com> 
> wrote:
>> Thanks, Amit for the review.
>> On Sat, Mar 25, 2017 at 7:03 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>>
>>> I think one-dimensional patch has fewer places to touch, so that looks
>>> better to me.  However, I think there is still hard coding and
>>> assumptions in code which we should try to improve.
>>
>> Great!, I will continue with spares 1-dimensional improvement.
>>
>
> @@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double
> num_tuples, RegProcedure procid,\
> {
> ..
>   else
> - num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
> + num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
> ..
> ..
> - metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
> - metap->hashm_highmask = (num_buckets << 1) - 1;
> + metap->hashm_maxbucket = num_buckets - 1;
> +
> + /* set hishmask, which should be sufficient to cover num_buckets. */
> + metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
> + metap->hashm_lowmask = (metap->hashm_highmask >> 1);
> }
>
> I think we can't change the number of buckets to be created or lowmask
> and highmask calculation here without modifying _h_spoolinit() because
> it sorts the data to be inserted based on hashkey which in turn
> depends on the number of buckets that we are going to create during
> create index operation.  We either need to allow create index
> operation to still always create buckets in power-of-two fashion or we
> need to update _h_spoolinit according to new computation.  One minor
> drawback of using power-of-two scheme for creation of buckets during
> create index is that it can lead to wastage of space and will be
> inconsistent with what the patch does during split operation.
>

Few more comments:

1.
@@ -149,6 +149,84 @@ _hash_log2(uint32 num)
  return i;
 }

+#define SPLITPOINT_PHASES_PER_GRP 4
+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
+#define Buckets_First_Split_Group 4

The defines should be at the beginning of the file.

2.
+/*
+ * At splitpoint group 0 we have 2 ^ (0 + 2) = 4 buckets, then at splitpoint
+ * group 1 we have 2 ^ (1 + 2) = 8 total buckets. As the doubling continues at
+ * splitpoint group "x" we will have 2 ^ (x + 2) total buckets. Total buckets
+ * before x splitpoint group will be 2 ^ (x + 1). At each phase of allocation
+ * within splitpoint group we add 2 ^ (x - 1) buckets, as we have to divide the
+ * task of allocation of 2 ^ (x + 1) buckets among 4 phases.
+ *
+ * Also, splitpoint group of a given bucket can be taken as
+ * (_hash_log2(bucket) - 2). Subtracted by 2 because each group have
+ * 2 ^ (x + 2) buckets.
+ */
..

+#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
+ ((num_bucket <= Buckets_First_Split_Group) ? 0 : \
+ (_hash_log2(num_bucket) - 2))

In the above computation, the +2 and -2 still bother me.  I think you
need to do this because you have defined split group zero to have four
buckets; how about not forcing that, and instead defining split point
phases only from the split point that has four or more buckets?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-26 Thread Amit Kapila
On Sun, Mar 26, 2017 at 11:26 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> Thanks, Amit for the review.
> On Sat, Mar 25, 2017 at 7:03 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>
>> I think one-dimensional patch has fewer places to touch, so that looks
>> better to me.  However, I think there is still hard coding and
>> assumptions in code which we should try to improve.
>
> Great!, I will continue with spares 1-dimensional improvement.
>

@@ -563,18 +563,20 @@ _hash_init_metabuffer(Buffer buf, double
num_tuples, RegProcedure procid,\
{
..
  else
- num_buckets = ((uint32) 1) << _hash_log2((uint32) dnumbuckets);
+ num_buckets = _hash_get_totalbuckets(_hash_spareindex(dnumbuckets));
..
..
- metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
- metap->hashm_highmask = (num_buckets << 1) - 1;
+ metap->hashm_maxbucket = num_buckets - 1;
+
+ /* set hishmask, which should be sufficient to cover num_buckets. */
+ metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;
+ metap->hashm_lowmask = (metap->hashm_highmask >> 1);
}

I think we can't change the number of buckets to be created, or the
lowmask and highmask calculation here, without modifying
_h_spoolinit(), because it sorts the data to be inserted based on the
hashkey, which in turn depends on the number of buckets we are going
to create during the create-index operation.  We either need to allow
the create-index operation to still always create buckets in
power-of-two fashion, or we need to update _h_spoolinit according to
the new computation.  One minor drawback of using the power-of-two
scheme for creating buckets during create index is that it can lead to
wastage of space and will be inconsistent with what the patch does
during the split operation.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-25 Thread Amit Kapila
On Sat, Mar 25, 2017 at 11:24 PM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
> On Sat, 25 Mar 2017 at 11:03 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>>
>> On Sat, Mar 25, 2017 at 12:54 AM, Amit Kapila <amit.kapil...@gmail.com>
>> wrote:
>> > I am not sure how do you want to binary compare two datums, if you are
>> > thinking datumIsEqual(), that won't work.  I think you need to use
>> > datatype specific compare function something like what we do in
>> > _bt_compare().
>>
>> How will that interact with types like numeric, that have display
>> scale or similar?
>>
>  I wonder why Amit thinks that datumIsEqual won't work once we convert the
> heap values to index tuple and then fetch using index_get_attr. After all
> that's how the current index tuple was constructed when it was inserted.

I think for toasted values you need to detoast before comparison, and
it seems datumIsEqual won't do that job.  Am I missing something that
makes you think datumIsEqual will work in this context?

> In
> fact, we must not rely on _bt_compare because that might return "false
> positive" even for two different heap binary values  (I think).

I am not saying we should rely on _bt_compare; what I was trying to
hint at is that I think we might need to use some column-type-specific
information for comparison.  I am not sure at this stage what the best
way is to deal with this problem without incurring a non-trivial cost.
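
As a very rough illustration of what I mean by column-type-specific
information (not a concrete proposal; atttypid, attcollation and the
two datums are placeholders): look up the type's default equality
operator through the type cache and apply it, instead of a raw binary
comparison.

	TypeCacheEntry *typentry = lookup_type_cache(atttypid,
												 TYPECACHE_EQ_OPR_FINFO);
	bool		equal;

	equal = DatumGetBool(FunctionCall2Coll(&typentry->eq_opr_finfo,
										   attcollation,
										   heap_datum, index_datum));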

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-25 Thread Amit Kapila
On Sat, Mar 25, 2017 at 10:13 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> On Tue, Mar 21, 2017 at 8:16 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Sure, I was telling you based on that.  If you are implicitly treating
>> it as 2-dimensional array, it might be easier to compute the array
>>offsets.
>
> I think calculating spares offset will not become anyway much simpler
> we still need to calculate split group and split phase separately. I
> have added a patch to show how a 2-dimensional spares code looks like
> and where all we need changes.
>

I think the one-dimensional patch has fewer places to touch, so that
looks better to me.  However, I think there is still hard-coding and
assumptions in the code which we should try to improve.

1.
+ /*
+ * The first 4 bucket belongs to first splitpoint group 0. And since group
+ * 0 have 4 = 2^2 buckets, we double them in group 1. So total buckets
+ * after group 1 is 8 = 2^3. Then again at group 2 we add another 2^3
+ * buckets to double the total number of buckets, which will become 2^4. I
+ * think by this time we can see a pattern which say if num_bucket > 4
+ * splitpoint group = log2(num_bucket) - 2
+ */

+ if (num_bucket <= 4)
+ splitpoint_group = 0; /* converted to base 0. */
+ else
+ splitpoint_group = _hash_log2(num_bucket) - 2;

This patch defines split point group zero to have four buckets, and
the above calculation is based on that.  I feel you can define it like
#define Buckets_First_Split_Group 4 and then use it in the above code.
Also, in the else part the number 2 looks awkward; can we define it as
log2_buckets_first_group = _hash_log2(Buckets_First_Split_Group) or
something like that?  I think that way the code will look neat.  I
don't like the way the above comment is worded, even though it is
helpful for understanding the calculation.  If you want, you can add
such a comment in the file header; here it looks out of place.
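
To make that concrete, I am thinking of something like this (sketch
only):

#define Buckets_First_Split_Group	4

	if (num_bucket <= Buckets_First_Split_Group)
		splitpoint_group = 0;	/* converted to base 0 */
	else
		splitpoint_group = _hash_log2(num_bucket) -
			_hash_log2(Buckets_First_Split_Group);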

2.
+power-of-2 groups, called "split points" in the code.  That means on every new
+split points we double the existing number of buckets.

split points/split point

3.
+ * which phase of allocation the bucket_num belogs to with in the group.

/belogs/belongs


I have still not completely reviewed the patch, as I have run out of
time for today.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] crashes due to setting max_parallel_workers=0

2017-03-25 Thread Amit Kapila
On Sat, Mar 25, 2017 at 6:31 PM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> On 25 March 2017 at 23:09, Rushabh Lathia <rushabh.lat...@gmail.com> wrote:
>> Also another point which I think we should fix is, when someone set
>> max_parallel_workers = 0, we should also set the
>> max_parallel_workers_per_gather
>> to zero. So that way it we can avoid generating the gather path with
>> max_parallel_worker = 0.
>
> I see that it was actually quite useful that it works the way it does.
> If it had worked the same as max_parallel_workers_per_gather, then
> likely Tomas would never have found this bug.
>
> I wondered if there's anything we can do here to better test cases
> when no workers are able to try to ensure the parallel nodes work
> correctly, but the more I think about it, the more I don't see wrong
> with just using SET max_parallel_workers = 0;
>
> My vote would be to leave the GUC behaviour as is and add some tests
> to select_parallel.sql which exploit setting max_parallel_workers to 0
> for running some tests.
>

I think force_parallel_mode=regress should test the same thing.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.

2017-03-25 Thread Amit Kapila
On Sat, Mar 25, 2017 at 12:33 PM, Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
> On Sat, Mar 25, 2017 at 11:02 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Thu, Mar 23, 2017 at 11:24 PM, Ashutosh Sharma <ashu.coe...@gmail.com> 
>> wrote:
>>> Hi,
>>>
>>> On Tue, Feb 7, 2017 at 9:23 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>>> On Mon, Feb 6, 2017 at 10:40 PM, Amit Kapila <amit.kapil...@gmail.com> 
>>>> wrote:
>>>>>> Maybe we should call them "unused pages".
>>>>>
>>>>> +1.  If we consider some more names for that column then probably one
>>>>> alternative could be "empty pages".
>>>>
>>>> Yeah, but I think "unused" might be better.  Because a page could be
>>>> in use (as an overflow page or primary bucket page) and still be
>>>> empty.
>>>>
>>>
>>> Based on the earlier discussions, I have prepared a patch that would
>>> allow pgstathashindex() to show the number of unused pages in hash
>>> index. Please find the attached patch for the same. Thanks.
>>>
>>
>>   else if (opaque->hasho_flag & LH_BITMAP_PAGE)
>>   stats.bitmap_pages++;
>> + else if (PageIsEmpty(page))
>> + stats.unused_pages++;
>>
>> I think having this check after PageIsNew() makes more sense then
>> having at the place where you currently have,
>
> I don't think having it just below PageIsNew() check would be good.
> The reason being, there can be a bucket page which is in use but still
> be empty. So, if we add a check just below PageIsNew check then even
> though a page is marked as bucket page and is empty it will be shown
> as unused page which i feel is not correct. Here is one simple example
> that illustrates this point,
>

Oh, is it a page where all the items have been deleted and no new
items have been inserted?  The reason I said that place is not
appropriate is that all the other checks in the nearby code are on the
page type, and this check looks odd at that place; but we might need
to do it this way if there is no other clean solution.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

