Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-03-12 Thread Kouhei Kaigai
> Hello,
> 
> On Wed, March 1, 2017 7:21 pm, Kouhei Kaigai wrote:
> >> I've looked at the patch, and as I'm not that familiar with the
> >> pg-sourcecode, customs and so on, this isn't a review, more like food
> >> for thought and all should be taken with a grain of salt. :)
> >>
> >> So here are a few questions and remarks:
> >>
> >> >+ double  limit_tuples = -1.0;
> >>
> >> Surely the limit cannot be fractional, and must be an integer. So
> >> wouldn't it be better the same type as say:
> >>
> >> >+ if (root->limit_tuples >= 0.0 &&
> >>
> >> Than you could also compare with ">= 0", not ">= 0.0".
> >>
> > The above variable represents the "estimated" number of rows at the
> > planning stage, not execution time.
> > As you can see, the Path structure has a "rows" field declared as
> > double. It makes sense to be able to represent even absurd paths
> > during planning, even if they are eventually rejected. For example, a
> > cross join of two large tables appearing during planning would easily
> > overflow a 64-bit integer.
> 
> Hm, ok. Not related to your patch, just curious: Is there a mechanism in
> place that automatically rejects plans where the limit would overflow the
> double-to-uint64 conversion? Or is this more of a "there would hopefully
> be a plan with a better limit, so we do not use the bad one"?
> 
> Would it be possible to force a plan where such overflow would occur?
>
We have no such mechanism, and little necessity for one.
The estimated number of rows at plan time is stored in the plan_rows field
of the Plan structure, as FP64. Once the plan tree is constructed, the
estimated number of rows does not affect execution. (Some plans might use
it at execution time to estimate resource consumption.)
On the other hand, the actual number of rows processed is saved in the
instrument field of the PlanState structure. It is counted up from zero,
one row at a time, so people will want to replace the hardware long before
a uint64 overflow: at 1B rows per second, the counter would not wrap until
around the year 2600(!).

> >> And this comment might be better "were we already called?"
> >>
> >> >+ boolrs_started; /* are we already
> >> called? */
> >>
> > Other variables in ResultState use the present tense, like:
> >
> > +   boolrs_started; /* are we already called? */
> > boolrs_done;/* are we done? */
> > boolrs_checkqual;   /* do we need to check the qual? */
> >  } ResultState;
> 
> Yes, I noted that, but still "are" and "called" and "already" don't read
> well together for me:
> 
>   are - present form
>   called - past form like "were we called?", or "are we called bob?" an
> ongoing process
>   already - it has started
> 
> So "are we already called" reads like someone is waiting for being called.
> 
> Maybe to mirror the comment on "rs_done":
> 
>   /* have we started yet? */
> 
> Also, maybe it's easier for the comment to describe what is happening in
> the code because of the flag, not just to the flag itself:
> 
>   /* To do things once when we are called */
> 
> Anyway, it is a minor point and don't let me distract you from your work,
> I do like the feature and the patch :)
>
Fixed to "have we started yet?"


PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kai...@ak.jp.nec.com>



passdown-limit-fdw.v6.patch
Description: passdown-limit-fdw.v6.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-03-01 Thread Kouhei Kaigai
> Hello all,
> 
> as this is my first mail to pgsql-hackers, please be gentle :)
>
Welcome to pgsql-hackers,

> I've looked at the patch, and as I'm not that familiar with the pg-sourcecode,
> customs and so on, this isn't a review, more like food for thought and all
> should be taken with a grain of salt. :)
> 
> So here are a few questions and remarks:
> 
> >+double  limit_tuples = -1.0;
> 
> Surely the limit cannot be fractional, and must be an integer. So wouldn't
> it be better the same type as say:
> 
> >+if (root->limit_tuples >= 0.0 &&
> 
> Than you could also compare with ">= 0", not ">= 0.0".
>
The above variable represents the "estimated" number of rows at the
planning stage, not execution time.
As you can see, the Path structure has a "rows" field declared as
double. It makes sense to be able to represent even absurd paths
during planning, even if they are eventually rejected. For example, a
cross join of two large tables appearing during planning would easily
overflow a 64-bit integer.

> node->ss.ps.ps_numTuples is f.i. an uint64.
> 
> Or is there a specific reason the limit must be a double?
>
The above variable represents the "actual" number of rows at the
execution stage. In practice, the hardware replacement cycle will come
long before an int64 overflow.

> And finally:
> 
> >+if (node->ss.ps.ps_numTuples > 0)
> 
> >+appendStringInfo(, " LIMIT %ld",
> node->ss.ps.ps_numTuples);
> 
> vs.
> 
> >+appendStringInfo(, "%s LIMIT %lu",
> >+ sql,
> node->ss.ps.ps_numTuples);
> 
> It seems odd to have two different format strings here for the same variable.
>
Ah, yes, %lu is right because ps_numTuples is uint64.

> A few comments miss "." at the end, like these:
> 
> >+ * Also, pass down the required number of tuples
> 
> >+ * Pass down the number of required tuples by the upper node
> 
OK,

> And this comment might be better "were we already called?"
> 
> >+boolrs_started; /* are we already
> called? */
> 
Other variables in ResultState use the present tense, like:

+   boolrs_started; /* are we already called? */
boolrs_done;/* are we done? */
boolrs_checkqual;   /* do we need to check the qual? */
 } ResultState;

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


passdown-limit-fdw.v5.patch
Description: passdown-limit-fdw.v5.patch



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-02-28 Thread Kouhei Kaigai
The attached patch is a rebased version of the pass-down LIMIT clause
patch, which I forgot to register on the last CF.

It informs sub-plans of the required number of rows, beyond the node
types we have individually handled in pass_down_bound().
Its primary target is to control the number of remote tuples transferred
over the network connection by postgres_fdw.

Following the past discussion, it adds a new field, ps_numTuples, to the
PlanState to represent the required number of tuples.
A Limit node assigns a particular number to this field on its sub-plan;
the sub-plan then knows that its upper node does not require all tuples,
and can adjust its execution strategy.
Like MergeAppend, the sub-plan can also pass the bound down to its own
sub-plans, if doing so makes sense and works correctly.

This feature is potentially a basis for GPU-based sorting on top of
CustomScan, because GPU sorting is advantageous for workloads that pick
the top-N tuples when the data is small enough to load into GPU RAM.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kai...@ak.jp.nec.com>


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai
> Sent: Tuesday, January 03, 2017 12:07 PM
> To: Kaigai Kouhei(海外 浩平) <kai...@ak.jp.nec.com>
> Cc: Jeevan Chalke <jeevan.cha...@enterprisedb.com>; Robert Haas
> <robertmh...@gmail.com>; pgsql-hackers@postgresql.org; Etsuro Fujita
> <fujita.ets...@lab.ntt.co.jp>; Andres Freund <and...@anarazel.de>
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> [take-2]
> 
> Oops, I oversight this patch was marked as "returned with feedback", not
> "moved to the next CF".
> 
> Its status has not been changed since the last update. (Code was revised
> according to the last comment by Jeevan, but CF-Nov was time up at that
> time.)
> 
> How do I handle the patch?
> 
> 2016-12-05 16:49 GMT+09:00 Kouhei Kaigai <kai...@ak.jp.nec.com>:
> > Hello,
> >
> > Sorry for my late response.
> > The attached patch reflects your comments.
> >
> >> Here are few comments on latest patch:
> >>
> >>
> >> 1.
> >> make/make check is fine, however I am getting regression failure in
> >> postgres_fdw contrib module (attached regression.diff).
> >> Please investigate and fix.
> >>
> > It was an incorrect interaction when postgres_fdw tries to push down
> > sorting to the remote side. We cannot attach a LIMIT clause to the
> > plain scan path below a Sort; however, the previous version estimated
> > the cost of the plain scan with the LIMIT clause even when local
> > sorting is needed. If the remote scan may return just 10 rows, the
> > estimated cost of the local sort is very light, so this unreasonable
> > path was chosen.
> > (On the other hand, the query execution results were correct, because
> > ps_numTuples is not delivered across a Sort node, so the ForeignScan
> > eventually scanned all the remote tuples. The results were correct
> > but not optimal from a performance viewpoint.)
> >
> > The v3 patch estimates the cost with remote LIMIT clause only if
> > supplied pathkey strictly matches with the final output order of the
> > query, thus, no local sorting is expected.
> >
> > Some of the regression test cases still show different plans, but
> > this is due to the new optimization by the remote LIMIT clause.
> > Without the remote LIMIT clause, some regression test cases preferred
> > remote-JOIN + local-SORT followed by local-LIMIT.
> > Once we have the remote-LIMIT option, the planner can discount the
> > cost of the remote SORT by choosing top-k heap sorting.
> > This changed the optimizer's decision on some test cases.
> >
> > Potential one big change is the test case below.
> >
> >  -- CROSS JOIN, not pushed down
> >  EXPLAIN (VERBOSE, COSTS OFF)
> >  SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1,
> > t2.c1 OFFSET 100 LIMIT 10;
> >
> > It assumed the CROSS JOIN was not pushed down due to the cost of
> > network traffic; however, the remote LIMIT reduced the estimated
> > number of tuples to be transferred, so the whole CROSS JOIN + ORDER
> > BY + LIMIT now runs on the remote side.
> >
> >> 2.
> >> + *
> >> + * MEMO: root->limit_tuples is not attached when query
> >> contains
> >> + * grouping-clause or aggregate functions. So, we don's
> adjust
> >> + * rows even if LIMIT  is supplied.
> >>
> >> Can you please explain why you are not doing this for grouping-clause
> >> 

Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-19 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Monday, February 20, 2017 2:20 AM
> To: Kaigai Kouhei(海外 浩平) <kai...@ak.jp.nec.com>
> Cc: Claudio Freire <klaussfre...@gmail.com>; Amit Kapila
> <amit.kapil...@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Fri, Feb 17, 2017 at 12:46 PM, Kouhei Kaigai <kai...@ak.jp.nec.com>
> wrote:
> > The attached patch is a revised one.
> >
> > Invocation of Exec(Foreign|Custom)ParallelFinish was moved into
> > ExecParallelRetrieveInstrumentation(), so we do not walk the
> > plan-state tree twice.
> > One (hypothetical) downside is that FDW/CSP can retrieve its own
> > run-time statistics only when the query is executed under EXPLAIN
> > ANALYZE.
> >
> > This enhancement allows FDW/CSP to collect specific run-time
> > statistics beyond Instrumentation, then show them in EXPLAIN output.
> > Examples I expect are GPU kernel execution time, DMA transfer ratio,
> > and so on. These statistics will never appear in the Instrumentation
> > structure, yet they can be a performance-bottleneck hot spot when
> > CustomScan runs on background workers.
> 
> Would gather_shutdown_children_first.patch from
> https://www.postgresql.org/message-id/CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbn
> b8eb10c-6aywjdxb...@mail.gmail.com
> help with this problem also?  Suppose we did that, and then also added an
> ExecShutdownCustom method.  Then you'd definitely be able to get control
> before the DSM went away, either from ExecEndNode() or ExecShutdownNode().
> 
Ah, yes, I couldn't find any problem with the above approach.
ExecShutdownGather() can be called from either ExecShutdownNode() or
ExecEndGather(). This patch gives CSP/FDW an entry point prior to
release of the DSM.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-16 Thread Kouhei Kaigai
Hello,

The attached patch is a revised one.

Invocation of Exec(Foreign|Custom)ParallelFinish was moved into
ExecParallelRetrieveInstrumentation(), so we do not walk the
plan-state tree twice.
One (hypothetical) downside is that FDW/CSP can retrieve its own
run-time statistics only when the query is executed under EXPLAIN
ANALYZE.

This enhancement allows FDW/CSP to collect specific run-time
statistics beyond Instrumentation, then show them in EXPLAIN output.
Examples I expect are GPU kernel execution time, DMA transfer ratio,
and so on. These statistics will never appear in the Instrumentation
structure, yet they can be a performance-bottleneck hot spot when
CustomScan runs on background workers.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kai...@ak.jp.nec.com>


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Claudio Freire
> Sent: Monday, February 06, 2017 3:37 PM
> To: Kaigai Kouhei(海外 浩平) <kai...@ak.jp.nec.com>
> Cc: Amit Kapila <amit.kapil...@gmail.com>; Robert Haas
> <robertmh...@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Mon, Feb 6, 2017 at 1:42 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > I also considered adding extra space to the Instrumentation
> > structure; however, that would make Instrumentation a variable-length
> > structure depending on the custom format of each CSP/FDW, which is
> > likely not a good design.
> > As long as the extension can retrieve its custom statistics from the
> > DSM area it requested in ExecParallelEstimate(), I have no preference
> > on the hook location.
> 
> That's what I had in mind: the hook happens there, but the extension
> retrieves the information from some extension-specific DSM area, just as
> it would on the ParallelFinish hook.
> 
> > One thing to pay attention to is that some extension (not mine) may
> > want to collect worker statistics regardless of Instrumentation (in
> > other words, even if the plan is not under EXPLAIN ANALYZE).
> > It is the reason why I didn't put the hook under
> > ExecParallelRetrieveInstrumentation().
> 
> I don't think you should worry about that as long as it's a hypothetical
> case.
> 
> If/when some extension actually needs to do that, the design can be discussed
> with a real use case at hand, and not a hypothetical one.
> 
> 


parallel-finish-fdw_csp.v2.patch
Description: parallel-finish-fdw_csp.v2.patch



Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Claudio Freire
> Sent: Monday, February 06, 2017 1:05 PM
> To: Kaigai Kouhei(海外 浩平) <kai...@ak.jp.nec.com>
> Cc: Amit Kapila <amit.kapil...@gmail.com>; Robert Haas
> <robertmh...@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Mon, Feb 6, 2017 at 1:00 AM, Claudio Freire <klaussfre...@gmail.com>
> wrote:
> > On Sun, Feb 5, 2017 at 9:19 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> >>> If the use case for this is to gather instrumentation, I'd suggest
> >>> calling the hook in RetrieveInstrumentation, and calling it
> >>> appropriately. It would make the intended use far clearer than it is
> now.
> >>>
> >>> And if it saves some work, all the better.
> >>>
> >>> Until there's a use case for a non-instrumentation hook in that
> >>> place, I wouldn't add it. This level of generality sounds like a
> >>> solution waiting for a problem to solve.
> >>>
> >> The use cases I'd like to add are extension specific but significant
> >> for performance analytics. These statistics are not included in
> Instrumentation.
> >> For example, my problems are GPU execution time, data transfer ratio
> >> over DMA, synchronization time for GPU task completion, and so on.
> >> Only extension can know which attributes are collected during the
> >> execution, and its data format. I don't think Instrumentation fits these
> requirements.
> >> This is a problem I faced on the v9.6 based interface design, so I
> >> could notice it.
> >
> >
> > What RetrieveInstrumentation currently does may not cover the
> > extension's specific instrumentation, but what I'm suggesting is
> > adding a hook, like the one you're proposing for ParallelFinish, that
> > would collect extension-specific instrumentation. Couple that with
> > extension-specific explain output and you'll get your use cases
> > covered, I think.
> 
> 
> In essence, you added a ParallelFinish that is called after
> RetrieveInstrumentation. That's overreaching, for your use case.
> 
> If instrumentation is what you're collecting, it's far more correct, and
> more readable, to have a hook called from inside RetrieveInstrumentation
> that will retrieve extension-specific instrumentation.
> 
> RetrieveInstrumentation itself doesn't need to parse that information, that
> will be loaded into the custom scan nodes, and those are the ones that will
> parse it when generating explain output.
> 
> So, in essence it's almost what you did with ParallelFinish, more clearly
> aimed at collecting runtime statistics.
> 
I also considered adding extra space to the Instrumentation structure;
however, that would make Instrumentation a variable-length structure
depending on the custom format of each CSP/FDW, which is likely not a
good design.
As long as the extension can retrieve its custom statistics from the DSM
area it requested in ExecParallelEstimate(), I have no preference on the
hook location.

One thing to pay attention to is that some extension (not mine) may want
to collect worker statistics regardless of Instrumentation (in other
words, even if the plan is not under EXPLAIN ANALYZE).
It is the reason why I didn't put the hook under
ExecParallelRetrieveInstrumentation().

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-05 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Claudio Freire
> Sent: Saturday, February 04, 2017 8:47 AM
> To: Kaigai Kouhei(海外 浩平) <kai...@ak.jp.nec.com>
> Cc: Amit Kapila <amit.kapil...@gmail.com>; Robert Haas
> <robertmh...@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Mon, Oct 31, 2016 at 11:33 AM, Kouhei Kaigai <kai...@ak.jp.nec.com>
> wrote:
> > Hello,
> >
> > The attached patch implements the suggestion by Amit before.
> >
> > What I'm motivated is to collect extra run-time statistics specific to
> > a particular ForeignScan/CustomScan, not only the standard
> > Instrumentation; like DMA transfer rate or execution time of GPU
> > kernels in my case.
> >
> > Per-node DSM toc is one of the best way to return run-time statistics
> > to the master backend, because FDW/CSP can assign arbitrary length of
> > the region according to its needs. It is quite easy to require.
> > However, one problem is, the per-node DSM toc is already released when
> > ExecEndNode() is called on the child node of Gather.
> >
> > This patch allows extensions to get control on the master backend's
> > context when all the worker node gets finished but prior to release of
> > the DSM segment. If FDW/CSP has its special statistics on the segment,
> > it can move to the private memory area for EXPLAIN output or something
> > other purpose.
> >
> > One design consideration is whether the hook shall be called from
> > ExecParallelRetrieveInstrumentation() or ExecParallelFinish().
> > The former is a function to retrieve the standard Instrumentation
> > information, thus, it is valid only if EXPLAIN ANALYZE.
> > On the other hands, if we put entrypoint at ExecParallelFinish(),
> > extension can get control regardless of EXPLAIN ANALYZE, however, it
> > also needs an extra planstate_tree_walker().
> 
> If the use case for this is to gather instrumentation, I'd suggest calling
> the hook in RetrieveInstrumentation, and calling it appropriately. It would
> make the intended use far clearer than it is now.
> 
> And if it saves some work, all the better.
> 
> Until there's a use case for a non-instrumentation hook in that place, I
> wouldn't add it. This level of generality sounds like a solution waiting
> for a problem to solve.
>
The use cases I'd like to support are extension-specific but significant
for performance analysis. These statistics are not included in
Instrumentation.
For example, my concerns are GPU execution time, data transfer ratio over
DMA, synchronization time for GPU task completion, and so on.
Only the extension can know which attributes are collected during
execution, and in what data format. I don't think Instrumentation fits
these requirements.
This is a problem I faced in the v9.6-based interface design, which is
how I noticed it.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] T_Float morph to T_Integer after nodeRead

2017-01-05 Thread Kouhei Kaigai
> Kouhei Kaigai <kai...@ak.jp.nec.com> writes:
> > Simplified description of what I did is:
> >   fval = makeFloat(psprintf("%.0f", plan_nrows));
> >   custom_scan->custom_private = list_make1(fval);
> 
> So don't do that.  The lexer would never produce T_Float for an
> integer-looking string, so I think it's out of scope for nodeRead() to be
> able to reconstitute such a thing.  You could use %.1f, perhaps.
> But actually, if you're worried about reconstituting exactly what you had,
> aren't you risking precision loss anyway?  Maybe something like
> psprintf("%.*e", DBL_DIG+3, plan_nrows) would be safer.
>
Ah, indeed, it is a good idea to avoid the problem.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kai...@ak.jp.nec.com>





[HACKERS] T_Float morph to T_Integer after nodeRead

2017-01-05 Thread Kouhei Kaigai
I noticed a strange behavior when a T_Float value is serialized, then
deserialized on a worker process for CPU-parallel execution.

A simplified description of what I did is:
  fval = makeFloat(psprintf("%.0f", plan_nrows));
  custom_scan->custom_private = list_make1(fval);

This string expression contains no dot, so the Float value was written
out as if it were an integer value, like "654321".

nodeRead() calls nodeTokenType() to determine the token type.
It classifies a numeric token with no dot as an Integer value, even if it
was generated by makeFloat(). The worker process then references this
value using floatVal() and gets a SEGV.

A workaround is to never use the "%.0f" format for makeFloat().
That may be sufficient, because we have only a small number of
makeFloat() calls across the source tree. Or do we have any other
systematic way to prevent a numeric cstring without a dot?

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 






Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-01-04 Thread Kouhei Kaigai



> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, January 05, 2017 5:29 AM
> To: Kohei KaiGai 
> Cc: Kaigai Kouhei(海外 浩平) ; Jeevan Chalke
> ; pgsql-hackers@postgresql.org; Etsuro
> Fujita ; Andres Freund 
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> [take-2]
> 
> On Mon, Jan 2, 2017 at 10:07 PM, Kohei KaiGai  wrote:
> > Oops, I oversight this patch was marked as "returned with feedback",
> > not "moved to the next CF".
> >
> > Its status has not been changed since the last update. (Code was
> > revised according to the last comment by Jeevan, but CF-Nov was time
> > up at that time.)
> >
> > How do I handle the patch?
> 
> Can you just change the status to "Moved to Next CF" and then make it "Needs
> Review"?
> 
Unfortunately, that was not possible; administrator privileges are
probably needed for this operation.
May I add it to CF 2017-03?

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 




Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-12-04 Thread Kouhei Kaigai
Hello,

Sorry for my late response.
The attached patch reflects your comments.

> Here are few comments on latest patch:
> 
> 
> 1.
> make/make check is fine, however I am getting regression failure in
> postgres_fdw contrib module (attached regression.diff).
> Please investigate and fix.
>
It was an incorrect interaction when postgres_fdw tries to push down
sorting to the remote side. We cannot attach a LIMIT clause to the plain
scan path below a Sort; however, the previous version estimated the cost
of the plain scan with the LIMIT clause even when local sorting is
needed.
If the remote scan may return just 10 rows, the estimated cost of the
local sort is very light, so this unreasonable path was chosen.
(On the other hand, the query execution results were correct, because
ps_numTuples is not delivered across a Sort node, so the ForeignScan
eventually scanned all the remote tuples. The results were correct but
not optimal from a performance viewpoint.)

The v3 patch estimates the cost with the remote LIMIT clause only if the
supplied pathkey strictly matches the final output order of the query,
i.e., when no local sorting is expected.

Some of the regression test cases still show different plans, but this is
due to the new optimization by the remote LIMIT clause.
Without the remote LIMIT clause, some regression test cases preferred
remote-JOIN + local-SORT followed by local-LIMIT.
Once we have the remote-LIMIT option, the planner can discount the cost
of the remote SORT by choosing top-k heap sorting.
This changed the optimizer's decision on some test cases.

Potential one big change is the test case below.

 -- CROSS JOIN, not pushed down
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1, t2.c1 OFFSET 
100 LIMIT 10;

It assumed the CROSS JOIN was not pushed down due to the cost of network
traffic; however, the remote LIMIT reduced the estimated number of
tuples to be transferred, so the whole CROSS JOIN + ORDER BY + LIMIT now
runs on the remote side.

> 2.
> + *
> + * MEMO: root->limit_tuples is not attached when query
> contains
> + * grouping-clause or aggregate functions. So, we don's adjust
> + * rows even if LIMIT  is supplied.
> 
> Can you please explain why you are not doing this for grouping-clause or
> aggregate functions.
>
See grouping_planner() in optimizer/plan/planner.c.
It puts an invalid value in root->limit_tuples if the query has a GROUP
BY clause, so we cannot know the number of tuples to be returned even
when an upper limit actually exists.

/*
 * Figure out whether there's a hard limit on the number of rows that
 * query_planner's result subplan needs to return.  Even if we know a
 * hard limit overall, it doesn't apply if the query has any
 * grouping/aggregation operations, or SRFs in the tlist.
 */
if (parse->groupClause ||
parse->groupingSets ||
parse->distinctClause ||
parse->hasAggs ||
parse->hasWindowFuncs ||
parse->hasTargetSRFs ||
root->hasHavingQual)
root->limit_tuples = -1.0;
else
root->limit_tuples = limit_tuples;

> 3.
> Typo:
> 
> don's  => don't
>
Fixed,

best regards,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 



passdown-limit-fdw.v3.patch
Description: passdown-limit-fdw.v3.patch



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-21 Thread Kouhei Kaigai
Hello,

The attached patch is a revised version of pass-down LIMIT to FDW/CSP.

Below are the updates since the last version.

'ps_numTuples' of PlanState is now declared as uint64 instead of long,
to avoid problems on 32-bit machines when a large LIMIT clause is
supplied.

'ps_numTuples' is re-interpreted: 0 means that the upper node wants to
fetch all the tuples. This eliminates a tedious initialization in the
ExecInit handler of each executor node.

Even though it was not suggested, estimate_path_cost_size() of
postgres_fdw now adjusts the number of rows if the foreign path sits on
top of the base relations and the LIMIT clause takes a constant value.
It produces a more adequate plan, as follows:

* WITHOUT this patch

postgres=# explain verbose select * from t_a, t_b where t_a.id = t_b.id and 
t_a.x < t_b.x LIMIT 100;
   QUERY PLAN

 Limit  (cost=261.17..274.43 rows=100 width=88)
   Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
   ->  Hash Join  (cost=261.17..581.50 rows=2416 width=88)
 Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
 Hash Cond: (t_a.id = t_b.id)
 Join Filter: (t_a.x < t_b.x)
 ->  Foreign Scan on public.t_a  (cost=100.00..146.12 rows=1204 
width=44)
   Output: t_a.id, t_a.x, t_a.y
   Remote SQL: SELECT id, x, y FROM public.t
 ->  Hash  (cost=146.12..146.12 rows=1204 width=44)
   Output: t_b.id, t_b.x, t_b.y
   ->  Foreign Scan on public.t_b  (cost=100.00..146.12 rows=1204 
width=44)
 Output: t_b.id, t_b.x, t_b.y
 Remote SQL: SELECT id, x, y FROM public.t
(14 rows)

* WITH this patch
-
postgres=# explain verbose select * from t_a, t_b where t_a.id = t_b.id and 
t_a.x < t_b.x LIMIT 100;
  QUERY PLAN
--
 Limit  (cost=100.00..146.58 rows=100 width=88)
   Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
   ->  Foreign Scan  (cost=100.00..146.58 rows=100 width=88)
 Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
 Relations: (public.t_a) INNER JOIN (public.t_b)
 Remote SQL: SELECT r1.id, r1.x, r1.y, r2.id, r2.x, r2.y FROM (public.t 
r1 INNER JOIN public.t r2 ON (((r1.x < r2.x)) AND ((r1.id = r2.id
(6 rows)


On the other hand, I noticed it is not safe to attach the LIMIT clause
at the planner stage, because root->limit_tuples is declared as double.
Even if the LIMIT clause takes a constant value, that value may exceed
2^53, the largest range in which float64 represents every integer
exactly, while LIMIT accepts values up to 2^63-1.
So, postgres_fdw now attaches the LIMIT clause to the remote query at
execution time only.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei <kai...@ak.jp.nec.com>


> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Thursday, November 10, 2016 3:08 AM
> To: Kaigai Kouhei(海外 浩平) <kai...@ak.jp.nec.com>
> Cc: pgsql-hackers@postgresql.org; Jeevan Chalke
> <jeevan.cha...@enterprisedb.com>; Etsuro Fujita
> <fujita.ets...@lab.ntt.co.jp>; Andres Freund <and...@anarazel.de>
> Subject: ##freemail## Re: PassDownLimitBound for ForeignScan/CustomScan
> [take-2]
> 
> On Mon, Oct 31, 2016 at 10:20 AM, Kouhei Kaigai <kai...@ak.jp.nec.com>
> wrote:
> > As an example, I enhanced postgres_fdw to understand the ps_numTuples
> > if it is set. If and when remote ORDER BY is pushed down, the latest
> > code tries to sort the entire remote table because it does not know
> > how many rows to be returned. Thus, it took larger execution time.
> > On the other hands, the patched one runs the remote query with LIMIT
> > clause according to the ps_numTuples; which is informed by the Limit
> > node on top of the ForeignScan node.
> 
> So there are two cases here.  If the user says LIMIT 12, we could in theory
> know that at planner time and optimize accordingly.  If the user says LIMIT
> twelve(), however, we will need to wait until execution time unless twelve()
> happens to be capable of being simplified to a constant by the planner.
> 
> Therefore, it's possible to imagine having two mechanisms here. In the
> simple case where the LIMIT and OFFSET values are constants, we could
> implement a system to get hold of that information during planning and
> use it for whatever we like.   In addition, we can have an
> execution-time system that optimizes based on values available at execution
> (regardless of whether those values were also available during planning).
> Those are, basically, two separate things, and this patch has enough to
> do just focusing on one of them.

Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-09 Thread Kouhei Kaigai
> On Tue, Nov 8, 2016 at 6:54 AM, Jeevan Chalke
>  wrote:
> > 1. ps_numTuples is declared as long, however offset and count members in
> > LimitState struct and bound member in SortState struct is int64.  However
> > long on 32 bit machine may be 32 bits and thus I think tuples_needed which
> > is long may have overflow hazards as it may store int64 + int64.  I think
> > ps_numTuples should be int64.
> 
> I suggested long originally because that's what ExecutorRun() was
> using at the time.  It seems that it got changed to uint64 in
> 23a27b039d94ba359286694831eafe03cd970eef, so I guess we should
> probably use uint64.
> 
> > 2. Robert suggested following in the previous discussion:
> > "For example, suppose we add a new PlanState member "long
> > numTuples" where 0 means that the number of tuples that will be needed
> > is unknown (so that most node types need not initialize it), a
> > positive value is an upper bound on the number of tuples that will be
> > fetched, and -1 means that it is known for certain that we will need
> > all of the tuples."
> >
> > We should have 0 for the default case so that we don't need to initialize it
> > at most of the places.  But I see many such changes in the patch.  I think
> > this is not possible here since 0 can be a legal user provided value which
> > cannot be set as a default (default is all rows).
> >
> > However do you think, can we avoid that? Is there any other way so that we
> > don't need every node having ps_numTuples to be set explicitly?
> 
> +1.
>
I thought we had to distinguish the case where LIMIT 0 is supplied.
However, in that case ExecLimit() never descends into the child nodes,
so their ps_numTuples is never referenced anywhere.

OK, I'll use uint64 for ps_numTuples, with 0 as the usual default
value meaning that no specific number of rows is given.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, November 10, 2016 3:08 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org; Jeevan Chalke; Etsuro Fujita; Andres Freund
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]
> 
> On Mon, Oct 31, 2016 at 10:20 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > As an example, I enhanced postgres_fdw to understand the ps_numTuples
> > if it is set. If and when remote ORDER BY is pushed down, the latest
> > code tries to sort the entire remote table because it does not know
> > how many rows to be returned. Thus, it took larger execution time.
> > On the other hands, the patched one runs the remote query with LIMIT
> > clause according to the ps_numTuples; which is informed by the Limit
> > node on top of the ForeignScan node.
> 
> So there are two cases here.  If the user says LIMIT 12, we could in
> theory know that at planner time and optimize accordingly.  If the
> user says LIMIT twelve(), however, we will need to wait until
> execution time unless twelve() happens to be capable of being
> simplified to a constant by the planner.
> 
> Therefore, it's possible to imagine having two mechanisms here. In the
> simple case where the LIMIT and OFFSET values are constants, we could
> implement a system to get hold of that information during planning and
> use it for whatever we like.   In addition, we can have an
> execution-time system that optimizes based on values available at
> execution (regardless of whether those values were also available
> during planning).  Those are, basically, two separate things, and this
> patch has enough to do just focusing on one of them.
>
OK, postgres_fdw needs a private flag to indicate whether LIMIT and
OFFSET were supplied at the planner stage. If so, they have to be
matched against the ps_numTuples informed at the executor stage.

I'll revise the patch.
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Proposal: scan key push down to heap [WIP]

2016-11-01 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dilip Kumar
> Sent: Saturday, October 29, 2016 3:48 PM
> To: Andres Freund
> Cc: Tom Lane; Alvaro Herrera; pgsql-hackers
> Subject: Re: [HACKERS] Proposal: scan key push down to heap [WIP]
> 
> On Wed, Oct 26, 2016 at 12:01 PM, Andres Freund  wrote:
> > The gains are quite noticeable in some cases. So if we can make it work
> > without noticeable downsides...
> >
> > What I'm worried about though is that this, afaics, will quite
> > noticeably *increase* total cost in cases with a noticeable number of
> > columns and a not that selective qual. The reason for that being that
> > HeapKeyTest() uses heap_getattr(), whereas upper layers use
> > slot_getattr(). The latter "caches" repeated deforms, the former
> > doesn't... That'll lead to deforming being essentially done twice, and
> > it's quite often already a major cost of query processing.
> 
> What about putting slot reference inside HeapScanDesc ?. I know it
> will make ,heap layer use executor structure but just a thought.
> 
> I have quickly hacked this way where we use slot reference in
> HeapScanDesc and directly use
>  slot_getattr inside HeapKeyTest (only if we have valid slot otherwise
> use _heap_getattr) and measure the worst case performance (what you
> have mentioned above.)
> 
> My Test: (21 column table with varchar in beginning + qual is on last
> few column + varying selectivity )
> 
> postgres=# \d test
>   Table "public.test"
>  Column |   Type| Modifiers
> +---+---
>  f1 | integer   |
>  f2 | character varying |
>  f3 | integer   |
>  f4 | integer   |
>  f5 | integer   |
>  f6 | integer   |
>  f7 | integer   |
>  f8 | integer   |
>  f9 | integer   |
>  f10| integer   |
>  f11| integer   |
>  f12| integer   |
>  f13| integer   |
>  f14| integer   |
>  f15| integer   |
>  f16| integer   |
>  f17| integer   |
>  f18| integer   |
>  f19| integer   |
>  f20| integer   |
>  f21| integer   |
> 
> tuple count : 10000000 (10 Million)
> explain analyze select * from test where f21< $1 and f20 < $1 and f19
> < $1 and f15 < $1 and f10 < $1; ($1 varies from 1 Million to 10 Million).
> 
> Target code base:
> ---
> 1. Head
> 2. Heap_scankey_pushdown_v1
> 3. My hack for keeping slot reference in HeapScanDesc
> (v1+use_slot_in_HeapKeyTest)
> 
> Result:
> Selectivity Head   scan_key_pushdown_v1 v1+use_slot_in_HeapKeyTest
> 0.1 3880  2980 2747
> 0.2 4041  3187 2914
> 0.5 5051  4921 3626
> 0.8 5378  7296 3879
> 1.0 6161  8525 4575
> 
> A performance graph is attached to the mail.
> 
> Observation:
> 
> 1. Heap_scankey_pushdown_v1, start degrading after very high
> selectivity (this behaviour is only visible if table have 20 or more
> columns, I tested with 10 columns but with that I did not see any
> regression in v1).
> 
> 2. (v1+use_slot_in_HeapKeyTest) is always winner, even at very high 
> selectivity.
> 
Prior to this interface change, a starting point may be to restrict
scan-key pushdown to cases where the OpExpr references a column with a
static attcacheoff. Such a column can be fetched without walking the
tuple from the head, so tuple-deforming cost is not a downside.

By the way, I'm a bit skeptical about whether this enhancement is
really worth the effort, because we can now easily increase the number
of processor cores to run a seq-scan with a qualifier, especially when
it has high selectivity.
What are your thoughts?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 



ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2016-10-31 Thread Kouhei Kaigai
Hello,

The attached patch implements the suggestion by Amit before.

My motivation is to collect extra run-time statistics specific to a
particular ForeignScan/CustomScan, beyond the standard Instrumentation;
in my case, DMA transfer rate or execution time of GPU kernels.

The per-node DSM toc is one of the best ways to return run-time
statistics to the master backend, because FDW/CSP can assign a region
of arbitrary length according to its needs, and it is quite easy to
request.
However, one problem is, the per-node DSM toc is already released when
ExecEndNode() is called on the child node of Gather.

This patch allows extensions to get control in the master backend's
context after all the worker nodes have finished but before the DSM
segment is released. If FDW/CSP keeps its own statistics in the
segment, it can move them to private memory for EXPLAIN output or
other purposes.

One design consideration is whether the hook should be called from
ExecParallelRetrieveInstrumentation() or ExecParallelFinish().
The former retrieves the standard Instrumentation information, so it
is reached only under EXPLAIN ANALYZE.
On the other hand, if we put the entrypoint in ExecParallelFinish(),
the extension gets control regardless of EXPLAIN ANALYZE, but it also
needs an extra planstate_tree_walker().

Right now, we assume nothing about the requirements of FDW/CSP.
It may want run-time statistics regardless of EXPLAIN ANALYZE, so the
hook is always invoked once the Gather node has confirmed termination
of the worker processes.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


> -Original Message-
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> Sent: Monday, October 17, 2016 11:22 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers
> Subject: ##freemail## Re: [HACKERS] Steps inside ExecEndGather
> 
> On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > Hello,
> >
> > I'm now trying to carry extra performance statistics on CustomScan
> > (like DMA transfer rate, execution time of GPU kernels, etc...)
> > from parallel workers to the leader process using the DSM segment
> > attached by the parallel-context.
> > We can require an arbitrary length of DSM using ExecCustomScanEstimate
> > hook by extension, then it looks leader/worker can share the DSM area.
> > However, we have a problem on this design.
> >
> > Below is the implementation of ExecEndGather().
> >
> >   void
> >   ExecEndGather(GatherState *node)
> >   {
> >       ExecShutdownGather(node);
> >       ExecFreeExprContext(&node->ps);
> >       ExecClearTuple(node->ps.ps_ResultTupleSlot);
> >       ExecEndNode(outerPlanState(node));
> >   }
> >
> > It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> > The DSM segment shall be released on this call, so child node cannot
> > reference the DSM at the time of ExecEndNode().
> >
> 
> Before releasing DSM, we do collect all the statistics or
> instrumentation information of each node.  Refer
> ExecParallelFinish()->ExecParallelRetrieveInstrumentation(), so I am
> wondering why can't you collect the additional information in the same
> way?
> 
> 
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com


parallel-finish-fdw_csp.v1.patch
Description: parallel-finish-fdw_csp.v1.patch



[HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-10-31 Thread Kouhei Kaigai
Hello,

The attached patch is revised version of the pass-down-bounds feature.
Its functionality is not changed from the previous version, however,
implementation was revised according to the discussion at the last CF.

This patch adds a new field (ps_numTuples) to PlanState. This is a
hint for optimization when the parent node needs only the first N
tuples. It shall be set by the parent node prior to ExecProcNode(),
after ExecInitNode() or ExecReScan(); child nodes can then adjust
their execution behavior (e.g., Sort will use top-N heapsort if
ps_numTuples is set) and pass the hint further down to their own
child nodes.

As an example, I enhanced postgres_fdw to understand ps_numTuples if
it is set. When remote ORDER BY is pushed down, the current code sorts
the entire remote table because it does not know how many rows will be
returned, so execution takes longer. The patched version, on the other
hand, runs the remote query with a LIMIT clause according to
ps_numTuples, which is informed by the Limit node on top of the
ForeignScan node.

* without patch
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
 QUERY PLAN

 Limit  (cost=100.00..100.43 rows=10 width=52) (actual time=2332.548..2332.550 
rows=10 loops=1)
   Output: id, x, y, z
   ->  Foreign Scan on public.ft  (cost=100.00..146.46 rows=1077 width=52) 
(actual time=2332.547..2332.548 rows=10 loops=1)
 Output: id, x, y, z
 Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS 
LAST, y ASC NULLS LAST
 Planning time: 0.177 ms
 Execution time: 2445.590 ms
(7 rows)

* with patch
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
QUERY PLAN
--
 Limit  (cost=100.00..100.43 rows=10 width=52) (actual time=579.469..579.471 
rows=10 loops=1)
   Output: id, x, y, z
   ->  Foreign Scan on public.ft  (cost=100.00..146.46 rows=1077 width=52) 
(actual time=579.468..579.469 rows=10 loops=1)
 Output: id, x, y, z
 Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS 
LAST, y ASC NULLS LAST
 Planning time: 0.123 ms
 Execution time: 579.858 ms
(7 rows)


Right now, I have a few concerns about this patch.
1. Because the LIMIT clause can take an expression, not only a
   constant value, we cannot determine the value of ps_numTuples until
   execution time. So it is not possible to adjust the remote query at
   planning time, and EXPLAIN does not show the exact remote query even
   if a LIMIT clause is actually attached.

2. Where is the best location for the interface contract to set the
   ps_numTuples field? It has to be set prior to the first
   ExecProcNode(), after ExecInitNode() or ExecReScan().

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>

> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Friday, September 16, 2016 12:39 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jeevan Chalke; pgsql-hackers@postgresql.org; Etsuro Fujita; Andres Freund
> Subject: ##freemail## Re: [HACKERS] PassDownLimitBound for 
> ForeignScan/CustomScan
> 
> On Tue, Sep 13, 2016 at 9:07 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > In the current implementation calls recompute_limits() on the first
> > invocation of ExecLimit and ExecReScanLimit. Do we expect the
> > ps->numTuples will be also passed down to the child nodes on the same
> > timing?
> 
> Sure, unless we find some reason why that's not good.
> 
> > I also think this new executor contract shall be considered as a hint
> > (but not a requirement) for the child nodes, because it allows the
> > parent nodes to re-distribute the upper limit regardless of the type
> > of the child nodes as long as the parent node can work correctly and
> > has benefit even if the child node returns a part of tuples. It makes
> > the decision whether the upper limit should be passed down much simple.
> > The child node "can" ignore the hint but can utilize for more optimization.
> 
> +1.
> 
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


passdown-limit-fdw.v1.patch
Description: passdown-limit-fdw.v1.patch



Re: [HACKERS] Steps inside ExecEndGather

2016-10-16 Thread Kouhei Kaigai
> On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > Hello,
> >
> > I'm now trying to carry extra performance statistics on CustomScan
> > (like DMA transfer rate, execution time of GPU kernels, etc...)
> > from parallel workers to the leader process using the DSM segment
> > attached by the parallel-context.
> > We can require an arbitrary length of DSM using ExecCustomScanEstimate
> > hook by extension, then it looks leader/worker can share the DSM area.
> > However, we have a problem on this design.
> >
> > Below is the implementation of ExecEndGather().
> >
> >   void
> >   ExecEndGather(GatherState *node)
> >   {
> >       ExecShutdownGather(node);
> >       ExecFreeExprContext(&node->ps);
> >       ExecClearTuple(node->ps.ps_ResultTupleSlot);
> >       ExecEndNode(outerPlanState(node));
> >   }
> >
> > It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> > The DSM segment shall be released on this call, so child node cannot
> > reference the DSM at the time of ExecEndNode().
> >
> 
> Before releasing DSM, we do collect all the statistics or
> instrumentation information of each node.  Refer
> ExecParallelFinish()->ExecParallelRetrieveInstrumentation(), so I am
> wondering why can't you collect the additional information in the same
> way?
>
Thanks for the suggestion.
Hmm. Indeed, that is the more straightforward way to do it, although a
new hook is needed for CSP/FDW.

What I want to collect are: DMA transfer rate between RAM<->GPU,
execution time of GPU kernels, and so on. These are obviously outside
the standard Instrumentation structure, so only the CSP/FDW knows
their size and format.

If we had a callback invoked just before planstate_tree_walker() when
the planstate is either CustomScanState or ForeignScanState, it looks
to me like the problem could be solved very cleanly.

Best regards,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



Re: [HACKERS] Steps inside ExecEndGather

2016-10-16 Thread Kouhei Kaigai
> I'm now trying to carry extra performance statistics on CustomScan
> (like DMA transfer rate, execution time of GPU kernels, etc...)
> from parallel workers to the leader process using the DSM segment
> attached by the parallel-context.
> We can require an arbitrary length of DSM using ExecCustomScanEstimate
> hook by extension, then it looks leader/worker can share the DSM area.
> However, we have a problem on this design.
> 
> Below is the implementation of ExecEndGather().
> 
>   void
>   ExecEndGather(GatherState *node)
>   {
>       ExecShutdownGather(node);
>       ExecFreeExprContext(&node->ps);
>       ExecClearTuple(node->ps.ps_ResultTupleSlot);
>       ExecEndNode(outerPlanState(node));
>   }
> 
> It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> The DSM segment shall be released on this call, so child node cannot
> reference the DSM at the time of ExecEndNode().
> 
> Is there some technical reason why parallel context needs to be released
> prior to ExecEndNode() of the child nodes? Or, just convention of coding?
> 
> I think I'm not an only person who wants to use DSM of CustomScan to write
> back something extra status of parallel workers.
> How about an idea to move ExecShutdownGather() after the ExecEndNode()?
> 
> To avoid this problem, right now, I allocate an another DSM then inform
> its handle to the parallel workers. This segment can be survived until
> ExecEndCustomScan(), but not best effective way, of course.
>
My analysis was not correct.

ExecShutdownNode() in ExecutePlan() is the primary point where
ExecShutdownGather() is called, so the parallel context does not
survive to ExecEndPlan() regardless of the implementation of
ExecEndGather().

Hmm, what is the best way to proceed? Or is this a complete abuse of
the DSM that is set up by the parallel context?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 




[HACKERS] Steps inside ExecEndGather

2016-10-16 Thread Kouhei Kaigai
Hello,

I'm now trying to carry extra performance statistics on CustomScan
(like DMA transfer rate, execution time of GPU kernels, etc...)
from parallel workers to the leader process using the DSM segment
attached by the parallel-context.
We can require an arbitrary length of DSM using ExecCustomScanEstimate
hook by extension, then it looks leader/worker can share the DSM area.
However, we have a problem on this design.

Below is the implementation of ExecEndGather().

  void
  ExecEndGather(GatherState *node)
  {
      ExecShutdownGather(node);
      ExecFreeExprContext(&node->ps);
      ExecClearTuple(node->ps.ps_ResultTupleSlot);
      ExecEndNode(outerPlanState(node));
  }

It calls ExecShutdownGather() prior to the recursive call of
ExecEndNode(). The DSM segment is released by this call, so child
nodes cannot reference the DSM at the time of ExecEndNode().

Is there some technical reason why the parallel context needs to be
released prior to ExecEndNode() of the child nodes? Or is it just a
coding convention?

I don't think I'm the only person who wants to use the DSM of a
CustomScan to write back some extra status from parallel workers.
How about the idea of moving ExecShutdownGather() after ExecEndNode()?

To avoid this problem, right now, I allocate another DSM segment and
inform the parallel workers of its handle. This segment survives until
ExecEndCustomScan(), but of course that is not the most effective way.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] palloc() too large on pg_buffercache with large shared_buffers

2016-09-14 Thread Kouhei Kaigai
> On Wed, Sep 14, 2016 at 12:13 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > It looks to me pg_buffercache tries to allocate more than 1GB using
> > palloc(), when shared_buffers is more than 256GB.
> >
> > # show shared_buffers ;
> >  shared_buffers
> > 
> >  280GB
> > (1 row)
> >
> > # SELECT buffers, d.datname, coalesce(c.relname, '???')
> > FROM (SELECT count(*) buffers, reldatabase, relfilenode
> > FROM pg_buffercache group by reldatabase, relfilenode) b
> >LEFT JOIN pg_database d ON d.oid = b.reldatabase
> >LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database
> >  WHERE datname = current_database())
> >AND b.relfilenode = pg_relation_filenode(c.oid)
> >ORDER BY buffers desc;
> > ERROR:  invalid memory alloc request size 1174405120
> >
> > It is a situation to use MemoryContextAllocHuge(), instead of palloc().
> > Also, it may need a back patching?
> 
> I guess so.  Although it's not very desirable for it to use that much
> memory, I suppose if you have a terabyte of shared_buffers you
> probably have 4GB of memory on top of that to show what they contain.
>
Exactly. I found this problem when someone asked me why
shared_buffers=280GB is slower than shared_buffers=128MB for scanning
a 350GB table.
As I expected, most of the shared buffers were not in use, and they
also reduced the amount of free memory usable for the page cache.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



[HACKERS] palloc() too large on pg_buffercache with large shared_buffers

2016-09-13 Thread Kouhei Kaigai
Hello,

It looks to me like pg_buffercache tries to allocate more than 1GB
using palloc() when shared_buffers is larger than 256GB.

# show shared_buffers ;
 shared_buffers

 280GB
(1 row)

# SELECT buffers, d.datname, coalesce(c.relname, '???')
FROM (SELECT count(*) buffers, reldatabase, relfilenode
FROM pg_buffercache group by reldatabase, relfilenode) b
   LEFT JOIN pg_database d ON d.oid = b.reldatabase
   LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database
 WHERE datname = current_database())
   AND b.relfilenode = pg_relation_filenode(c.oid)
   ORDER BY buffers desc;
ERROR:  invalid memory alloc request size 1174405120

This is a situation that calls for MemoryContextAllocHuge() instead of
palloc(). Also, does it need back-patching?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




pgsql-fix-pg_buffercache-palloc-huge.patch
Description: pgsql-fix-pg_buffercache-palloc-huge.patch



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-13 Thread Kouhei Kaigai
> On Tue, Sep 13, 2016 at 3:48 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > Sorry for my late.
> >
> > The attached patch fixed the wording problems on SGML part.
> 
> I agree that we should have some way for foreign data wrappers and
> custom scans and perhaps also other executor nodes to find out whether
> there's a known limit to the number of tuples that they might need to
> produce, but I wonder if we should be doing something more general
> than this.  For example, suppose we add a new PlanState member "long
> numTuples" where 0 means that the number of tuples that will be needed
> is unknown (so that most node types need not initialize it), a
> positive value is an upper bound on the number of tuples that will be
> fetched, and -1 means that it is known for certain that we will need
> all of the tuples.  This might be relevant to the executor batching
> stuff that Andres has been working on, because you could for example
> set ps->numTuples == 1 on the inner side of a semi-join, warning the
> executor node that it shouldn't bother trying to batch anything.
>
I also think the generic approach is a preferable direction.

The current implementation calls recompute_limits() on the first
invocation of ExecLimit and ExecReScanLimit. Do we expect that
ps->numTuples will also be passed down to the child nodes at the same
timing?
I also think this new executor contract should be considered a hint
(not a requirement) for the child nodes, because it allows the parent
nodes to redistribute the upper limit regardless of the type of the
child nodes, as long as the parent node works correctly and benefits
even if the child node returns only part of the tuples. That makes the
decision of whether the upper limit should be passed down much simpler.
The child node "can" ignore the hint, but can also utilize it for
further optimization.

> On a more practical level, I notice that you haven't adapted
> postgres_fdw or file_fdw to benefit from this new callback.  It seems
> like postgres_fdw could benefit, because it could fetch only the
> required number of tuples if that happens to be a smaller number than
> the configured fetch_size.
>
That is just because of time pressure around the patch submission
days. I'll try to enhance postgres_fdw as a user of this run-time
optimization.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-13 Thread Kouhei Kaigai
Sorry for my late reply.

The attached patch fixed the wording problems on SGML part.

Best regards,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

> -Original Message-
> From: Jeevan Chalke [mailto:jeevan.cha...@enterprisedb.com]
> Sent: Tuesday, September 06, 2016 11:22 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org; Etsuro Fujita
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> 
> Hi,
> 
> Changes look good to me.
> 
> However there are couple of minor issues need to be fixed.
> 
> 1.
> "under" repeated on second line. Please remove.
> +if and when CustomScanState is located under
> +under LimitState; which implies the underlying node is not
> 
> 2.
> Typo: dicsussion => discussion
> Please fix.
> 
> Apart from this I see no issues.
> 
> 
> Thanks
> 
> 
> --
> 
> Jeevan B Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
> 



pgsql-v10-fdw-css-limit-bound.v3.patch
Description: pgsql-v10-fdw-css-limit-bound.v3.patch



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-05 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Monday, September 05, 2016 12:58 PM
> To: Jeevan Chalke
> Cc: pgsql-hackers@postgresql.org; Etsuro Fujita
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> 
> > On Mon, Aug 29, 2016 at 7:25 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> >
> >
> > Hello,
> >
> > The attached patch adds an optional callback to support special 
> > optimization
> > if ForeignScan/CustomScan are located under the Limit node in plan-tree.
> >
> > Our sort node wisely switches the behavior when we can preliminary know
> > exact number of rows to be produced, because all the Sort node has to
> > return is the top-k rows when it is located under the Limit node.
> > It is much lightweight workloads than sorting of entire input rows when
> > nrows is not small.
> >
> > In my case, this information is very useful because GPU can complete its
> > sorting operations mostly on L1-grade memory if we can preliminary know
> > the top-k value is enough small and fits to size of the fast memory.
> >
> > Probably, it is also valuable for Fujita-san's case because this 
> > information
> > allows to attach "LIMIT k" clause on the remote query of postgres_fdw.
> > It will reduce amount of the network traffic and remote CPU consumption
> > once we got support of sort pushdown.
> >
> >
> >
> > One thing we need to pay attention is cost estimation on the planner 
> > stage.
> > In the existing code, only create_ordered_paths() and
> > create_merge_append_path()
> > considers the limit clause for cost estimation of sorting. They use the
> > 'limit_tuples' of PlannerInfo; we can reference the structure when 
> > extension
> > adds ForeignPath/CustomPath, so I think we don't need a special 
> > enhancement
> > on the planner stage.
> >
> Thanks for your comments.
> 
> > I believe this hook is gets called at execution time.
> > So to push LIMIT clause like you said above we should use "limit_tuples" at 
> > the time
> > of planning and then use this hook to optimize at runtime, right?
> >
> Yes. For more correctness, a valid "limit_tuples" of PlannerInfo is set only 
> when
> LIMIT clause takes constant values; it is true for most of use case.
> Then, the hook I added shall be called at execution time for more exact 
> optimization.
> 
> If FDW/CSP cannot accept uncertain number of rows to generate on planning 
> time,
> it is not a duty to provide its own path which is optimized for small number 
> of
> LIMIT clause.
> 
> > Apart from that, attached patch applies cleanly on latest sources and found 
> > no issues
> > with make or with regressions.
> >
> > However this patch is an infrastructure for any possible optimization when
> > foreign/customscan is under LIMIT.
> >
> > So look good to me.
> >
> > I quickly tried adding a hook support in postgres_fdw, and it gets called 
> > correctly
> > when we have foreignscan with LIMIT (limit being evaluated on local server).
> >
> > So code wise no issue. Also add this hook details in documentation.
> >
> OK, I'll try to write up some detailed documentation stuff; not only API 
> specification.
>
The v2 patch is attached. It describes the role of this hook and how an
extension can utilize the LIMIT clause for further optimization at planning
and execution time.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


pgsql-v10-fdw-css-limit-bound.v2.patch
Description: pgsql-v10-fdw-css-limit-bound.v2.patch



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-04 Thread Kouhei Kaigai
> On Mon, Aug 29, 2016 at 7:25 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> 
> 
>   Hello,
> 
>   The attached patch adds an optional callback to support special 
> optimization
>   if ForeignScan/CustomScan are located under the Limit node in plan-tree.
> 
>   Our sort node wisely switches the behavior when we can preliminary know
>   exact number of rows to be produced, because all the Sort node has to
>   return is the top-k rows when it is located under the Limit node.
>   It is much lightweight workloads than sorting of entire input rows when
>   nrows is not small.
> 
>   In my case, this information is very useful because GPU can complete its
>   sorting operations mostly on L1-grade memory if we can preliminary know
>   the top-k value is enough small and fits to size of the fast memory.
> 
>   Probably, it is also valuable for Fujita-san's case because this 
> information
>   allows to attach "LIMIT k" clause on the remote query of postgres_fdw.
>   It will reduce amount of the network traffic and remote CPU consumption
>   once we got support of sort pushdown.
> 
> 
> 
>   One thing we need to pay attention is cost estimation on the planner 
> stage.
>   In the existing code, only create_ordered_paths() and
> create_merge_append_path()
>   considers the limit clause for cost estimation of sorting. They use the
>   'limit_tuples' of PlannerInfo; we can reference the structure when 
> extension
>   adds ForeignPath/CustomPath, so I think we don't need a special 
> enhancement
>   on the planner stage.
>
Thanks for your comments.

> I believe this hook is gets called at execution time.
> So to push LIMIT clause like you said above we should use "limit_tuples" at 
> the time
> of planning and then use this hook to optimize at runtime, right?
>
Yes. To be more precise, a valid "limit_tuples" in PlannerInfo is set only
when the LIMIT clause takes constant values, which is true for most use cases.
The hook I added is then called at execution time for more exact optimization.

If an FDW/CSP cannot accept an uncertain number of rows at planning time,
it is under no obligation to provide its own path optimized for a small
LIMIT bound.

> Apart from that, attached patch applies cleanly on latest sources and found 
> no issues
> with make or with regressions.
> 
> However this patch is an infrastructure for any possible optimization when
> foreign/customscan is under LIMIT.
> 
> So look good to me.
> 
> I quickly tried adding a hook support in postgres_fdw, and it gets called 
> correctly
> when we have foreignscan with LIMIT (limit being evaluated on local server).
> 
> So code wise no issue. Also add this hook details in documentation.
>
OK, I'll try to write up some more detailed documentation, not just the API
specification.

Best regards,

> 
> Thanks
> 
> 
> 
>   Thanks,
>   --
>   NEC Business Creation Division / PG-Strom Project
>   KaiGai Kohei <kai...@ak.jp.nec.com>
> 
> 
> 
>   --
>   Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>   To make changes to your subscription:
>   http://www.postgresql.org/mailpref/pgsql-hackers
> <http://www.postgresql.org/mailpref/pgsql-hackers>
> 
> 
> 
> 
> 
> 
> --
> 
> Jeevan B Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
> 




[HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-08-28 Thread Kouhei Kaigai
Hello,

The attached patch adds an optional callback to support special optimization
if ForeignScan/CustomScan are located under the Limit node in plan-tree.

Our Sort node wisely switches its behavior when the exact number of rows to
be produced is known in advance, because all the Sort node has to return is
the top-k rows when it is located under a Limit node.
That is a much lighter workload than sorting the entire input when nrows is
not small.

In my case, this information is very useful because the GPU can complete its
sorting operations mostly in L1-grade memory, if we know in advance that the
top-k bound is small enough to fit in that fast memory.

It is probably also valuable for Fujita-san's case, because this information
allows postgres_fdw to attach a "LIMIT k" clause to the remote query.
That will reduce network traffic and remote CPU consumption once we have
support for sort pushdown.

One thing we need to pay attention to is cost estimation at the planner stage.
In the existing code, only create_ordered_paths() and create_merge_append_path()
consider the LIMIT clause for the cost estimation of sorting. They use the
'limit_tuples' field of PlannerInfo; an extension can reference that structure
when it adds a ForeignPath/CustomPath, so I think we don't need a special
enhancement at the planner stage.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v10-fdw-css-limit-bound.v1.patch
Description: pgsql-v10-fdw-css-limit-bound.v1.patch



[HACKERS] comment fix for CUSTOMPATH_* flags

2016-08-28 Thread Kouhei Kaigai
Hello,

I noticed that the source code comment around the CustomPath structure says
"see above" for the definition of the CUSTOMPATH_* flags. That was originally
correct, but the definition was later moved to nodes/extensible.h, so there is
no longer anything above.
The attached patch corrects the comment to point to the right location.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




pgsql-v9.6-custom-flags-comments-fixup.patch
Description: pgsql-v9.6-custom-flags-comments-fixup.patch



Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-04 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Thursday, August 04, 2016 4:42 PM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown 
> plans
> 
> On 2016/08/02 22:02, Kouhei Kaigai wrote:
> 
> I wrote:
> >> I removed the Relations line.  Here is an updated version of the patch.
> >>
> >> * As I said upthread, I left the upper-relation handling for another
> >> patch.  Currently, the patch prints "Foreign Scan" with no relations in
> >> that case.
> >>
> >> * I kept custom joins as-is.  We would need discussions about how to
> >> choose relations we print in EXPLAIN, so I'd also like to leave that for
> >> yet another patch.
> 
> > Please don't rely on fs_relids bitmap to pick up relations to be printed.
> > It just hold a set of underlying relations, but it does not mean all of
> > these relations are actually scanned inside of the ForeignScan.
> 
> Firstly, I'd like to discuss about what the relations printed after
> "Foreign join" would mean.  I think they would mean the relations
> involved in that foreign join (in other words, the ones that participate
> in that foreign join) rather than the relations to be scanned by a
> Foreign Scan representing that foreign join.  Wouldn't that make sense?
> 
> In that sense I think it's reasonable to print all relations specified
> in fs_relids after "Foreign Join".  (And in that sense I was thinking we
> could handle the custom join case the same way as the foreign join case.)
> 
> > You didn't answer the following scenario I pointed out in the upthread.
> >
> > | Please assume an enhancement of postgres_fdw that reads a small local 
> > table (tbl_1)
> > | and parse them as VALUES() clause within a remote query to execute remote 
> > join
> > | with foreign tables (ftbl_2, ftbl_3).
> > | This ForeignScan node has three underlying relations; tbl_1, ftbl_2 and 
> > ftbl_3.
> > | Likely, tbl_1 will be scanned by SeqScan, not ForeignScan itself.
> > | In this case, which relations shall be printed around ForeignScan?
> > | Is it possible to choose proper relation names without hint by the 
> > extension?
> > |
> >
> > To care about these FDW usage, you should add an alternative bitmap rather
> > than fs_relids/custom_relids. My suggestion is, having explain_relids for
> > the purpose.
> 
> We currently don't allow such a join to be performed as a foreign join,
> because we allow a join to be performed so if the relations of the join
> are all foreign tables belonging to the same remote server, as you know.
> 
> So, as I said upthread, I think it would make sense to introduce such a
> bitmap when we extend the existing foreign-join pushdown infrastructure
> so that we can do such a thing as proposed by you.  I guess that that
> would need to extend the concept of foreign joins, though.
>
OK. Right now, FDW does not support taking arbitrary child nodes, unlike
CustomScan. However, that implies your proposed infrastructure will also have
to be revised once FDW is enhanced in the near future.

> > Also, the logic to print "Foreign (Scan|Insert|Update|Delete)" is different
> > from what I suggested. I'm suggesting to allow extension giving a label
> > to fill up "Foreign %s" format.
> >
> > Please explain why your choice is better than my proposition.
> 
> No, I haven't done anything about that yet.  I kept the behavior as-is.
> 
> > At least, my proposition is available to apply on both of foreign-scan and
> > custom-scan, and no need to future maintenance if and when FDW gets support
> > remote Aggregation, Sort or others.
> 
> I'd like to discuss this issue separately, maybe in a new thread.
>
Why try to re-invent a similar infrastructure twice, separately?

What I proposed fully covers what you want to do, and has more benefits:
- A common manner for both ForeignScan and CustomScan
- Flexibility to control the "Foreign XXX" label and the relation names to be printed

Even if your proposition is sufficient for the current usage of FDW, I've been
saying it is not sufficient for CustomScan today, and will not be for
ForeignScan in the near future.
Ignoring the CustomScan side is not an answer, because we will have to enhance
the CustomScan infrastructure to follow FDW sooner or later. At that point we
would have to apply a different manner on the CustomScan side, or overwrite
what you proposed on the FDW side.
That is not a desirable future.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>





Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-02 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Tuesday, August 02, 2016 9:36 PM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown 
> plans
> 
> On 2016/08/01 20:15, Etsuro Fujita wrote:
> > I thought about the Relations line a bit more and noticed that there are
> > cases where the table reference names for joining relations in the
> > Relations line are printed incorrectly.  Here is an example:
> >
> > postgres=# explain verbose select * from (select t1.a, t2.a from ft1 t1,
> > ft2 t2 where t1.a = t2.a union select t1.a, t2.a from ft1 t1, ft2 t2
> > where t1.a = t2.a) as t(t1a, t2a);
> >  QUERY PLAN
> >
> --
> --
> >
> >  Unique  (cost=204.12..204.13 rows=2 width=8)
> >Output: t1.a, t2.a
> >->  Sort  (cost=204.12..204.12 rows=2 width=8)
> >  Output: t1.a, t2.a
> >  Sort Key: t1.a, t2.a
> >  ->  Append  (cost=100.00..204.11 rows=2 width=8)
> >->  Foreign Scan  (cost=100.00..102.04 rows=1 width=8)
> >  Output: t1.a, t2.a
> >  Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
> >  Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1
> > INNER JOIN public.t2 r2 ON (((r1.a = r2.a
> >->  Foreign Scan  (cost=100.00..102.04 rows=1 width=8)
> >  Output: t1_1.a, t2_1.a
> >  Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
> >  Remote SQL: SELECT r1.a, r2.a FROM (public.t1 r1
> > INNER JOIN public.t2 r2 ON (((r1.a = r2.a
> > (14 rows)
> >
> > The table reference names for ft1 and ft2 in the Relations line for the
> > second Foreign Scan should be t1_1 and t2_1 respectively.
> >
> > Another concern about the Relations line is, that represents just an
> > internal representation of a pushed-down join, so that would not
> > necessarily match a deparsed query shown in the Remote SQL line.  Here
> > is an example, which I found when working on supporting pushing down
> > full outer join a lot more, by improving the deparsing logic so that
> > postgres_fdw can build a remote query that involves subqueries [1],
> > which I'll work on for 10.0:
> >
> > + -- full outer join with restrictions on the joining relations
> > + EXPLAIN (COSTS false, VERBOSE)
> > + SELECT t1.c1, t2.c1 FROM (SELECT c1 FROM ft4 WHERE c1 BETWEEN 50 AND
> > 60) t1 FULL JOIN (SELECT c1 FROM ft5 WHERE c1 BETWEEN 50 AND 60) t2 ON
> > (t1.c1 = t2.c1) ORDER BY t1.c1, t2.c1;
> > + QUERY
> > PLAN
> > +
> >
> --
> --
> --
> -
> >
> > +  Foreign Scan
> > +Output: ft4.c1, ft5.c1
> > +Relations: (public.ft4) FULL JOIN (public.ft5)
> > +Remote SQL: SELECT ss1.c1, ss2.c1 FROM ((SELECT c1 FROM "S 1"."T 3"
> > WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss1(c1) FULL JOIN (SELECT c1 FROM
> > "S 1"."T 4" WHERE ((c1 >= 50)) AND ((c1 <= 60))) ss2(c1) ON (((ss1.c1 =
> > ss2.c1 ORDER BY ss1.c1 ASC NULLS LAST, ss2.c1 ASC NULLS LAST
> > + (4 rows)
> >
> > "(public.ft4) FULL JOIN (public.ft5)" in the Relations line does not
> > exactly match the deparsed query in the Remote SQL line, which I think
> > would be rather confusing for users.  (We may be able to print more
> > exact information in the Relations line so as to match the depaserd
> > query, but I think that that would make the Relations line redundant.)
> >
> > Would we really need the Relations line?  If joining relations are
> > printed by core like "Foreign Join on public.ft1 t1_1, public.ft2 t2_1"
> > as proposed upthread, we can see those relations from that, not the
> > Relations line.  Also we can see the join tree structure from the
> > deparsed query in the Remote SQL line.  The Relations line seems to be
> > not that useful anymore, then.  What do you think about that?
> 
> I removed the Relations line.  Here is an updated version of the patch.
> 
> * As I said upthread, I left the upper-relation handling for another
> patch.  Currently, the patch prints "Foreign Scan" with no relations in
> that case.
> 
> * I kept custom joins as-is.  We would need discussions about how to
> choose relations we print in EXPLAIN, so I'd also like to leave that for
> yet another patch.
>
Please don't rely on the fs_relids bitmap to pick the relations to be printed.
It just holds the set of underlying relations; it does not mean that all of
those relations are actually scanned inside the ForeignScan.

Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-02 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Tuesday, August 02, 2016 2:45 PM
> To: Kaigai Kouhei(海外 浩平); Ashutosh Bapat
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown 
> plans
> 
> On 2016/08/02 13:32, Kouhei Kaigai wrote:
> 
> I wrote:
> >> My concern here is EXPLAIN for foreign joins.  I think it's another
> >> problem how we handle Foreign Scan plan nodes representing
> >> post-scan/join operations in EXPLAIN, so I'd like to leave that for
> >> another patch.
> 
> > What is the post-scan/join operations? Are you saying EPQ recheck and
> > alternative local join plan?
> 
> No.  I mean e.g., aggregation, window functions, sorting, or table
> modification.  In other words, Foreign Scan plan nodes created from
> ForeignPath paths created from GetForeignUpperPaths.
>
Why do you need to handle these upper paths individually?
The FDW driver knows whether the remote query contains aggregation, window
functions, sorting, or table modification. It can give a proper label.

If the remote query contains a partial aggregation plus a join of relations,
for example, "Partial Agg + Scan" would be a proper label to fill the
"Foreign %s" format.

All you need are the following two enhancements:
- Add a "const char *explain_label" field on the ForeignScanState, or somewhere
  else the extension can set. It gives the label to be printed.
  If a NULL string is set, EXPLAIN shows "Foreign Scan" as the default.
- Add a "Bitmapset *explain_rels" field on the ForeignScanState, or somewhere
  else the extension can set. It indicates the relations to be printed after
  the "Foreign %s" token. If you want to print the names of all the relations
  underlying this ForeignScan node, just copy the scanrelids bitmap.
  If a NULL bitmap is set, EXPLAIN prints nothing by default.

Please note that the default does not change the existing behavior.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-01 Thread Kouhei Kaigai
> On 2016/08/01 22:25, Kouhei Kaigai wrote:
> 
> I wrote:
> >>> a broader
> >>> word like "Processing" seems to work well because we allow various
> >>> kinds of operations to the remote side, in addition to scans/joins,
> >>> to be performed in that one Foreign Scan node indicated by "Foreign
> >>> Processing", such as aggregation, window functions, distinct, order
> >>> by, row locking, table modification, or combinations of them.
> 
>  >> On 2016/07/29 13:28, Ashutosh Bapat wrote:
> >>> "Scan" is a better word than "Processing". From plan's perspective it's
> >>> ultimately a Scan (on the data produced by the foreign server) and not
> >>> processing.
> 
> I wrote:
> >> Exactly, but one thing I'm concerned about is the table modification
> >> case; the EXPLAIN output would be something like this:
> >>
> >>Foreign Scan
> >>  Remote SQL: INSERT INTO remote_table VALUES ...
> >>
> >> That would be strange, so I think a more broader word is better.  I
> >> don't think "Foreign Processing" is best.  "Foreign Upper" might be much
> >> better because the corresponding path is created by calling
> >> GetForeignUpperPaths.
> 
> > Be "Foreign %s", and gives "Insert" label by postgres_fdw; which knows
> > the ForeignScan node actually insert tuples.
> > From the standpoint of users, it looks "Foreign Insert".
> 
> My concern here is EXPLAIN for foreign joins.  I think it's another
> problem how we handle Foreign Scan plan nodes representing
> post-scan/join operations in EXPLAIN, so I'd like to leave that for
> another patch.
>
What are the post-scan/join operations? Do you mean the EPQ recheck and the
alternative local join plan?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-01 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Monday, August 01, 2016 8:26 PM
> To: Ashutosh Bapat
> Cc: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown 
> plans
> 
> On 2016/07/29 13:28, Ashutosh Bapat wrote:
> 
> I wrote:
> > Probably something like this:
> >
> >Foreign Processing
> >  Remote Operations: ...
> >
> > In the Remote Operations line, the FDW/extension could print
> > any info
> > about remote operations, eg, "Scan/Join + Aggregate".
> 
> I wrote:
> > I intentionally chose that word and thought we could leave detailed
> > descriptions about remote operations to the FDW/extension; a broader
> > word like "Processing" seems to work well because we allow various
> > kinds of operations to the remote side, in addition to scans/joins,
> > to be performed in that one Foreign Scan node indicated by "Foreign
> > Processing", such as aggregation, window functions, distinct, order
> > by, row locking, table modification, or combinations of them.
> 
> > "Scan" is a better word than "Processing". From plan's perspective it's
> > ultimately a Scan (on the data produced by the foreign server) and not
> > processing.
> 
> Exactly, but one thing I'm concerned about is the table modification
> case; the EXPLAIN output would be something like this:
> 
>Foreign Scan
>  Remote SQL: INSERT INTO remote_table VALUES ...
> 
> That would be strange, so I think a more broader word is better.  I
> don't think "Foreign Processing" is best.  "Foreign Upper" might be much
> better because the corresponding path is created by calling
> GetForeignUpperPaths.
>
Be "Foreign %s", and gives "Insert" label by postgres_fdw; which knows
the ForeignScan node actually insert tuples.
From the standpoint of users, it looks "Foreign Insert".

> Also for a Foreign Scan representing a foreign join, I think "Foreign
> Join" is better than "Foreign Scan".  Here is an example:
> 
>Foreign Join on foreign_table1, foreign_table2
>  Remote SQL: SELECT ... FROM remote_table1 INNER JOIN remote_table2
> WHERE ...
> 
> I think "Foreign Join" better indicates that foreign tables listed after
> that (ie, foreign_table1 and foreign_table2 in the example) are joining
> (or component) tables of this join, than "Foreign Scan".
>
postgres_fdw knows it is a remote join. It is quite easy for it to tell the
core that this ForeignScan node is a "Join". It also knows which tables are
involved. We can add hint information to control the relation names to be printed.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-28 Thread Kouhei Kaigai
> I wrote:
> >> That may be so, but my point is that the target relations involved in
> >> the foreign join (ie, ft1 and ft2) should be printed somewhere in the
> >> EXPLAIN output by core, as in EXPLAIN for a simple foreign table scan.
> 
> > Why? According to your rule, Hash Join should take "on t0,t1,t2".
> >
> > postgres=# explain select id from t0 natural join t1 natural join t2;
> >  QUERY PLAN
> > -
> >  Hash Join  (cost=6370.00..4560826.24 rows=98784048 width=4)
> >Hash Cond: (t0.aid = t1.aid)
> >->  Hash Join  (cost=3185.00..3199360.58 rows=98784048 width=8)
> >  Hash Cond: (t0.bid = t2.bid)
> >  ->  Seq Scan on t0  (cost=0.00..184.80 rows=10080 width=12)
> >  ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
> >->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
> >->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
> >  ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
> > (9 rows)
> 
> I don't think it needs "on t0,t1,t2", because we can see joining
> relations from inner/outer plans in that case.  In a foreign-join case,
> however, we can't see such relations from the EXPLAIN printed *by core*.
>   postgres_fdw avoids this issue by adding such relations to the EXPLAIN
> using ExplainForeignScan as shown in the below example, but since such
> relations are essential, I think that information should be shown by
> core itself.
>
I have never opposed having the core print the relation names. However, it
has to be controllable by the extension that provides the ForeignScan/CustomScan,
because only the extension knows exactly how the underlying relations are
scanned.

> postgres=# explain select * from (select ft1.a from ft1 left join ft2 on
> ft1.a = ft2.a where ft1.b = 1) ss1(a) full join (select ft3.a from ft3
> left join ft4 on ft3.a = ft4.a where ft3.b = 1) ss2(a) on ss1.a = ss2.a;
> QUERY PLAN
> 
>   Hash Full Join  (cost=202.06..204.12 rows=1 width=8)
> Hash Cond: (ft1.a = ft3.a)
> ->  Foreign Scan  (cost=100.00..102.05 rows=1 width=4)
>   Relations: (public.ft1) LEFT JOIN (public.ft2)
> ->  Hash  (cost=102.05..102.05 rows=1 width=4)
>   ->  Foreign Scan  (cost=100.00..102.05 rows=1 width=4)
> Relations: (public.ft3) LEFT JOIN (public.ft4)
> (7 rows)
> 
>  From the Relations line shown by postgres_fdw, we can see which foreign
> join joins which foreign tables, but if no such lines, we couldn't.
>
In the case of postgres_fdw, if the extension can tell the core EXPLAIN to
print all of its underlying relations, the core EXPLAIN routine will print
their names. There is no problem here.

> >>> postgres=# explain select id from t0 natural join t1 natural join t2;
> >>> QUERY PLAN
> >>> ---
> >>>  Custom Scan (GpuJoin)  (cost=12385.67..291245.35 rows=9815465 width=4)
> >>>GPU Projection: t0.id
> >>>Depth 1: GpuHashJoin, HashKeys: (t0.bid)
> >>> JoinQuals: (t0.bid = t2.bid)
> >>> Nrows (in/out: 98.15%), KDS-Hash (size: 13.47MB, nbatches: 1)
> >>>Depth 2: GpuHashJoin, HashKeys: (t0.aid)
> >>> JoinQuals: (t0.aid = t1.aid)
> >>> Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
> >>>->  Seq Scan on t0  (cost=0.00..18.96 rows=996 width=12)
> >>>->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
> >>>->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
> >>> (11 rows)
> 
> > My largest concern for you proposition is, ForeignScan/CustomScan node is
> > enforced to print name of underlying relations, regardless of its actual
> > behavior. The above GpuJoin never scans tables at least, thus, it mislead
> > users if we have no choice to print underlying relation names.
> 
> OK, I understand we would need special handling for such custom joins.
>
It is not a case only for CustomScan.

Assume an enhancement of postgres_fdw that reads a small local table (tbl_1)
and embeds its rows as a VALUES() clause within a remote query, to execute a
remote join with foreign tables (ftbl_2, ftbl_3).
This ForeignScan node has three underlying relations: tbl_1, ftbl_2 and ftbl_3.
Likely, tbl_1 will be scanned by a SeqScan, not by the ForeignScan itself.
In this case, which relations should be printed around the ForeignScan?
Is it possible to choose proper relation names without a hint from the extension?
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-28 Thread Kouhei Kaigai
> On 2016/07/28 10:01, Kouhei Kaigai wrote:
> > What I'm saying is here:
> 
> > EXPLAIN (COSTS false, VERBOSE)
> > SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY
> > t1.c3, t1.c1 OFFSET 100 LIMIT 10;
> >   QUERY PLAN
> > ---
> > Limit
> >   Output: t1.c1, t2.c1, t1.c3
> >   ->  Foreign Scan
> >   
> > Output: t1.c1, t2.c1, t1.c3
> > Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
> > Remote SQL: SELECT r1."C 1", r1.c3, r2."C 1" FROM ("S 1"."T
> > 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY
> > r1.c3 ASC N\
> > ULLS LAST, r1."C 1" ASC NULLS LAST
> > (6 rows)
> 
> > Your concern is that EXPLAIN shows ForeignScan node as "Foreign Scan"
> > although it actually works as "Foreign Join".
> 
> That may be so, but my point is that the target relations involved in
> the foreign join (ie, ft1 and ft2) should be printed somewhere in the
> EXPLAIN output by core, as in EXPLAIN for a simple foreign table scan.
>
Why? According to your rule, Hash Join should take "on t0,t1,t2".

postgres=# explain select id from t0 natural join t1 natural join t2;
 QUERY PLAN
-
 Hash Join  (cost=6370.00..4560826.24 rows=98784048 width=4)
   Hash Cond: (t0.aid = t1.aid)
   ->  Hash Join  (cost=3185.00..3199360.58 rows=98784048 width=8)
 Hash Cond: (t0.bid = t2.bid)
 ->  Seq Scan on t0  (cost=0.00..184.80 rows=10080 width=12)
 ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
   ->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
   ->  Hash  (cost=1935.00..1935.00 rows=10 width=4)
 ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
(9 rows)

> > My suggestion makes EXPLAIN print "Foreign %s" according to the label
> > assigned by the extension. Only the FDW driver knows what this ForeignScan
> > node actually performs; thus, only the FDW driver can set a meaningful label.
> >
> > This label may be "Join", may be "Partial Aggregate + Join", or something
> > else according to the actual job of this node.
> 
> OK, thanks for that.  But sorry, I'm not sure that's a good idea.  One
> reason for that is that FDWs/extensions might give different labels to
> the same upper-planner action, which would lead to confusing EXPLAINs.
>
If an extension puts a label that is hard to understand, that is a problem of the extension.

> Another is that labeling might be annoying, so some FDWs/extensions may
> not want to do that.
>
I'm happy with an arbitrary label; it's not annoying to me. :-)

> Couldn't we print EXPLAINs in a more unified way?  A simple idea is to
> introduce a broad label, eg, "Foreign Processing", as a unified label;
> if the FDW/extension performs any upper-planner actions remotely, then
> the label "Foreign Processing" will be printed by core, and following
> that, the FDW/extension would print what they want, using
> ExplainForeignScan/ExplainCustomScan.  Probably something like this:
> 
>Foreign Processing
>  Remote Operations: ...
> 
> In the Remote Operations line, the FDW/extension could print any info
> about remote operations, eg, "Scan/Join + Aggregate".  Different data
> sources have different concepts/terms, so it seems reasonable to me that
> there would be different descriptions by different data sources, to the
> same plan actions done remotely.
>
"Foreign" implies this node is processed by FDW, but "Procesing" gives us
no extra information; seems to me redundant.
Prior to the new invention, please explain why you don't want to by my
suggestion first? Annoying is a feel of you, but not a logic to persuade
others.

> >>>  - A flag to turn on/off printing relation(s) name
> >>
> >> ISTM that the relation information should be always printed even in the
> >> case of scanrelid=0 by the core, because that would be essential for
> >> analyzing EXPLAIN results.
> 
> > We have no information about which relations are actually scanned by ForeignScan
> > and CustomScan. Your patch abuses fs_relids and custom_relids; however,
> > these fields don't indicate the relations actually scanned by these nodes.
> > They just track the relations processed by this node or its children.
> 
> Maybe my explanation was not enough, but I just mean that *joining
> relations* s

Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-27 Thread Kouhei Kaigai
> On 2016/07/27 13:51, Kouhei Kaigai wrote:
> > This output is, at least, not incorrect.
> > This ForeignScan actually scans a relation that consists of two joined
> > tables on the remote side. So, not incorrect, but maybe not convenient for
> > better understanding by users who don't have deep internal knowledge.
> 
> Hmm.
> 
> > On the other hand, I cannot be happy with your patch because it concludes
> > that the role of ForeignScan/CustomScan with scanrelid==0 is always a join.
> > However, that does not cover all the cases.
> >
> > When postgres_fdw gets support for remote partial aggregates, we can implement
> > the feature using ForeignScan with scanrelid==0. Is it a join? No.
> 
> Yeah, I think that that could be implemented in both cases (scanrelid>0
> and scanrelid=0).
> 
> > Probably, the core cannot know the exact purpose of a ForeignScan/CustomScan
> > with scanrelid==0. It can be a remote partial aggregation. It can be an
> > alternative sort logic on GPU. It can be something else.
> > So, I think the extension needs to inform the core of a more proper label to
> > show, including a flag to control whether the relation(s) name shall be
> > attached next to the ForeignScan/CustomScan line.
> >
> > I'd like to suggest adding the two fields below:
> >  - An alternative label the extension wants to show instead of the default one
> 
> How about adding something like the "Additional Operations" line as
> proposed in a previous email, instead?  As you know, FDWs/Extensions
> could add such information through ExplainForeignScan/ExplainCustomScan.
>
No. What I'm saying is here:


EXPLAIN (COSTS false, VERBOSE)
SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY  
t1.c3, t1.c1 OFFSET 100 LIMIT 10;
  QUERY PLAN   
---
Limit
  Output: t1.c1, t2.c1, t1.c3
  ->  Foreign Scan
Output: t1.c1, t2.c1, t1.c3
Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
Remote SQL: SELECT r1."C 1", r1.c3, r2."C 1" FROM ("S 1"."T  
1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY  
r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST
(6 rows)


Your concern is that EXPLAIN shows ForeignScan node as "Foreign Scan"
although it actually works as "Foreign Join".

My suggestion makes EXPLAIN print "Foreign %s" according to the label
assigned by the extension. Only the FDW driver knows what this ForeignScan
node actually performs; thus, only the FDW driver can set a meaningful label.

This label may be "Join", may be "Partial Aggregate + Join", or something
else according to the actual job of this node.

> >  - A flag to turn on/off printing relation(s) name
> 
> ISTM that the relation information should be always printed even in the
> case of scanrelid=0 by the core, because that would be essential for
> analyzing EXPLAIN results.
>
We have no information about which relations are actually scanned by ForeignScan
and CustomScan. Your patch abuses fs_relids and custom_relids; however,
these fields don't indicate the relations actually scanned by these nodes.
They just track the relations processed by this node or its children.

See the example below. This GpuJoin on CustomScan takes three SeqScan nodes
as inner/outer sources of the joined relations. GpuJoin never actually scans
the tables; it picks up the records generated by the underlying SeqScan nodes.


postgres=# explain select id from t0 natural join t1 natural join t2;
QUERY PLAN
---
 Custom Scan (GpuJoin)  (cost=12385.67..291245.35 rows=9815465 width=4)
   GPU Projection: t0.id
   Depth 1: GpuHashJoin, HashKeys: (t0.bid)
JoinQuals: (t0.bid = t2.bid)
Nrows (in/out: 98.15%), KDS-Hash (size: 13.47MB, nbatches: 1)
   Depth 2: GpuHashJoin, HashKeys: (t0.aid)
JoinQuals: (t0.aid = t1.aid)
Nrows (in/out: 100.00%), KDS-Hash (size: 13.47MB, nbatches: 1)
   ->  Seq Scan on t0  (cost=0.00..18.96 rows=996 width=12)
   ->  Seq Scan on t2  (cost=0.00..1935.00 rows=10 width=4)
   ->  Seq Scan on t1  (cost=0.00..1935.00 rows=10 width=4)
(11 rows)


If the EXPLAIN command attaches something like "on t0,t1,t2" after the GpuJoin
*with no choice*, it is problematic and misleading.
It shall be controllable by the extension, which knows what tables are
actually scanned. (Please note that I never said the extension makes the string.
That is a job of core EXPLAIN. I only suggest that it needs to be controllable.)


Even though I recommended a boolean flag to turn on/off this 

Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-07-26 Thread Kouhei Kaigai
> I noticed that currently the core doesn't show any information on the
> target relations involved in a foreign/custom join in EXPLAIN, by
> itself.  Here is an example:
> 
> -- join two tables
> EXPLAIN (COSTS false, VERBOSE)
> SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY
> t1.c3, t1.c1 OFFSET 100 LIMIT 10;
>   QUERY PLAN
> --
> Limit
>   Output: t1.c1, t2.c1, t1.c3
>   ->  Foreign Scan
> Output: t1.c1, t2.c1, t1.c3
> Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
> Remote SQL: SELECT r1."C 1", r1.c3, r2."C 1" FROM ("S 1"."T
> 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY
> r1.c3 ASC NULLS LAST, r1."C 1" ASC NULLS LAST
> (6 rows)
> 
> postgres_fdw shows the target relations in the Relations line, as shown
> above, but I think that the core should show such information
> independently of FDWs; in the above example replace "Foreign Scan" with
> "Foreign Join on public.ft1 t1, public.ft2 t2".  Please find attached a
> patch for that.  Comments welcome!
>
This output is, at least, not incorrect.
This ForeignScan actually scans a relation that consists of two joined
tables on the remote side. So, not incorrect, but maybe not convenient for
better understanding by users who don't have deep internal knowledge.

On the other hand, I cannot be happy with your patch because it concludes
that the role of ForeignScan/CustomScan with scanrelid==0 is always a join.
However, that does not cover all the cases.

When postgres_fdw gets support for remote partial aggregates, we can implement
the feature using ForeignScan with scanrelid==0. Is it a join? No.

Probably, the core cannot know the exact purpose of a ForeignScan/CustomScan
with scanrelid==0. It can be a remote partial aggregation. It can be an
alternative sort logic on GPU. It can be something else.
So, I think the extension needs to inform the core of a more proper label to
show, including a flag to control whether the relation(s) name shall be
attached next to the ForeignScan/CustomScan line.

I'd like to suggest adding the two fields below:
 - An alternative label the extension wants to show instead of the default one
 - A flag to turn on/off printing relation(s) name

EXPLAIN can print proper information according to these requirements.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] make clean didn't clean up files generated from *.(y|l)

2016-06-29 Thread Kouhei Kaigai
> Kouhei Kaigai <kai...@ak.jp.nec.com> writes:
> > I tried to build the latest master branch just after the switch from
> > REL9_5_STABLE and "make clean", however, repl_gram.c was not cleaned
> > up correctly. So, my problem is that repl_gram.l was the latest version,
> > but compiler saw the repl_gram.c generated based on the v9.5 source.
> > ...
> > Probably, we have to add explicit cleanup of these auto-generated files
> > on Makefiles.
> 
> "make clean" absolutely should NOT remove that file; not even "make
> distclean" should, because we ship it in tarballs.  Likewise for the other
> bison product files you mention, as well as a boatload of other derived
> files.
> 
> If you want to checkout a different release branch in the same working
> directory, I'd suggest "make maintainer-clean" or "git clean -dfx" first.
> (Personally I don't ever do that --- it's much easier to maintain a
> separate workdir per branch.)
> 
> Having said that, switching to a different branch should have resulted in
> repl_gram.l being updated by git, and thereby acquiring a new file mod
> date; so I don't understand why make wouldn't have chosen to rebuild
> repl_gram.c.  Can you provide a reproducible sequence that makes this
> happen?
>
Ah, I might have performed an inadequate operation just before switching branches.

$ cd ~/source/pgsql <-- REL9_5_STABLE; already built
$ git checkout master
$ cp -r ~/source/pgsql ~/repo/pgsql-kg
$ cd ~/repo/pgsql-kg
$ ./configure
$ make clean
$ make  <-- repl_gram.c raised an error

~/source/pgsql is a copy of the community's branch, with none of my own
modifications. To keep it clean, I copied the entire repository to another
directory, but the cp command updated the file modification timestamps.
That may be the reason why repl_gram.c was not rebuilt.

Sorry for the noise.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>





[HACKERS] make clean didn't clean up files generated from *.(y|l)

2016-06-28 Thread Kouhei Kaigai
Hello,

I got the build error below. It complains that K_RESERVE_WAL is not declared;
however, such a problem could not have gone unnoticed for a long time, so
something else must be wrong.

I tried to build the latest master branch just after the switch from
REL9_5_STABLE and "make clean", however, repl_gram.c was not cleaned
up correctly. So, my problem is that repl_gram.l was the latest version,
but the compiler saw the repl_gram.c generated from the v9.5 source.

I saw similar problems at:
  src/backend/replication/repl_gram.c
  src/interfaces/ecpg/preproc/pgc.c
  src/bin/pgbench/exprparse.c
  src/bin/pgbench/exprscan.c
  src/pl/plpgsql/src/pl_gram.c

(*) At least, these files raised a build error.

Probably, we have to add explicit cleanup of these auto-generated files
to the Makefiles.

Thanks,


gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Wendif-labels -Wmissing-format-attribute -Wformat-security 
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -O0 -g -I. -I. 
-I../../../src/include -D_GNU_SOURCE   -c -o repl_gram.o repl_gram.c
In file included from repl_gram.y:320:0:
repl_scanner.l: In function 'replication_yylex':
repl_scanner.l:98:10: error: 'K_RESERVE_WAL' undeclared (first use in this 
function)
 RESERVE_WAL   { return K_RESERVE_WAL; }
  ^
repl_scanner.l:98:10: note: each undeclared identifier is reported only once 
for each function it appears in
make[3]: *** [repl_gram.o] Error 1
make[3]: Leaving directory `/home/kaigai/repo/pgsql/src/backend/replication'
make[2]: *** [replication-recursive] Error 2
make[2]: Leaving directory `/home/kaigai/repo/pgsql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/kaigai/repo/pgsql/src'
make: *** [all-src-recurse] Error 2


--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 






Re: [HACKERS] Does people favor to have matrix data type?

2016-06-01 Thread Kouhei Kaigai
> -Original Message-
> From: Jim Nasby [mailto:jim.na...@bluetreble.com]
> Sent: Wednesday, June 01, 2016 11:32 PM
> To: Kaigai Kouhei(海外 浩平); Gavin Flower; Joe Conway; Ants Aasma; Simon Riggs
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> 
> On 5/30/16 9:05 PM, Kouhei Kaigai wrote:
> > Due to performance reasons, the location of each element must be deterministic
> > without walking the data structure. This approach guarantees we can
> > reach an individual element in 2 steps.
> 
> Agreed.
> 
> On various other points...
> 
> Yes, please keep the discussion here, even when it relates only to PL/R.
> Whatever is being done for R needs to be done for plpython as well. I've
> looked at ways to improve analytics in plpython related to this, and it
> looks like I need to take a look at the fast-path function stuff. One of
> the things I've pondered for storing ndarrays in Postgres is how to
> reduce or eliminate the need to copy data from one memory region to
> another. It would be nice if there was a way to take memory that was
> allocated by one manager (ie: python's) and transfer ownership of that
> memory directly to Postgres without having to copy everything. Obviously
> you'd want to go the other way as well. IIRC cython's memory manager is
> the same as palloc in regard to very large allocations basically being
> ignored completely, so this should be possible in that case.
> 
> One thing I don't understand is why this type needs to be limited to 1
> or 2 dimensions? Isn't the important thing how many individual elements
> you can fit into GPU? So if you can fit a 1024x1024, you could also fit
> a 100x100x100, a 32x32x32x32, etc. At low enough values maybe that stops
> making sense, but I don't see why there needs to be an artificial limit.
>
Simply, I hadn't looked at matrices larger than 2 dimensions.
Is that a valid mathematical concept?
Because of the nature of GPUs, they are designed to map threads on the X-, Y-,
and Z-axes. However, they are not limited to 3 dimensions, because a programmer
can handle the upper 10 bits of the X-axis as a 4th dimension, for example.

> I think what's important for something like kNN is that the storage is
> optimized for this, which I think means treating the highest dimension
> as if it was a list. I don't know if it then matters whether the lower
> dimensions are C style vs FORTRAN style. Other algorithms might want
> different storage.
>
FORTRAN style is preferable, because it allows us to use BLAS libraries without
data format transformation.
I'm not sure why you prefer a list structure on the highest dimension.
A simple lookup table is enough, and suitable for massively parallel processors.

> Something else to consider is the 1G toast limit. I'm pretty sure that's
> why MADlib stores matricies as a table of vectors. I know for certain
> it's a problem they run into, because they've discussed it on their
> mailing list.
> 
> BTW, take a look at MADlib svec[1]... ISTM that's just a special case of
> what you're describing with entire grids being zero (or vice-versa).
> There might be some commonality there.
> 
> [1] https://madlib.incubator.apache.org/docs/v1.8/group__grp__svec.html
>
Once we try to deal with a table as a representation of a matrix, it goes
beyond the scope of a data type in PostgreSQL. Likely, users would have to do
something special to treat it as such, beyond the operator functions that
support matrix data types.

For a matrix larger than the toast limit, my thought is this: a large matrix
that consists of multiple grids can have one toast pointer per sub-matrix.
Although an individual sub-matrix must be up to 1GB, we can represent an
entire matrix larger than 1GB as a set of grids.

As I wrote in the previous message, the matrix structure's header will have
offsets to each sub-matrix (grid). A sub-matrix will be inline, or an external
toast pointer if VARATT_IS_1B_E(ptr).
Probably, we also have to hack the row deletion code not to leak sub-matrices
in the toast table.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>

> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
> 855-TREBLE2 (855-873-2532)   mobile: 512-569-9461



Re: [HACKERS] Does people favor to have matrix data type?

2016-05-30 Thread Kouhei Kaigai
> -Original Message-
> From: Gavin Flower [mailto:gavinflo...@archidevsys.co.nz]
> Sent: Tuesday, May 31, 2016 9:47 AM
> To: Kaigai Kouhei(海外 浩平); Joe Conway; Jim Nasby; Ants Aasma; Simon Riggs
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> 
> On 31/05/16 12:01, Kouhei Kaigai wrote:
> >> On 05/29/2016 04:55 PM, Kouhei Kaigai wrote:
> >>> For the closer integration, it may be valuable if PL/R and PL/CUDA can 
> >>> exchange
> >>> the data structure with no serialization/de-serialization when PL/R code 
> >>> tries
> >>> to call SQL functions.
> >> I had been thinking about something similar. Maybe PL/R can create an
> >> extension within the R environment that wraps PL/CUDA directly or at the
> >> least provides a way to use a fast-path call. We should probably try to
> >> start out with one common use case to see how it might work and how much
> >> benefit there might be.
> >>
> > My thought is the second option above. If SPI interface supports fast-path
> > like 'F' protocol, it may become a natural way for other PLs also to
> > integrate SQL functions in other languages.
> >
> >>> IIUC, pg.spi.exec("SELECT my_function(...)") is the only way to call SQL 
> >>> functions
> >> inside PL/R scripts.
> >>
> >> Correct (currently).
> >>
> >> BTW, this is starting to drift off topic I think -- perhaps we should
> >> continue off list?
> >>
> > Some elements are common for PostgreSQL (matrix data type and fastpath SPI
> > interface). I like to keep the discussion on the list.
> > Regarding to the PoC on a particular use case, it might be an off-list
> > discussion.
> >
> > Thanks,
> > --
> > NEC Business Creation Division / PG-Strom Project
> > KaiGai Kohei <kai...@ak.jp.nec.com>
> >
> Possibly there should be two matrix types in Postgres: the first would
> be the default and optimized for small dense matrices, the second would
> store large sparse matrices efficiently in memory at the expensive of
> speed (possibly with one or more parameters relating to how sparse it is
> likely to be?) - for appropriate definitions 'small' & 'large', though
> memory savings for the latter type might not kick in unless the matrices
> are big enough (so a small sparse matrix might consume more memory than
> a nominally larger dense matrix type & a sparse matrix might have to be
> sufficiently sparse to make real memory savings at any size).
>
One idea in my mind is that a sparse matrix is represented as a grid of
smaller matrices, omitting the all-zero areas. It looks like indirect
pointer references. The header of the matrix type has offset values to
each grid. If offset == 0, that grid contains all zeros.

Due to performance reasons, the location of each element must be deterministic
without walking the data structure. This approach guarantees we can
reach an individual element in 2 steps.

A flat matrix can be represented as a special case of the sparse matrix.
If the entire matrix consists of a 1x1 arrangement of grids, it is a flat matrix.
We may not need to define two individual data types.

> Probably good to think of 2 types at the start, even if the only the
> first is implemented initially.
>
I agree.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Does people favor to have matrix data type?

2016-05-30 Thread Kouhei Kaigai
> On 05/29/2016 04:55 PM, Kouhei Kaigai wrote:
> > For the closer integration, it may be valuable if PL/R and PL/CUDA can 
> > exchange
> > the data structure with no serialization/de-serialization when PL/R code 
> > tries
> > to call SQL functions.
> 
> I had been thinking about something similar. Maybe PL/R can create an
> extension within the R environment that wraps PL/CUDA directly or at the
> least provides a way to use a fast-path call. We should probably try to
> start out with one common use case to see how it might work and how much
> benefit there might be.
>
My thought is the second option above. If the SPI interface supports a fast-path
like the 'F' protocol, it may become a natural way for other PLs to
integrate SQL functions in other languages as well.

> > IIUC, pg.spi.exec("SELECT my_function(...)") is the only way to call SQL 
> > functions
> inside PL/R scripts.
> 
> Correct (currently).
> 
> BTW, this is starting to drift off topic I think -- perhaps we should
> continue off list?
>
Some elements are common to PostgreSQL (the matrix data type and a fast-path SPI
interface). I'd like to keep the discussion on the list.
Regarding a PoC on a particular use case, that might be an off-list
discussion.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



Re: [HACKERS] Does people favor to have matrix data type?

2016-05-29 Thread Kouhei Kaigai
> On 05/28/2016 03:33 PM, Kouhei Kaigai wrote:
> >> -Original Message-
> >> From: Joe Conway [mailto:m...@joeconway.com]
> >> Sent: Sunday, May 29, 2016 1:40 AM
> >> To: Kaigai Kouhei(海外 浩平); Jim Nasby; Ants Aasma; Simon Riggs
> >> Cc: pgsql-hackers@postgresql.org
> >> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> >>
> >> On 05/28/2016 07:12 AM, Kouhei Kaigai wrote:
> >>> Sparse matrices! That is a disadvantaged area for the current array format.
> >>>
> >>> I have two ideas. HPC folks often split a large matrix into multiple
> >>> grids. A grid is typically up to a 1024x1024 matrix, for example.
> >>> If a grid consists of all zero elements, it is obvious we don't need
> >>> to store the individual elements of that grid.
> >>> The other idea is compression. If most of the matrix is zero, it is ideal
> >>> data for compression, and it is easy to reconstruct only when calculating.
> >>>
> >>>> Related to this, Tom has mentioned in the past that perhaps we should
> >>>> support abstract use of the [] construct. Currently point finds a way to
> >>>> make use of [], but I think that's actually coded into the grammar.
> >>>>
> >>> Yep, if we consider a 2D array to be a matrix, no special enhancement is
> >>> needed to use []. However, I'm inclined to have our own data structure for
> >>> matrices, to represent sparse matrices.
> >>
> >> +1 I'm sure this would be useful for PL/R as well.
> >>
> >> Joe
> >>
> > It is a pretty good idea to combine PL/R and PL/CUDA (what I'm now working on)
> > for advanced analytics. We will be able to off-load the heavy computing portion
> > to GPU, then also utilize various R functions inside the database.
> 
> Agreed. Perhaps at some point we should discuss closer integration of
> some sort, or at least a sample use case.
>
What I'm trying to implement first is k-means clustering on GPU. Its core
workload is the iteration of massive distance calculations. When I ran the
kmeans() function of R for a million items with 10 clusters on 40 dimensions,
it took about a thousand seconds.
If the GPU version provides the result matrix more rapidly, then I expect R can
plot the relationships between items and clusters in a human-friendly way.

For closer integration, it may be valuable if PL/R and PL/CUDA could exchange
data structures with no serialization/de-serialization when PL/R code tries
to call SQL functions. IIUC, pg.spi.exec("SELECT my_function(...)") is the only
way to call SQL functions inside PL/R scripts.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



Re: [HACKERS] Does people favor to have matrix data type?

2016-05-28 Thread Kouhei Kaigai
> -Original Message-
> From: Joe Conway [mailto:m...@joeconway.com]
> Sent: Sunday, May 29, 2016 1:40 AM
> To: Kaigai Kouhei(海外 浩平); Jim Nasby; Ants Aasma; Simon Riggs
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> 
> On 05/28/2016 07:12 AM, Kouhei Kaigai wrote:
> > Sparse matrices! That is a disadvantaged area for the current array format.
> >
> > I have two ideas. HPC folks often split a large matrix into multiple
> > grids. A grid is typically up to a 1024x1024 matrix, for example.
> > If a grid consists of all zero elements, it is obvious we don't need
> > to store the individual elements of that grid.
> > The other idea is compression. If most of the matrix is zero, it is ideal
> > data for compression, and it is easy to reconstruct only when calculating.
> >
> >> Related to this, Tom has mentioned in the past that perhaps we should
> >> support abstract use of the [] construct. Currently point finds a way to
> >> make use of [], but I think that's actually coded into the grammar.
> >>
> > Yep, if we consider a 2D array to be a matrix, no special enhancement is
> > needed to use []. However, I'm inclined to have our own data structure for
> > matrices, to represent sparse matrices.
> 
> +1 I'm sure this would be useful for PL/R as well.
> 
> Joe
>
It is a pretty good idea to combine PL/R and PL/CUDA (what I'm now working on)
for advanced analytics. We will be able to off-load the heavy computing portion
to GPU, and then also utilize various R functions inside the database.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



Re: [HACKERS] Does people favor to have matrix data type?

2016-05-28 Thread Kouhei Kaigai
> On 5/25/16 7:46 AM, Kouhei Kaigai wrote:
> > My only concern is that a domain type is not allowed to define type casts.
> > If we could add type casts on domains, we could define type transformations
> > from other array types to matrix.
> 
> I've actually wished for that in the past, as well as casting to
> compound types. Having those would make it easier to mock up a real data
> type for experimentation.
> 
> I strongly encourage you to talk to the MADlib community about
> first-class matrix support. They currently emulate matrices via arrays.
>
Some MADlib folks contacted me in the past, when I initially made PG-Strom
several years ago; however, there has been no communication with them since then.

https://madlib.incubator.apache.org/docs/v1.8/group__grp__arraysmatrix.html
According to the documentation, their approach is different from what
I'd like to implement. They use a table as if it were a matrix. Because of that
data format, we would have to adjust the positions of the data elements to fit
the requirements of high-performance matrix libraries (like cuBLAS).

> I don't know offhand if they support NULLs in their regular matrices.
> They also support a sparsematrix "type" that is actually implemented as
> a table that contains coordinates and a value for each value in the
> matrix. Having that as a type might also be interesting (if you're
> sparse enough, that will be cheaper than the current NULL bitmap
> implementation).
>
Sparse matrices! That is a disadvantaged area for the current array format.

I have two ideas. HPC folks often split a large matrix into multiple
grids. A grid is typically up to a 1024x1024 matrix, for example.
If a grid consists of all zero elements, it is obvious we don't need
to store the individual elements of that grid.
The other idea is compression. If most of the matrix is zero, it is ideal
data for compression, and it is easy to reconstruct only when calculating.

> Related to this, Tom has mentioned in the past that perhaps we should
> support abstract use of the [] construct. Currently point finds a way to
> make use of [], but I think that's actually coded into the grammar.
>
Yep, if we consider a 2D array to be a matrix, no special enhancement is needed
to use []. However, I'm inclined to have our own data structure for matrices,
to represent sparse matrices.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



Re: [HACKERS] Does people favor to have matrix data type?

2016-05-25 Thread Kouhei Kaigai
> On Wed, May 25, 2016 at 10:38 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> > On 25 May 2016 at 03:52, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> >>
> >> In a few days, I'm working for a data type that represents matrix in
> >> mathematical area. Does people favor to have this data type in the core,
> >> not only my extension?
> >
> >
> > If we understood the use case, it might help understand whether to include
> > it or not.
> >
> > Multi-dimensionality of arrays isn't always useful, so this could be good.
> 
> Many natural language and image processing methods extract feature
> vectors that then use some simple distance metric, like dot product to
> calculate vector similarity. For example we presented a latent
> semantic analysis prototype at pgconf.eu 2015 that used real[] to
> store the features and a dotproduct(real[], real[]) real function to
> do similarity matching. However using real[] instead of a hypothetical
> realvector or realmatrix did not prove to be a huge overhead, so
> overall I'm on the fence for the usefulness of a special type. Maybe a
> helper function or two to validate the additional restrictions in a
> check constraint would be enough.
>
Implementing the 'matrix' data type as a domain type over real[] is one option.
We can define operators on domain types; thus, it allows us to process
a large amount of calculation in one operation, at native binary speed.

My only concern is that a domain type is not allowed to define type casts.
If we could add type casts on domains, we could define type transformations
from other array types to matrix.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Does people favor to have matrix data type?

2016-05-25 Thread Kouhei Kaigai
> -Original Message-
> From: Simon Riggs [mailto:si...@2ndquadrant.com]
> Sent: Wednesday, May 25, 2016 4:39 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Does people favor to have matrix data type?
> 
> On 25 May 2016 at 03:52, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> 
> 
>   In a few days, I'm working for a data type that represents matrix in
>   mathematical area. Does people favor to have this data type in the core,
>   not only my extension?
> 
> 
> If we understood the use case, it might help understand whether to include it 
> or not.
> 
> Multi-dimensionality of arrays isn't always useful, so this could be good.
>
As you may expect, the reason why I've been working on a matrix data type is
that it is groundwork for GPU acceleration, but it is not limited to that.

What I tried to do is in-database calculation of some analytic algorithms,
rather than exporting the entire dataset to the client side.
My first target is k-means clustering, often used for data mining.
When we categorize N items which have M attributes into k clusters, the master
data can be represented as an NxM matrix; that is equivalent to N vectors in M dimensions.
The cluster centroids are also located inside the M-dimensional space, so they
can be represented as a kxM matrix; that is equivalent to k vectors in M dimensions.
The k-means algorithm requires calculating the distance to every cluster centroid
for each item, thus it produces an Nxk matrix; that is usually called the distance
matrix. Next, it updates the cluster centroids using the distance matrix, then
repeats the entire process until convergence.

The heart of the workload is the calculation of the distance matrix. When I tried
to write the k-means algorithm using SQL + R, its performance was not sufficient (poor).
  https://github.com/kaigai/toybox/blob/master/Rstat/pgsql-kmeans.r

If we had native functions to use instead of that complicated SQL expression,
it would make sense for people who try in-database analytics.

Also, fortunately, PostgreSQL's 2-D array format is binary compatible with the
BLAS libraries' requirements. That would allow a GPU to process large matrices
at HPC-grade performance.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




[HACKERS] Does people favor to have matrix data type?

2016-05-24 Thread Kouhei Kaigai
These days, I'm working on a data type that represents a matrix in the
mathematical sense. Would people favor having this data type in the core,
not only in my extension?

Like oidvector or int2vector, it is an array type with a few
restrictions:
 - 2 dimensional only
 - never contains NULL
 - element type is real or float
 - no lower bounds of array

A vector in mathematics is a special case of a matrix: either 1xN or Nx1.

We can define various operators between matrix/scalar, like:
  matrix + matrix -> matrix
  matrix - scalar -> matrix
  matrix * matrix -> matrix
  transform(matrix) -> matrix
 :

What are people's thoughts?

If the overall consensus is that it is welcome, I'll prepare a patch.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of David Rowley
> Sent: Tuesday, May 10, 2016 2:01 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] asynchronous and vectorized execution
> 
> On 10 May 2016 at 13:38, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > My concern about ExecProcNode is, it is constructed with a large switch
> > ... case statement. It involves tons of comparison operation at run-time.
> > If we replace this switch ... case by function pointer, probably, it make
> > performance improvement. Especially, OLAP workloads that process large
> > amount of rows.
> 
> I imagined that any decent compiler would have built the code to use
> jump tables for this. I have to say that I've never checked to make
> sure though.
>
Ah, indeed, you are right. Please forget the above part.

In GCC 4.8.5, the case labels between T_ResultState and T_LimitState are
handled using a jump table:

TupleTableSlot *
ExecProcNode(PlanState *node)
{
:
  
:
switch (nodeTag(node))
  5ad361:   8b 03   mov(%rbx),%eax
  5ad363:   2d c9 00 00 00  sub$0xc9,%eax
  5ad368:   83 f8 24cmp$0x24,%eax
  5ad36b:   0f 87 4f 02 00 00   ja 5ad5c0 <ExecProcNode+0x290>
  5ad371:   ff 24 c5 68 48 8b 00jmpq   *0x8b4868(,%rax,8)
  5ad378:   0f 1f 84 00 00 00 00nopl   0x0(%rax,%rax,1)
  5ad37f:   00

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Tuesday, May 10, 2016 2:34 AM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] asynchronous and vectorized execution
> 
> Hi,
> 
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there.  I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).  To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:
> 
> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.
>
Is the parallel-aware Append node sufficient to run multiple nodes
asynchronously? (Sorry, I couldn't find enough time to code the feature
even though we had a discussion before.)
If some of the child nodes are blocked by I/O or other heavy stuff, they
cannot enqueue results into shm_mq; thus, the Gather node naturally
skips nodes that are not ready.
In the above example, a scan on a foreign table takes a longer lead time than
a local scan. If Append can map every child node onto an individual worker,
the local scan worker begins to return tuples first, and then mixed tuples
shall be returned eventually.

However, process-internal asynchronous execution may also be beneficial
in cases where the cost of shm_mq is not ignorable (e.g., no scan qualifiers
are given to the worker process). I think it allows us to implement pre-fetching
very naturally.

> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior,

My concern about ExecProcNode is that it is constructed with a large switch
... case statement. It involves tons of comparison operations at run-time.
If we replaced this switch ... case with function pointers, it would probably
improve performance, especially for OLAP workloads that process a large
amount of rows.

> since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.  I think that's
> probably right.   For example, consider a 5-table join where all of
> the joins are implemented as hash tables.  If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on.  What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat.  If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse.
>
I can agree with the above concern from my experience. Each HashJoin
step needs to fill up a TupleTableSlot at each depth. Mostly, it is
just relocation of the attributes in the case of multi-table joins.

If HashJoin could gather five underlying 

[HACKERS] EPQ recheck across HashJoin, what does it actuall check?

2016-03-31 Thread Kouhei Kaigai
Folks,

Could you correct me if I misunderstand the execution path.

We may have the following example. Query try to get row level locking
on the table 't1' that is under the 'Hash' node.

postgres=# EXPLAIN
postgres-# SELECT * FROM t0 NATURAL JOIN t1 WHERE ax between 10 and 30 FOR UPDATE OF t1;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 LockRows  (cost=2682.53..263118.01 rows=1980172 width=93)
   ->  Hash Join  (cost=2682.53..243316.29 rows=1980172 width=93)
         Hash Cond: (t0.aid = t1.aid)
         ->  Seq Scan on t0  (cost=0.00..183332.58 rows=858 width=46)
         ->  Hash  (cost=2435.00..2435.00 rows=19802 width=51)
               ->  Seq Scan on t1  (cost=0.00..2435.00 rows=19802 width=51)
                     Filter: ((ax >= '10'::double precision) AND (ax <= '30'::double precision))
(7 rows)

A concurrent update of 't1' by another backend may kick off an EPQ recheck to
ensure whether the source tuple is still visible or not.
In this example, if somebody concurrently updated 't1.ax' to 45, LockRows
shall discard the joined tuple and shall not acquire the row lock.

ExecLockRows() fills up the es_epqTuple[] slots with locked/non-locked tuples,
then calls EvalPlanQualNext() to re-evaluate the tuple.
EvalPlanQualNext() is a thin wrapper around ExecProcNode(); thus, it walks
down into the child plan nodes.

If it is some kind of scan node, ExecScan() -> ExecScanFetch() handles
the EPQ recheck appropriately. No problem there.

In the case of HashJoin, it seems to me that ExecHashJoin() takes no special action.
If a record that had 't1.ax' = 20.0 when it was fetched and loaded into the hash
table was updated to 45.0, it ought to be handled as an invisible row.
However, ExecHashJoin() does not walk down into the inner side again; thus,
the updated tuple might be considered visible, and locked.

Unless this is my oversight, ExecHashJoin() and ExecHash() need to call
ExecProcNode() and then re-evaluate the join condition again, apart from the
hash key values of the latest tuples.

Thoughts?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 






Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-28 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Tuesday, March 29, 2016 10:54 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Petr Jelinek; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On Mon, Mar 28, 2016 at 9:36 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > I don't have a strong reason to keep these stuff in separate files.
> > Both stuffs covers similar features and amount of code are enough small.
> > So, the attached v4 just merged custom-node.[ch] stuff into extensible.
> >
> > Once we put similar routines closely, it may be better to consolidate
> > these routines.
> > As long as EXTNODENAME_MAX_LEN == CUSTOM_NAME_MAX_LEN, both features
> > have identical structure layout, so it is easy to call an internal
> > common function to register or find out a table of callbacks according
> > to the function actually called by other modules.
> >
> > I'm inclined to think to replace EXTNODENAME_MAX_LEN and
> > CUSTOM_NAME_MAX_LEN by NAMEDATALEN again, to fit structure layout.
> 
> I don't think that we need both EXTNODENAME_MAX_LEN and
> CUSTOM_NAME_MAX_LEN; we can use EXTNODENAME_MAX_LEN for both.  I'm
> opposed to using NAMEDATALEN for anything unrelated to the size of a
> Name.  If it's not being stored in a catalog, it doesn't need to care.
>
OK, I adjusted the v4 patch to use EXTNODENAME_MAX_LEN for both.

The structure of the hash entry was revised as follows; entries are registered
via an internal common function, RegisterExtensibleNodeEntry, and looked up
via another internal common function, GetExtensibleNodeEntry.

typedef struct
{
    char        extnodename[EXTNODENAME_MAX_LEN];
    const void *extnodemethods;
} ExtensibleNodeEntry;

ExtensibleNodeMethods and CustomScanMethods shall be stored in
'extensible_node_methods' and 'custom_scan_methods' separately.
The entrypoint functions call the above internal common functions with the
appropriate HTAB variable.

It will be reusable if we have further extensible node types in
future versions.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


pgsql-v9.6-custom-scan-serialization-reworks.5.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.5.patch



Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-28 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Friday, March 25, 2016 12:27 AM
> To: Petr Jelinek
> Cc: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On Wed, Mar 23, 2016 at 1:36 PM, Petr Jelinek  wrote:
> > Ok, I am happy with it, marked it as ready for committer (it was marked as
> > committed although it wasn't committed).
> 
> Thanks for fixing the status.   I had forgotten about this thread.
> 
> I can't really endorse the naming conventions here.  I mean, we've got
> the main extensible nodes stuff in extensible.h, and then we've got
> this stuff in custom_node.h (BTW, there is a leftover reference to
> custom-node.h).  There's no hint in the naming that this relates to
> scans, and why is it extensible in one place and custom in another?
> 
> I'm not quite sure how to clean this up.  At a minimum, I think we
> should standardize on "custom_scan.h" instead of "custom_node.h".  I
> think that would be clearer.  But I'm wondering if we should bite the
> bullet and rename everything from "custom" to "extensible" and declare
> it all in "extensible.h".
>
I don't have a strong reason to keep this stuff in separate files.
Both cover similar features, and the amount of code is small enough.
So, the attached v4 just merges the custom-node.[ch] stuff into extensible.

Once we put similar routines close together, it may be better to consolidate them.
As long as EXTNODENAME_MAX_LEN == CUSTOM_NAME_MAX_LEN, both features
have an identical structure layout, so it is easy to call an internal
common function to register or look up a table of callbacks according
to the function actually called by other modules.

I'm inclined to replace EXTNODENAME_MAX_LEN and
CUSTOM_NAME_MAX_LEN with NAMEDATALEN again, to fit the structure layout.

> src/backend/nodes/custom_node.c:45: indent with spaces.
> +}
> 
Oops, thanks,

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-custom-scan-serialization-reworks.4.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.4.patch



Re: [HACKERS] WIP: Upper planner pathification

2016-03-28 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Saturday, March 19, 2016 8:57 AM
> To: Tom Lane
> Cc: Robert Haas; Petr Jelinek; David Rowley; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> > -Original Message-
> > From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> > Sent: Friday, March 18, 2016 11:44 PM
> > To: Kaigai Kouhei(海外 浩平)
> > Cc: Robert Haas; Petr Jelinek; David Rowley; pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] WIP: Upper planner pathification
> >
> > Kouhei Kaigai <kai...@ak.jp.nec.com> writes:
> > > On Wed, Mar 16, 2016 at 2:47 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > >> I do not, however, like the proposal to expose wflists and so forth.
> > >> Those are internal data structures in grouping_planner and I absolutely
> > >> refuse to promise that they're going to stay stable.
> >
> > > Hmm... It's not easy to imagine a case that extension wants own idea
> > > to extract window functions from the target list and select active
> > > windows, even if extension wants to have own executor and own cost
> > > estimation logic.
> > > In case when extension tries to add WindowPath + CustomPath(Sort),
> > > extension is interested in alternative sort task, but not window
> > > function itself. It is natural to follow the built-in implementation,
> > > thus, it motivates extension author to take copy & paste the code.
> > > select_active_windows() is static, so extension needs to have same
> > > routine on their side.
> >
> > Well, to be perfectly blunt about it, I have said from day one that this
> > notion that a CustomScan extension will be able to cause arbitrary planner
> > behavior changes is loony.  We are simply not going to drop a hook into
> > every tenth line of the planner for you, nor de-static-ify every internal
> > function, nor (almost equivalently) expose the data structures those
> > functions produce, because it would cripple core planner development to
> > try to keep the implied APIs stable.  And I continue to maintain that any
> > actually-generally-useful ideas would be better handled by submitting them
> > as patches to the core planner, rather than trying to implement them in an
> > arms-length extension.
> >
> > In the case at hand, I notice that the WindowFuncLists struct is
> > actually from find_window_functions in clauses.c, so an extension
> > that needed to get hold of that would be unlikely to do any copying
> > and pasting anyhow -- it'd just call find_window_functions again.
> > (That only needs to search the targetlist, so it's not that expensive.)
> > The other lists you mention are all tightly tied to specific, and not
> > terribly well-designed, implementation strategies for grouping sets and
> > window functions.  Those are *very* likely to change in the near future;
> > and even if they don't, I'm skeptical that an external implementor of
> > grouping sets or window functions would want to use exactly those same
> > implementation strategies.  Maybe the only reason you're there at all
> > is you want to be smarter about the order of doing window functions,
> > for example.
> >
> > So I really don't want to export any of that stuff.
> >
> Hmm. I could understand we have active development around this area
> thus miscellaneous internal data structure may not be enough stable
> to expose the extension.
> Don't you deny recall the discussion once implementation gets calmed
> down on the future development cycle?
> 
> > As for other details of the proposed patch, I was intending to put
> > all the hook calls into grouping_planner for consistency, rather than
> > scattering them between grouping_planner and its subroutines.  So that
> > would probably mean calling the hook for a given step *before* we
> > generate the core paths for that step, not after.  Did you have a
> > reason to want the other order?  (If you say "so the hook can look
> > at the core-made paths", I'm going to say "that's a bad idea".  It'd
> > further increase the coupling between a CustomScan extension and core.)
> >
> No deep reason. I just followed the manner in scan/join path hook; that
> calls extension once the core feature adds built-in path nodes.
>
Ah, I overlooked a deep reason.
ForeignScan/CustomScan may have an alternative execution path if extension
suppo

Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-23 Thread Kouhei Kaigai
> >> On 15/03/16 05:03, Kouhei Kaigai wrote:
> >>> Petr,
> >>>
> >>> The attached patch is the revised one that follows the new extensible-
> >>> node routine.
> >>>
> >>> It is almost same the previous version except for:
> >>> - custom-apis.[ch] was renamed to custom-node.[ch]
> >>> - check for the length of custom-scan-method name followed
> >>> the manner of RegisterExtensibleNodeMethods()
> >>>
> >>
> >> Hi,
> >>
> >> looks good, only nitpick I have is that it probably should be
> >> custom_node.h with underscore given that we use underscore everywhere
> >> (except for libpq and for some reason atomic ops).
> >>
> > Sorry for my response late.
> >
> > The attached patch just renamed custom-node.[ch] by custom_node.[ch].
> > Other portions are not changed from the previous revison.
> >
> 
> Forgot to attach?
>
Yes. Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



pgsql-v9.6-custom-scan-serialization-reworks.3.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.3.patch



Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-23 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> Sent: Thursday, March 17, 2016 5:06 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On 15/03/16 05:03, Kouhei Kaigai wrote:
> > Petr,
> >
> > The attached patch is the revised one that follows the new extensible-
> > node routine.
> >
> > It is almost same the previous version except for:
> > - custom-apis.[ch] was renamed to custom-node.[ch]
> > - check for the length of custom-scan-method name followed
> >the manner of RegisterExtensibleNodeMethods()
> >
> 
> Hi,
> 
> looks good, only nitpick I have is that it probably should be
> custom_node.h with underscore given that we use underscore everywhere
> (except for libpq and for some reason atomic ops).
>
Sorry for my late response.

The attached patch just renames custom-node.[ch] to custom_node.[ch].
Other portions are unchanged from the previous revision.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] WIP: Upper planner pathification

2016-03-19 Thread Kouhei Kaigai
> Robert Haas <robertmh...@gmail.com> writes:
> > On Wed, Mar 16, 2016 at 2:47 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >> Robert Haas <robertmh...@gmail.com> writes:
> >>> On Mon, Mar 14, 2016 at 9:21 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> 
> >>> wrote:
> >>>> So, even though we don't need to define multiple hook declarations,
> >>>> I think the hook invocation is needed just after create__paths()
> >>>> for each. It will need to inform extension the context of hook
> >>>> invocation, the argument list will take UpperRelationKind.
> 
> >>> That actually seems like a pretty good point.  Otherwise you can't
> >>> push anything from the upper rels down unless you are prepared to
> >>> handle all of it.
> 
> >> I'm not exactly convinced of the use-case for that.  What external
> >> thing is likely to handle window functions but not aggregation,
> >> for example?
> 
> > I'm not going to say that you're entirely wrong, but I think that
> > attitude is a bit short-sighted.
> 
> Well, I'm prepared to yield to the extent of repeating the hook call
> before each phase with an UpperRelationKind parameter to tell which phase
> we're about to do.  The main concern here is to avoid redundant
> computation, but the hook can check the "kind" parameter and fall out
> quickly if it has nothing useful to do at the current phase.
> 
> I do not, however, like the proposal to expose wflists and so forth.
> Those are internal data structures in grouping_planner and I absolutely
> refuse to promise that they're going to stay stable.  (I had already
> been thinking a couple of weeks ago about revising the activeWindows
> data structure, now that it would be reasonably practical to cost out
> different orders for doing the window functions in.)  I think a hook
> that has its own ideas about window function implementation methods
> can gather its own information about the WFs without that much extra
> cost, and it very probably wouldn't want exactly the same data that
> create_window_paths uses today anyway.
> 
> So what I would now propose is
> 
> typedef void (*create_upper_paths_hook_type) (PlannerInfo *root,
>   UpperRelationKind stage,
>   RelOptInfo *input_rel);
>
Hmm... It's not easy to imagine a case where an extension wants its own logic
to extract window functions from the target list and select active windows,
even if the extension wants to have its own executor and its own cost
estimation logic.
In the case where an extension tries to add WindowPath + CustomPath(Sort),
the extension is interested in the alternative sort task, but not in the window
function itself. It is natural to follow the built-in implementation;
thus, it motivates extension authors to copy & paste the code.
select_active_windows() is static, so an extension needs to have the same
routine on its side.

On the other hand, 'rollup_lists' and 'rollup_groupclauses' need
three static functions (extract_rollup_sets(), reorder_grouping_sets()
and preprocess_groupclause()) to reproduce the equivalent data structures.
It is a larger copy & paste burden if the extension is not interested in
extracting the information related to grouping sets.

I understand it is not "best", but it is better to provide the extra information
needed for an extension to reproduce an equivalent path node, even if the fields
of the UpperPathExtraData structure are not stable right now.

> and have this invoked at each stage right before we call
> create_grouping_paths, create_window_paths, etc.
>
That seems reasonable to me.

> Also, I don't particularly see a need for a corresponding API for FDWs.
> If an FDW is going to do anything in this space, it presumably has to
> build up ForeignPaths for all the steps anyway.  So I'd be inclined
> to leave GetForeignUpperPaths as-is.
>
That seems reasonable to me. An FDW driver which is interested in remote
execution of upper paths can use the hook arbitrarily.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>





Re: [HACKERS] WIP: Upper planner pathification

2016-03-19 Thread Kouhei Kaigai
> > Robert Haas <robertmh...@gmail.com> writes:
> > > On Mon, Mar 14, 2016 at 9:21 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> 
> > > wrote:
> > >> So, even though we don't need to define multiple hook declarations,
> > >> I think the hook invocation is needed just after create__paths()
> > >> for each. It will need to inform extension the context of hook
> > >> invocation, the argument list will take UpperRelationKind.
> >
> > > That actually seems like a pretty good point.  Otherwise you can't
> > > push anything from the upper rels down unless you are prepared to
> > > handle all of it.
> >
> > I'm not exactly convinced of the use-case for that.  What external
> > thing is likely to handle window functions but not aggregation,
> > for example?
> >
> WindowPath usually takes a SortPath. Even though extension don't want to
> handle window function itself, it may want to add alternative sort logic
> than built-in.
> Unless it does not calculate expected cost, nobody knows whether WindowPath +
> SortPath is really cheaper than WindowPath + CustomPath("GpuSort").
> 
> The supplied query may require to run group-by prior to window function,
> but extension may not be interested in group-by on the other hands, thus,
> extension needs to get control around the location where built-in logic
> also adds paths to fetch the cheapest path of the underlying paths.
>
If I were to design the hook, I would put its entry points at:
- the tail of create_grouping_paths(), just before set_cheapest()
- the tail of create_window_paths(), just before set_cheapest()
- the tail of create_distinct_paths(), just before set_cheapest()
- the tail of create_ordered_paths(), just before set_cheapest()
- the tail of grouping_planner(), after the loop of create_modifytable_path()

I'm not 100% certain whether the last one is the straightforward way
to provide alternative writing stuff. For example, if an extension has
its own special storage like a columnar format, we may need more consideration
of whether CustomPath and related stuff are the suitable tools.

On the other hand, I believe the earlier 4 entry points are in the right locations.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


pgsql-v9.6-upper-custom-path.1.patch
Description: pgsql-v9.6-upper-custom-path.1.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Upper planner pathification

2016-03-19 Thread Kouhei Kaigai
> Robert Haas <robertmh...@gmail.com> writes:
> > On Mon, Mar 14, 2016 at 9:21 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> >> So, even though we don't need to define multiple hook declarations,
> >> I think the hook invocation is needed just after create__paths()
> >> for each. It will need to inform extension the context of hook
> >> invocation, the argument list will take UpperRelationKind.
> 
> > That actually seems like a pretty good point.  Otherwise you can't
> > push anything from the upper rels down unless you are prepared to
> > handle all of it.
> 
> I'm not exactly convinced of the use-case for that.  What external
> thing is likely to handle window functions but not aggregation,
> for example?
>
A WindowPath usually takes a SortPath as input. Even if an extension does
not want to handle the window function itself, it may want to add sort
logic alternative to the built-in one.
Unless it calculates the expected cost, nobody knows whether WindowPath +
SortPath is really cheaper than WindowPath + CustomPath("GpuSort").

The supplied query may require running group-by prior to the window
function, while the extension may not be interested in group-by; thus,
the extension needs to get control at the location where the built-in
logic also adds its paths, so it can fetch the cheapest of the underlying
paths.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>





Re: [HACKERS] WIP: Upper planner pathification

2016-03-18 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Sent: Friday, March 18, 2016 11:44 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; Petr Jelinek; David Rowley; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> Kouhei Kaigai <kai...@ak.jp.nec.com> writes:
> > On Wed, Mar 16, 2016 at 2:47 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >> I do not, however, like the proposal to expose wflists and so forth.
> >> Those are internal data structures in grouping_planner and I absolutely
> >> refuse to promise that they're going to stay stable.
> 
> > Hmm... It's not easy to imagine a case that extension wants own idea
> > to extract window functions from the target list and select active
> > windows, even if extension wants to have own executor and own cost
> > estimation logic.
> > In case when extension tries to add WindowPath + CustomPath(Sort),
> > extension is interested in alternative sort task, but not window
> > function itself. It is natural to follow the built-in implementation,
> > thus, it motivates extension author to take copy & paste the code.
> > select_active_windows() is static, so extension needs to have same
> > routine on their side.
> 
> Well, to be perfectly blunt about it, I have said from day one that this
> notion that a CustomScan extension will be able to cause arbitrary planner
> behavior changes is loony.  We are simply not going to drop a hook into
> every tenth line of the planner for you, nor de-static-ify every internal
> function, nor (almost equivalently) expose the data structures those
> functions produce, because it would cripple core planner development to
> try to keep the implied APIs stable.  And I continue to maintain that any
> actually-generally-useful ideas would be better handled by submitting them
> as patches to the core planner, rather than trying to implement them in an
> arms-length extension.
> 
> In the case at hand, I notice that the WindowFuncLists struct is
> actually from find_window_functions in clauses.c, so an extension
> that needed to get hold of that would be unlikely to do any copying
> and pasting anyhow -- it'd just call find_window_functions again.
> (That only needs to search the targetlist, so it's not that expensive.)
> The other lists you mention are all tightly tied to specific, and not
> terribly well-designed, implementation strategies for grouping sets and
> window functions.  Those are *very* likely to change in the near future;
> and even if they don't, I'm skeptical that an external implementor of
> grouping sets or window functions would want to use exactly those same
> implementation strategies.  Maybe the only reason you're there at all
> is you want to be smarter about the order of doing window functions,
> for example.
> 
> So I really don't want to export any of that stuff.
>
Hmm. I can understand that this area is under active development, so
miscellaneous internal data structures may not be stable enough to expose
to extensions.
Would you be open to revisiting this discussion once the implementation
has settled down in a future development cycle?

> As for other details of the proposed patch, I was intending to put
> all the hook calls into grouping_planner for consistency, rather than
> scattering them between grouping_planner and its subroutines.  So that
> would probably mean calling the hook for a given step *before* we
> generate the core paths for that step, not after.  Did you have a
> reason to want the other order?  (If you say "so the hook can look
> at the core-made paths", I'm going to say "that's a bad idea".  It'd
> further increase the coupling between a CustomScan extension and core.)
>
No deep reason. I just followed the convention of the scan/join path hook,
which calls the extension after the core code has added its built-in path
nodes.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-14 Thread Kouhei Kaigai
Petr,

The attached patch is the revised one that follows the new
extensible-node routine.

It is almost the same as the previous version, except that:
- custom-apis.[ch] was renamed to custom-node.[ch]
- the length check on the custom-scan-method name now follows
  the manner of RegisterExtensibleNodeMethods()

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Tuesday, March 15, 2016 2:54 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Petr Jelinek; pgsql-hackers@postgresql.org
> Subject: ##freemail## Re: [HACKERS] Reworks of CustomScan
> serialization/deserialization
> 
> On Sun, Mar 13, 2016 at 9:53 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > OK, I split the previous small patch into two tiny patches.
> > The one is bugfix around max length of the extnodename.
> > The other replaces Assert() by ereport() according to the upthread 
> > discussion.
> 
> Committed, except that (1) I replaced ereport() with elog(), because I
> can't see making translators care about this message; and (2) I
> reworded the error message a bit.
> 
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


pgsql-v9.6-custom-scan-serialization-reworks.2.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.2.patch



Re: [HACKERS] WIP: Upper planner pathification

2016-03-14 Thread Kouhei Kaigai




> -Original Message-
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Sent: Tuesday, March 15, 2016 2:04 AM
> To: Petr Jelinek
> Cc: Kaigai Kouhei(海外 浩平); David Rowley; Robert Haas;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> Petr Jelinek <p...@2ndquadrant.com> writes:
> > On 14/03/16 02:43, Kouhei Kaigai wrote:
> >> Even though I couldn't check the new planner implementation entirely,
> >> it seems to be the points below are good candidate to inject CustomPath
> >> (and potentially ForeignScan).
> >>
> >> - create_grouping_paths
> >> - create_window_paths
> >> - create_distinct_paths
> >> - create_ordered_paths
> >> - just below of the create_modifytable_path
> >> (may be valuable for foreign-update pushdown)
> 
> > To me that seems too low inside the planning tree, perhaps adding it
> > just to the subquery_planner before SS_identify_outer_params would be
> > better, that's the place where you see the path for the whole (sub)query
> > so you can search and modify what you need from there.
> 
> I don't like either of those too much.  The main thing I've noticed over
> the past few days is that you can't readily generate custom upper-level
> Paths unless you know what PathTarget grouping_planner is expecting each
> level to produce.  So what I was toying with doing is (1) having
> grouping_planner put all those targets into the PlannerInfo, perhaps
> in an array indexed by UpperRelationKind; and (2) adding a hook call
> immediately after those targets are computed, say right before
> the create_grouping_paths() call (approximately planner.c:1738
> in HEAD).  It should be sufficient to have one hook there since
> you can inject Paths into any of the upper relations at that point;
> moreover, that's late enough that you shouldn't have to recompute
> anything you figured out during scan/join planning.
>
Regarding (2), I doubt that location is reasonable, because the pathlist
of each upper_rels[] entry is still empty at that point, isn't it?
That makes it hard for an extension to construct its own CustomPath on
top of the underlying built-in path nodes.

For example, if an extension implements its own sort logic but is not
interested in group-by or window functions, it will want to add its
CustomPath to UPPERREL_ORDERED; however, it does not know which relation
is the input_rel, and no built-in paths have been added yet at the point
of create_upper_paths_hook().

On the other hand, there is another problem if we put a hook after all
the upper paths are done: the built-in create_*_paths() functions cannot
take into account a CustomPath added later when they pick the cheapest
path.

So, even though we don't need to define multiple hook declarations, I
think the hook needs to be invoked just after each create_*_paths()
call. It will need to inform the extension of the context of the hook
invocation, so the argument list should take an UpperRelationKind.

In addition, the extension cannot reach some local variables from the
root structure, like:
 - rollup_lists
 - rollup_groupclauses
 - wflists
 - activeWindows
 - have_postponed_srfs
As we do for set_join_pathlist_hook, it would be a good idea to define an
UpperPathExtraData structure to pack such miscellaneous information.

So, how about redefining the hook as follows?

typedef void (*create_upper_paths_hook_type) (UpperRelationKind upper_kind,
  PlannerInfo *root,
  RelOptInfo *scan_join_rel,
  UpperPathExtraData *extra);

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>





Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-14 Thread Kouhei Kaigai
> On Sun, Mar 13, 2016 at 9:53 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > OK, I split the previous small patch into two tiny patches.
> > The one is bugfix around max length of the extnodename.
> > The other replaces Assert() by ereport() according to the upthread 
> > discussion.
> 
> Committed, except that (1) I replaced ereport() with elog(), because I
> can't see making translators care about this message; and (2) I
> reworded the error message a bit.
>
Thanks, and now I see the point: ereport() is meant for error messages
that may be visible to users, whereas elog() is for internal ones.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Use %u to print user mapping's umid and userid

2016-03-14 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Etsuro Fujita
> Sent: Monday, March 14, 2016 4:59 PM
> To: Ashutosh Bapat; Tom Lane
> Cc: pgsql-hackers
> Subject: Re: [HACKERS] Use %u to print user mapping's umid and userid
> 
> Hi,
> 
> On 2016/02/09 14:09, Ashutosh Bapat wrote:
> > Sorry, I was wrong. For public user mapping userid is 0 (InvalidOid),
> > which is returned as is in UserMapping object. I confused InvalidOid
> > with -1.
> 
> I think the following umid handling in postgresGetForeignPlan has the
> same issue:
> 
>  /*
>   * Build the fdw_private list that will be available to the executor.
>   * Items in the list must match order in enum FdwScanPrivateIndex.
>   */
>  fdw_private = list_make4(makeString(sql.data),
>   retrieved_attrs,
>   makeInteger(fpinfo->fetch_size),
>   makeInteger(foreignrel->umid));
> 
> I don't think it's correct to use makeInteger for the foreignrel's umid.
>
BTW, using ExtensibleNode lets us avoid the problems that come from
data-format translation.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



Re: [HACKERS] WIP: Upper planner pathification

2016-03-13 Thread Kouhei Kaigai
> -Original Message-
> From: Petr Jelinek [mailto:p...@2ndquadrant.com]
> Sent: Monday, March 14, 2016 12:18 PM
> To: Kaigai Kouhei(海外 浩平); Tom Lane; David Rowley
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> On 14/03/16 02:43, Kouhei Kaigai wrote:
> >
> > CustomPath node is originally expected to generate various kind of plan
> > node, not only scan/join, and its interface is designed to support them.
> > For example, we can expect a CustomPath that generates "CustomSort".
> >
> > On the other hands, upper path consideration is more variable than the
> > case of scan/join path consideration. Probably, we can have no centralized
> > point to add custom-paths for sort, group-by, ...
> > So, I think we have hooks for each (supported) upper path workload.
> >
> > In case of sorting for example, the best location of the hook is just
> > above of the Assert() in the create_ordered_paths(). It allows to compare
> > estimated cost between SortPath and CustomPath.
> > However, it does not allow to inject CustomPath(for sort) into the path
> > node that may involve sorting, like WindowPath or AggPath.
> > Thus, another hook may be put on create_window_paths and
> > create_grouping_paths in my thought.
> >
> > Some other good idea?
> >
> > Even though I couldn't check the new planner implementation entirely,
> > it seems to be the points below are good candidate to inject CustomPath
> > (and potentially ForeignScan).
> >
> > - create_grouping_paths
> > - create_window_paths
> > - create_distinct_paths
> > - create_ordered_paths
> > - just below of the create_modifytable_path
> >(may be valuable for foreign-update pushdown)
> >
> 
> To me that seems too low inside the planning tree, perhaps adding it
> just to the subquery_planner before SS_identify_outer_params would be
> better, that's the place where you see the path for the whole (sub)query
> so you can search and modify what you need from there.
>
Thanks for your idea. Yes, I also considered a similar point: the place
where all the path consideration has been completed. It would indeed
allow extensions to walk down the path tree and replace parts of it.
However, when we want to inject a CustomPath under the built-in paths,
the extension has to recalculate the cost of the built-in paths, and
that may affect the selection among the built-in paths themselves.
So, I concluded that it is not realistic to re-implement the equivalent
upper-planning work on the extension side, which is what a hook placed
after all the planning work is done would require.

If the extension can add its CustomPath at create_grouping_paths(), the
later steps, like create_window_paths(), build on the estimated cost of
the CustomPath. Thus, the extension doesn't need to know the details of
the entire upper planner.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>





Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-13 Thread Kouhei Kaigai
> On 14/03/16 02:53, Kouhei Kaigai wrote:
> >> -Original Message-
> >> From: pgsql-hackers-ow...@postgresql.org
> >> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> >> Sent: Friday, March 11, 2016 12:27 AM
> >> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> >> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> >>
> >> On 10/03/16 08:08, Kouhei Kaigai wrote:
> >>>>
> >>>>>> Also in RegisterCustomScanMethods
> >>>>>> +  Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> >>>>>>
> >>>>>> Shouldn't this be actually "if" with ereport() considering this is
> >>>>>> public API and extensions can pass anything there? (for that matter 
> >>>>>> same
> >>>>>> is true for RegisterExtensibleNodeMethods but that's already 
> >>>>>> committed).
> >>>>>>
> >>>>> Hmm. I don't have clear answer which is better. The reason why I put
> >>>>> Assert() here is that only c-binary extension uses this interface, thus,
> >>>>> author will fix up the problem of too long name prior to its release.
> >>>>> Of course, if-with-ereport() also informs extension author the name is
> >>>>> too long.
> >>>>> One downside of Assert() may be, it makes oversight if --enable-cassert
> >>>>> was not specified.
> >>>>>
> >>>>
> >>>> Well that's exactly my problem, this should IMHO throw error even
> >>>> without --enable-cassert. It's not like it's some performance sensitive
> >>>> API where if would be big problem, ensuring correctness of the input is
> >>>> more imporant here IMHO.
> >>>>
> >>> We may need to fix up RegisterExtensibleNodeMethods() first.
> >>>
> >>> Also, length limitation is (EXTNODENAME_MAX_LEN-1) because the last byte
> >>> is consumed by '\0' character. In fact, hash, match and keycopy function
> >>> of HTAB for string keys deal with the first (keysize - 1) bytes.
> >>> So, strkey(extnodename) == EXTNODENAME_MAX_LEN is not legal.
> >>>
> >>
> >> Yes, my thoughts as well but that can be separate tiny patch that does
> >> not have to affect this one. In my opinion if we fixed this one it would
> >> be otherwise ready to go in, and I definitely prefer this approach to
> >> the previous one.
> >>
> > OK, I split the previous small patch into two tiny patches.
> > The one is bugfix around max length of the extnodename.
> > The other replaces Assert() by ereport() according to the upthread 
> > discussion.
> >
> 
> Okay, it's somewhat akin to hairsplitting but works for me. Do you plan
> to do same thing with the CustomScan patch itself as well?.
>
Yes. I'll fix up the patch to follow the same convention.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>





Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-13 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> Sent: Friday, March 11, 2016 12:27 AM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On 10/03/16 08:08, Kouhei Kaigai wrote:
> >>
> >>>> Also in RegisterCustomScanMethods
> >>>> +Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> >>>>
> >>>> Shouldn't this be actually "if" with ereport() considering this is
> >>>> public API and extensions can pass anything there? (for that matter same
> >>>> is true for RegisterExtensibleNodeMethods but that's already committed).
> >>>>
> >>> Hmm. I don't have clear answer which is better. The reason why I put
> >>> Assert() here is that only c-binary extension uses this interface, thus,
> >>> author will fix up the problem of too long name prior to its release.
> >>> Of course, if-with-ereport() also informs extension author the name is
> >>> too long.
> >>> One downside of Assert() may be, it makes oversight if --enable-cassert
> >>> was not specified.
> >>>
> >>
> >> Well that's exactly my problem, this should IMHO throw error even
> >> without --enable-cassert. It's not like it's some performance sensitive
> >> API where if would be big problem, ensuring correctness of the input is
> >> more imporant here IMHO.
> >>
> > We may need to fix up RegisterExtensibleNodeMethods() first.
> >
> > Also, length limitation is (EXTNODENAME_MAX_LEN-1) because the last byte
> > is consumed by '\0' character. In fact, hash, match and keycopy function
> > of HTAB for string keys deal with the first (keysize - 1) bytes.
> > So, strkey(extnodename) == EXTNODENAME_MAX_LEN is not legal.
> >
> 
> Yes, my thoughts as well but that can be separate tiny patch that does
> not have to affect this one. In my opinion if we fixed this one it would
> be otherwise ready to go in, and I definitely prefer this approach to
> the previous one.
>
OK, I split the previous small patch into two tiny patches.
One is a bug fix for the maximum length of the extnodename.
The other replaces Assert() with ereport(), per the upthread discussion.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



pgsql-v9.6-fix-extnodename-max-len.patch
Description: pgsql-v9.6-fix-extnodename-max-len.patch


pgsql-v9.6-replace-assert-by-ereport-on-register-extnode.patch
Description: pgsql-v9.6-replace-assert-by-ereport-on-register-extnode.patch



Re: [HACKERS] WIP: Upper planner pathification

2016-03-13 Thread Kouhei Kaigai
Hello,

I'm now checking the new planner implementation to find a way to
integrate CustomPath into the upper planner as well.
The CustomPath node was originally expected to generate various kinds of
plan nodes, not only scan/join, and its interface is designed to support
them. For example, we can expect a CustomPath that generates a "CustomSort".

On the other hand, upper-path consideration is more varied than
scan/join-path consideration. We probably have no centralized point at
which to add custom paths for sort, group-by, and so on.
So, I think we need a hook for each (supported) upper-path workload.

In the case of sorting, for example, the best location for the hook is
just above the Assert() in create_ordered_paths(). It allows comparing
the estimated costs of a SortPath and a CustomPath.
However, it does not allow injecting a CustomPath (for sort) into path
nodes that may involve sorting, like WindowPath or AggPath.
Thus, in my view, additional hooks may be needed in create_window_paths()
and create_grouping_paths().

Any other good ideas?

Although I couldn't check the new planner implementation entirely, the
points below seem to be good candidates for injecting a CustomPath
(and potentially a ForeignScan):

- create_grouping_paths
- create_window_paths
- create_distinct_paths
- create_ordered_paths
- just below of the create_modifytable_path
  (may be valuable for foreign-update pushdown)

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Sent: Saturday, March 05, 2016 3:02 AM
> To: David Rowley
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> OK, here is a version that I think addresses all of the recent comments:
> 
> * I refactored the grouping-sets stuff as suggested by Robert and David.
> The GroupingSetsPath code is now used *only* when there are grouping sets,
> otherwise what you get is a plain AGG_SORTED AggPath.  This allowed
> removal of a boatload of weird corner cases in the GroupingSets code path,
> so it was a good change.  (Fundamentally, that's cleaning up some
> questionable coding in the grouping sets patch rather than fixing anything
> directly related to pathification, but I like the code better now.)
> 
> * I refactored the handling of targetlists in createplan.c.  After some
> reflection I decided that the disuse_physical_tlist callers fell into
> three separate categories: those that actually needed the exact requested
> tlist to be returned, those that wanted non-bloated tuples because they
> were going to put them into sort or hash storage, and those that needed
> grouping columns to be properly labeled.  The new approach is to pass down
> a "flags" word that specifies which if any of these cases apply at a
> specific plan level.  use_physical_tlist now always makes the right
> decision to start with, and disuse_physical_tlist is gone entirely, which
> should make things a bit faster since we won't uselessly construct and
> discard physical tlists.  The missing logic from make_subplanTargetList
> and locate_grouping_columns is reincarnated in the physical-tlist code.
> 
> * Added explicit limit/offset fields to LimitPath, as requested by Teodor.
> 
> * Removed SortPath.sortgroupclauses.
> 
> * Fixed handling of parallel-query fields in new path node types.
> (BTW, I found what seemed to be a couple of pre-existing bugs of
> the same kind, eg create_mergejoin_path was different from the
> other two kinds of join as to setting parallel_degree.)
> 
> 
> What remains to be done, IMV:
> 
> * Performance testing as per yesterday's discussion.
> 
> * Debug support in outfuncs.c and print_path() for new node types.
> 
> * Clean up unfinished work on function header comments.
> 
> * Write some documentation about how FDWs might use this.
> 
> I'll work on the performance testing next.  Barring unsatisfactory
> results from that, I think this could be committable in a couple
> of days.
> 
>   regards, tom lane





Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> Sent: Thursday, March 10, 2016 11:01 AM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On 10/03/16 02:18, Kouhei Kaigai wrote:
> >
> >> I am not sure I like the fact that we have this EXTNODENAME_MAX_LEN and
> >> now the CUSTOM_NAME_MAX_LEN with the same length and also they are both
> >> same lenght as NAMEDATALEN I wonder if this shouldn't be somehow
> >> squished to less defines.
> >>
> > Hmm. I just followed the manner in extensible.c, because this label was
> > initially NAMEDATALEN, then Robert changed it with EXTNODENAME_MAX_LEN.
> > I guess he avoid to apply same label on different entities - NAMEDATALEN
> > is a limitation for NameData type, but identifier of extensible-node and
> > custom-scan node are not restricted by.
> >
> 
> Makes sense.
> 
> >> Also in RegisterCustomScanMethods
> >> +  Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> >>
> >> Shouldn't this be actually "if" with ereport() considering this is
> >> public API and extensions can pass anything there? (for that matter same
> >> is true for RegisterExtensibleNodeMethods but that's already committed).
> >>
> > Hmm. I don't have clear answer which is better. The reason why I put
> > Assert() here is that only c-binary extension uses this interface, thus,
> > author will fix up the problem of too long name prior to its release.
> > Of course, if-with-ereport() also informs extension author the name is
> > too long.
> > One downside of Assert() may be, it makes oversight if --enable-cassert
> > was not specified.
> >
> 
> Well that's exactly my problem, this should IMHO throw error even
> without --enable-cassert. It's not like it's some performance sensitive
> API where if would be big problem, ensuring correctness of the input is
> more imporant here IMHO.
>
We may need to fix up RegisterExtensibleNodeMethods() first.

Also, the length limit is (EXTNODENAME_MAX_LEN - 1) because the last byte
is consumed by the '\0' terminator. In fact, the hash, match, and keycopy
functions of an HTAB with string keys only deal with the first
(keysize - 1) bytes.
So, strlen(extnodename) == EXTNODENAME_MAX_LEN is not legal.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



pgsql-v9.6-extensible-namelen-check-by-ereport.patch
Description: pgsql-v9.6-extensible-namelen-check-by-ereport.patch



Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-09 Thread Kouhei Kaigai
> On 29/02/16 13:07, Kouhei Kaigai wrote:
> >
> > I'd like to adjust a few of custom-scan interface prior to v9.6 freeze.
> >
> > The major point is serialization/deserialization mechanism.
> > Now, extension has to give LibraryName and SymbolName to reproduce
> > same CustomScanMethods on the background worker process side. Indeed,
> > it is sufficient information to pull the table of function pointers.
> >
> > On the other hands, we now have different mechanism to wrap private
> > information - extensible node. It requires extensions to register its
> > ExtensibleNodeMethods identified by name, usually, on _PG_init() time.
> > It is also reasonable way to reproduce same objects on background
> > worker side.
> >
> > However, mixture of two different ways is not good. My preference is
> > what extensible-node is doing rather than what custom-scan is currently
> > doing.
> > The attached patch allows extension to register CustomScanMethods once,
> > then readFunc.c can pull this table by CustomName in string form.
> >
> 
> Agreed, but this will break compatibility right?
>
The mechanism of passing a pair of library name and symbol name is a new
feature in v9.6, not in v9.5, so now is the last chance to fix up the
interface requirement.

> > I'm not 100% certain whether "nodes/custom-apis.h" is the best location,
> > but somewhere we can put these declarations rather than the primitive
> > header files might be needed.
> 
> custom-apis.c does not sound like right name to me, maybe it can be just
> custom.c but custom.h might be bit too generic, maybe custom_node.h
>
OK, custom_node.h may be better.

> I am not sure I like the fact that we have this EXTNODENAME_MAX_LEN and
> now the CUSTOM_NAME_MAX_LEN with the same length and also they are both
> same lenght as NAMEDATALEN I wonder if this shouldn't be somehow
> squished to less defines.
>
Hmm. I just followed the convention in extensible.c: this limit was
initially NAMEDATALEN, and then Robert changed it to EXTNODENAME_MAX_LEN.
I guess he avoided applying the same label to different entities -
NAMEDATALEN is a limit on the NameData type, but the identifiers of
extensible nodes and custom-scan nodes are not restricted by it.

> Also in RegisterCustomScanMethods
> + Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> 
> Shouldn't this be actually "if" with ereport() considering this is
> public API and extensions can pass anything there? (for that matter same
> is true for RegisterExtensibleNodeMethods but that's already committed).
>
Hmm. I don't have a clear answer as to which is better. The reason I put
Assert() here is that only C-binary extensions use this interface, so the
author will fix an over-long name prior to release.
Of course, an if with ereport() would also inform the extension author
that the name is too long.
One downside of Assert() is that the problem goes unnoticed if
--enable-cassert was not specified.
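To make the trade-off concrete, here is a minimal, self-contained sketch of
the two validation styles (plain C stand-ins: assert() plays the role of
Assert() and vanishes under NDEBUG just as Assert() vanishes without
--enable-cassert; the value of CUSTOM_NAME_MAX_LEN is assumed here, and the
real code would call ereport(ERROR, ...) instead of printing):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CUSTOM_NAME_MAX_LEN 64          /* assumed value, for illustration */

/* Assert-style check: compiled out entirely when NDEBUG is defined,
 * just as Assert() is compiled out without --enable-cassert. */
static void register_with_assert(const char *name)
{
    assert(strlen(name) <= CUSTOM_NAME_MAX_LEN);
    /* ... insert the methods table into the registry ... */
}

/* ereport-style check: always active, so a too-long name is caught
 * even in a production (non-cassert) build. */
static int register_with_report(const char *name)
{
    if (strlen(name) > CUSTOM_NAME_MAX_LEN)
    {
        fprintf(stderr, "custom-scan name \"%s\" is too long "
                        "(maximum %d bytes)\n", name, CUSTOM_NAME_MAX_LEN);
        return -1;                     /* ereport(ERROR, ...) in real code */
    }
    /* ... insert the methods table into the registry ... */
    return 0;
}
```

The second form trades a few extra instructions for a guaranteed diagnostic
in every build, which is the argument for it in a public API.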

> Other than that this seems like straight conversion to same basic
> template as extensible nodes so I think it's ok.
> 

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-03-07 Thread Kouhei Kaigai




> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Saturday, March 05, 2016 2:42 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On Thu, Mar 3, 2016 at 8:54 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > I found one other, but tiny, problem to implement SSD-to-GPU direct
> > data transfer feature under the PostgreSQL storage.
> >
> > Extension cannot know the raw file descriptor opened by smgr.
> >
> > I expect an extension issues an ioctl(2) on the special device file
> > on behalf of the special kernel driver, to control the P2P DMA.
> > This ioctl(2) will pack file descriptor of the DMA source and some
> > various information (like base position, range, destination device
> > pointer, ...).
> >
> > However, the raw file descriptor is wrapped in the fd.c, instead of
> > the File handler, thus, not visible to extension. oops...
> >
> > The attached patch provides a way to obtain raw file descriptor (and
> > relevant flags) of a particular File virtual file descriptor on
> > PostgreSQL. (No need to say, extension has to treat the raw descriptor
> > carefully not to give an adverse effect to the storage manager.)
> >
> > How about this tiny enhancement?
> 
> Why not FileDescriptor(), FileFlags(), FileMode() as separate
> functions like FilePathName()?
>
There is no deep reason. The attached patch adds three individual
functions.
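A minimal sketch of what such accessors might look like, using a simplified
stand-in for fd.c's Vfd cache (the function names follow this discussion,
but the structure and bodies here are illustrative, not the actual patch):

```c
/* Simplified stand-in for fd.c's virtual-file-descriptor cache entry. */
typedef struct Vfd
{
    int fd;                 /* raw kernel file descriptor          */
    int fileFlags;          /* open(2) flags used for this file    */
    int fileMode;           /* mode given to open(2)               */
} Vfd;

typedef int File;           /* index into VfdCache, as in fd.c     */

static Vfd VfdCache[32];

/* Three individual accessors, in the same style as FilePathName(). */
int FileGetRawDesc(File file)  { return VfdCache[file].fd; }
int FileGetRawFlags(File file) { return VfdCache[file].fileFlags; }
int FileGetRawMode(File file)  { return VfdCache[file].fileMode; }
```

Separate accessors keep each call trivial and avoid inventing an out-parameter
struct for what are really three independent lookups.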

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>



pgsql-v9.6-filegetrawdesc.2.patch
Description: pgsql-v9.6-filegetrawdesc.2.patch



Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-03-03 Thread Kouhei Kaigai
I found one other, tiny, problem in implementing the SSD-to-GPU direct
data transfer feature under PostgreSQL storage.

An extension cannot know the raw file descriptor opened by smgr.

I expect an extension to issue an ioctl(2) on a special device file,
backed by a special kernel driver, to control the P2P DMA.
This ioctl(2) will pack the file descriptor of the DMA source and
various other information (like base position, range, destination device
pointer, ...).

However, the raw file descriptor is wrapped inside fd.c behind the File
handle, and thus not visible to extensions. Oops...

The attached patch provides a way to obtain the raw file descriptor (and
relevant flags) of a particular File virtual file descriptor in
PostgreSQL. (Needless to say, an extension has to treat the raw
descriptor carefully so as not to adversely affect the storage manager.)

How about this tiny enhancement?
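As a sketch, the ioctl(2) request block described above might look like the
following. Every field and macro name here is hypothetical; the real layout
would be dictated by the kernel driver's UAPI header:

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical request an extension could hand to the P2P DMA driver. */
struct p2p_dma_request
{
    int      source_fd;     /* raw descriptor of the DMA source file   */
    uint64_t offset;        /* base position within the file           */
    uint64_t length;        /* number of bytes to transfer             */
    uint64_t dest_devptr;   /* destination device (GPU) pointer        */
};

/* Hypothetical command number the driver's header would define. */
#define P2P_DMA_SUBMIT  _IOW('p', 1, struct p2p_dma_request)
```

The extension would then call ioctl(devfd, P2P_DMA_SUBMIT, &req) on the
driver's special device file - and source_fd is exactly the raw descriptor
currently hidden inside fd.c, which is why the accessor patch is needed.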

> > -Original Message-
> > From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> > Sent: Saturday, February 13, 2016 1:46 PM
> > To: Kaigai Kouhei(海外 浩平)
> > Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> > Subject: Re: [HACKERS] Way to check whether a particular block is on the
> > shared_buffer?
> >
> > On Thu, Feb 11, 2016 at 9:05 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > > Hmm. In my experience, it is often not a productive discussion whether
> > > a feature is niche or commodity. So, let me change the viewpoint.
> > >
> > > We may utilize OS-level locking mechanism here.
> > >
> > > Even though it depends on filesystem implementation under the VFS,
> > > we may use inode->i_mutex lock that shall be acquired during the buffer
> > > copy from user to kernel, at least, on a few major filesystems; ext4,
> > > xfs and btrfs in my research. As well, the modified NVMe SSD driver can
> > > acquire the inode->i_mutex lock during P2P DMA transfer.
> > >
> > > Once we can consider the OS buffer is updated atomically by the lock,
> > > we don't need to worry about corrupted pages, but still needs to pay
> > > attention to the scenario when updated buffer page is moved to GPU.
> > >
> > > In this case, PD_ALL_VISIBLE may give us a hint. GPU side has no MVCC
> > > infrastructure, so I intend to move all-visible pages only.
> > > If someone updates the buffer concurrently, then write out the page
> > > including invisible tuples, PD_ALL_VISIBLE flag shall be cleared because
> > > updated tuples should not be visible to the transaction which issued
> > > P2P DMA.
> > >
> > > Once GPU met a page with !PD_ALL_VISIBLE, it can return an error status
> > > that indicates CPU to retry this page again. In this case, this page is
> > > likely loaded to the shared buffer already, so retry penalty is not so
> > > much.
> > >
> > > I'll try to investigate the implementation in this way.
> > > Please correct me, if I misunderstand something (especially, treatment
> > > of PD_ALL_VISIBLE).
> >
> > I suppose there's no theoretical reason why the buffer couldn't go
> > from all-visible to not-all-visible and back to all-visible again all
> > during the time you are copying it.
> >
> The backend process that is copying the data to GPU has a transaction
> in-progress (= not committed). Is it possible to get the updated buffer
> page back to the all-visible state again?
> I expect that in-progress transactions works as a blocker for backing
> to all-visible. Right?
> 
> > Honestly, I think trying to access buffers without going through
> > shared_buffers is likely to be very hard to make correct and probably
> > a loser.
> >
> No challenge, no outcome. ;-)
> 
> > Copying the data into shared_buffers and then to the GPU is,
> > doubtless, at least somewhat slower.  But I kind of doubt that it's
> > enough slower to make up for all of the problems you're going to have
> > with the approach you've chosen.
> >
> Honestly, I'm still uncertain whether it works well as I expects.
> However, scan workload on the table larger than main memory is
> headache for PG-Strom, so I'd like to try ideas we can implement.
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kai...@ak.jp.nec.com>
>



pgsql-v9.6-filegetrawdesc.1.patch
Description: pgsql-v9.6-filegetrawdesc.1.patch



[HACKERS] Reworks of CustomScan serialization/deserialization

2016-02-29 Thread Kouhei Kaigai
Hello,

I'd like to adjust a few parts of the custom-scan interface prior to the
v9.6 freeze.

The major point is the serialization/deserialization mechanism.
Right now, an extension has to give LibraryName and SymbolName to
reproduce the same CustomScanMethods on the background worker process
side. Indeed, that is sufficient information to pull in the table of
function pointers.

On the other hand, we now have a different mechanism to wrap private
information - the extensible node. It requires an extension to register
its ExtensibleNodeMethods, identified by name, usually at _PG_init()
time. That is also a reasonable way to reproduce the same objects on the
background worker side.

However, a mixture of two different ways is not good. My preference is
what extensible-node does rather than what custom-scan currently does.
The attached patch allows an extension to register its CustomScanMethods
once; readfuncs.c can then pull this table by CustomName in string form.
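A self-contained sketch of the registration-by-name scheme (a linear table
here for brevity; the actual patch would use a dynahash keyed by
CustomName, as extensible.c does, and elog(ERROR) on a failed lookup):

```c
#include <string.h>

/* Minimal stand-in for CustomScanMethods: only the lookup key matters. */
typedef struct CustomScanMethods
{
    const char *CustomName;
    /* ... function pointers for creating/executing the custom scan ... */
} CustomScanMethods;

#define MAX_CUSTOM_METHODS 32
static const CustomScanMethods *registry[MAX_CUSTOM_METHODS];
static int nregistered = 0;

/* Called once by the extension, e.g. from _PG_init().
 * (This sketch omits the overflow and duplicate-name checks.) */
void RegisterCustomScanMethods(const CustomScanMethods *methods)
{
    registry[nregistered++] = methods;
}

/* Called by readfuncs.c when deserializing a CustomScan node on a
 * background worker: the CustomName string read from the serialized
 * plan is resolved back into the function-pointer table. */
const CustomScanMethods *GetCustomScanMethods(const char *name)
{
    for (int i = 0; i < nregistered; i++)
        if (strcmp(registry[i]->CustomName, name) == 0)
            return registry[i];
    return NULL;        /* real code would elog(ERROR, ...) instead */
}
```

This removes the need to serialize LibraryName/SymbolName at all: the name
alone identifies the table, provided the extension was loaded on the worker.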


The minor point is the header file location of the CustomMethods
declarations. They are currently declared in relation.h, plannodes.h and
execnodes.h. These files are very primitive, so we put these lines:

  struct ParallelContext; /* avoid including parallel.h here */
  struct shm_toc; /* avoid including shm_toc.h here */
  struct ExplainState;/* avoid including explain.h here */

to avoid inclusion of other headers here.

It seems to me the CustomMethods declarations should be moved somewhere
appropriate, like fdwapi.h for FDW. If we put "struct CustomMethods;" in
these primitive header files instead, it will work.

I'm not 100% certain whether "nodes/custom-apis.h" is the best location,
but we may need somewhere other than the primitive header files to put
these declarations.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v9.6-custom-scan-serialization-reworks.1.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.1.patch



[HACKERS] A trivial fix on extensiblenode

2016-02-29 Thread Kouhei Kaigai
Hello,

RegisterExtensibleNodeMethods() initializes its hash table
with keysize=NAMEDATALEN, instead of EXTNODENAME_MAX_LEN.

The attached patch fixes it.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




pgsql-v9.6-trivial-fix-extensiblenode.patch
Description: pgsql-v9.6-trivial-fix-extensiblenode.patch



Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-13 Thread Kouhei Kaigai




> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Saturday, February 13, 2016 1:46 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On Thu, Feb 11, 2016 at 9:05 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > Hmm. In my experience, it is often not a productive discussion whether
> > a feature is niche or commodity. So, let me change the viewpoint.
> >
> > We may utilize OS-level locking mechanism here.
> >
> > Even though it depends on filesystem implementation under the VFS,
> > we may use inode->i_mutex lock that shall be acquired during the buffer
> > copy from user to kernel, at least, on a few major filesystems; ext4,
> > xfs and btrfs in my research. As well, the modified NVMe SSD driver can
> > acquire the inode->i_mutex lock during P2P DMA transfer.
> >
> > Once we can consider the OS buffer is updated atomically by the lock,
> > we don't need to worry about corrupted pages, but still needs to pay
> > attention to the scenario when updated buffer page is moved to GPU.
> >
> > In this case, PD_ALL_VISIBLE may give us a hint. GPU side has no MVCC
> > infrastructure, so I intend to move all-visible pages only.
> > If someone updates the buffer concurrently, then write out the page
> > including invisible tuples, PD_ALL_VISIBLE flag shall be cleared because
> > updated tuples should not be visible to the transaction which issued
> > P2P DMA.
> >
> > Once GPU met a page with !PD_ALL_VISIBLE, it can return an error status
> > that indicates CPU to retry this page again. In this case, this page is
> > likely loaded to the shared buffer already, so retry penalty is not so
> > much.
> >
> > I'll try to investigate the implementation in this way.
> > Please correct me, if I misunderstand something (especially, treatment
> > of PD_ALL_VISIBLE).
> 
> I suppose there's no theoretical reason why the buffer couldn't go
> from all-visible to not-all-visible and back to all-visible again all
> during the time you are copying it.
>
The backend process that is copying the data to the GPU has a transaction
in progress (= not committed). Is it possible for the updated buffer
page to get back to the all-visible state again?
I expect that the in-progress transaction works as a blocker against
going back to all-visible. Right?

> Honestly, I think trying to access buffers without going through
> shared_buffers is likely to be very hard to make correct and probably
> a loser.
>
No challenge, no outcome. ;-)

> Copying the data into shared_buffers and then to the GPU is,
> doubtless, at least somewhat slower.  But I kind of doubt that it's
> enough slower to make up for all of the problems you're going to have
> with the approach you've chosen.
>
Honestly, I'm still uncertain whether it works as well as I expect.
However, scan workloads on tables larger than main memory are a
headache for PG-Strom, so I'd like to try the ideas we can implement.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-11 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, February 11, 2016 1:26 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Andres Freund; Amit Kapila; pgsql-hackers
> Subject: Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan 
> support
> on readfuncs.c)
> 
> On Wed, Feb 10, 2016 at 1:25 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > It is pretty good!
> >
> > The attached patch (primary one) implements the above idea.
> >
> > Now ExtensibleNode works as a basis structure of data container,
> > regardless of CustomScan and ForeignScan.
> > Also, fdw_private and custom_private are de-defined to Node * type
> > from List * type. It affected to a few FDW APIs.
> 
> I extracted the subset of this that just creates the extensible node
> framework and did some cleanup - the result is attached.  I will
> commit this if nobody objects.
>
I have no objection of course.

> I think the part about whacking around the FDW API is a little more
> potentially objectionable to others, so I want to hold off doing that
> unless a few more people chime in with +1.  Perhaps we could start a
> new thread to talk about that specific idea.  This is useful even
> without that, though.
>

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-11 Thread Kouhei Kaigai
> On Tue, Feb 9, 2016 at 6:35 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > Unfortunately, it was not sufficient.
> >
> > Due to the assumption, the buffer page to be suspended does not exist
> > when a backend process issues a series P2P DMA command. (If block would
> > be already loaded to the shared buffer, it don't need to issue P2P DMA,
> > but just use usual memory<->device DMA because RAM is much faster than
> > SSD.)
> > It knows the pair of (rel,fork,block), but no BufferDesc of this block
> > exists. Thus, it cannot acquire locks in BufferDesc structure.
> >
> > Even if the block does not exist at this point, concurrent process may
> > load the same page. BufferDesc of this page shall be assigned at this
> > point, however, here is no chance to lock something in BufferDesc for
> > the process which issues P2P DMA command.
> >
> > It is the reason why I assume the suspend/resume mechanism shall take
> > a pair of (rel,fork,block) as identifier of the target block.
> 
> I see the problem, but I'm not terribly keen on putting in the hooks
> that it would take to let you solve it without hacking core.  It
> sounds like an awfully invasive thing for a pretty niche requirement.
>
Hmm. In my experience, debating whether a feature is niche or commodity
is often not productive. So let me change the viewpoint.

We may utilize an OS-level locking mechanism here.

Even though it depends on the filesystem implementation under the VFS,
we may rely on the inode->i_mutex lock that is acquired during the buffer
copy from user to kernel space, at least on a few major filesystems -
ext4, xfs and btrfs, per my research. Likewise, the modified NVMe SSD
driver can acquire the inode->i_mutex lock during the P2P DMA transfer.

Once we can consider the OS buffer to be updated atomically under the
lock, we don't need to worry about corrupted pages, but we still need to
pay attention to the scenario where an updated buffer page is moved to
the GPU.

In this case, PD_ALL_VISIBLE may give us a hint. The GPU side has no MVCC
infrastructure, so I intend to move all-visible pages only.
If someone updates the buffer concurrently and then writes out the page
including invisible tuples, the PD_ALL_VISIBLE flag will be cleared,
because the updated tuples should not be visible to the transaction which
issued the P2P DMA.

Once the GPU meets a page with !PD_ALL_VISIBLE, it can return an error
status that tells the CPU to retry this page. In that case the page has
likely been loaded into shared buffers already, so the retry penalty is
not large.

I'll try to investigate an implementation along these lines.
Please correct me if I misunderstand something (especially the treatment
of PD_ALL_VISIBLE).
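The retry protocol described here, as minimal self-contained C. The page
layout and status names are stand-ins invented for illustration; only the
control flow is the point:

```c
#include <stdbool.h>

/* Stand-in for the page-header bit the GPU would test. */
typedef struct PageStandIn
{
    bool pd_all_visible;        /* mirrors PD_ALL_VISIBLE flag */
} PageStandIn;

typedef enum
{
    GPU_PAGE_OK,                /* page filtered on the device          */
    GPU_PAGE_RETRY_ON_CPU       /* page may hold invisible tuples; the
                                 * CPU rereads it through shared_buffers
                                 * and applies normal MVCC visibility   */
} GpuPageStatus;

GpuPageStatus gpu_process_page(const PageStandIn *page)
{
    /* A concurrent update clears PD_ALL_VISIBLE, so a cleared bit means
     * this device-side copy cannot be trusted for this transaction. */
    if (!page->pd_all_visible)
        return GPU_PAGE_RETRY_ON_CPU;
    /* ... evaluate quals / project columns on the device ... */
    return GPU_PAGE_OK;
}
```

The cheap-retry argument is that by the time the flag was cleared, the
updated page is almost certainly already resident in shared buffers.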

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-09 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Wednesday, February 10, 2016 1:58 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: ##freemail## Re: [HACKERS] Way to check whether a particular block is
> on the shared_buffer?
> 
> On Sun, Feb 7, 2016 at 9:49 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > On the other hands, it also became clear we have to guarantee OS buffer
> > or storage block must not be updated partially during the P2P DMA.
> > My motivation is a potential utilization of P2P DMA of SSD-to-GPU to
> > filter out unnecessary rows and columns prior to loading to CPU/RAM.
> > It needs to ensure PostgreSQL does not write out buffers to OS buffers
> > to avoid unexpected data corruption.
> >
> > What I want to achieve is suspend of buffer write towards a particular
> > (relnode, forknum, blocknum) pair for a short time, by completion of
> > data processing by GPU (or other external devices).
> > In addition, it is preferable being workable regardless of the choice
> > of storage manager, even if we may have multiple options on top of the
> > pluggable smgr in the future.
> 
> It seems like you just need to take an exclusive content lock on the
> buffer, or maybe a shared content lock would be sufficient.
>
Unfortunately, it was not sufficient.

Under this assumption, the buffer page to be suspended does not exist
when a backend process issues a series of P2P DMA commands. (If the block
were already loaded into shared buffers, we would not need to issue P2P
DMA; we would just use the usual memory<->device DMA, because RAM is much
faster than SSD.)
The backend knows the (rel,fork,block) triple, but no BufferDesc for this
block exists. Thus, it cannot acquire locks in the BufferDesc structure.

Even if the block does not exist at this point, a concurrent process may
load the same page. A BufferDesc for this page will be assigned at that
point; however, there is no chance for the process issuing the P2P DMA
command to lock anything in that BufferDesc.

That is why I assume the suspend/resume mechanism should take a
(rel,fork,block) triple as the identifier of the target block.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-09 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Monday, February 08, 2016 11:59 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Andres Freund; Amit Kapila; pgsql-hackers
> Subject: ##freemail## Re: CustomScan in a larger structure (RE: [HACKERS]
> CustomScan support on readfuncs.c)
> 
> On Sun, Feb 7, 2016 at 7:28 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > The new callbacks of T_ExtensibleNode will replace the necessity to
> > form and deform process of private values, like as:
> >   https://github.com/pg-strom/devel/blob/master/src/gpujoin.c#L114
> 
> Yeah.
> 
> > It transforms a bunch of internal data of CustomScan (similar to the
> > extended fields in T_ExtensibleNode) to/from the node functions
> > understandable forms for copy, input and output support.
> > I think it implies you proposition is workable.
> >
> > I'd like to follow this proposition basically.
> > On the other hands, I have two points I want to pay attention.
> >
> > 1. At this moment, it is allowed to define a larger structure that
> > embeds CustomPath and CustomScanState by extension. How do we treat
> > this coding manner in this case? Especially, CustomScanState has no
> > private pointer dereference because it assumes an extension define
> > a larger structure. Of course, we don't need to care about node
> > operations on Path and PlanState nodes, but right now.
> 
> I see no real advantage in letting a CustomPath be larger.  If
> custom_private can include extension-defined node types, that seems
> good enough.  On the other hand, if CustomScanState can be larger,
> that seems fine.   We don't really need any special support for that,
> do we?
>
Yes. Right now, we have no code path that handles PlanState or its
descendants using node operations, so it is not a real problem.

> > 2. I intended to replace LibraryName and SymbolName fields from the
> > CustomScanMethods structure by integration of extensible node type.
> > We had to give a pair of these identifiers because custom scan provider
> > has no registration points at this moment. A little concern is extension
> > has to assume a particular filename of itself.
> > But, probably, it shall be a separated discussion. My preference is
> > preliminary registration of custom scan provider by its name, as well
> > as extensible node.
> 
> Seems like we could just leave the CustomScan stuff alone and worry
> about this as a separate facility.
>
OK

> > Towards the last question; whether *_private shall be void * or List *,
> > I want to keep fdw_private and custom_private as List * pointer, because
> > a new node type definition is a bit overdone job if this FDW or CSP will
> > take only a few private fields with primitive data types.
> > It is a preferable features when extension defines ten or more private
> > fields.
> 
> Well, I suggested Node *, not void *.  A Node can be a List, but not
> every Node is a List.
>
It is pretty good!

The attached patch (primary one) implements the above idea.

Now ExtensibleNode works as the base structure of the data container,
independent of CustomScan and ForeignScan.
Also, fdw_private and custom_private are redefined from List * type to
Node * type. This affected a few FDW APIs.

The secondary patch is a demonstration of the new ExtensibleNode using
the postgres_fdw extension. Its private data was previously expected to
be packed in a list in a particular order; a self-defined structure keeps
these variables without ugly packing/unpacking.
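For illustration, the kind of self-defined structure meant here might look
like the following. The field names are invented, not those of the actual
demo patch, and in real code the first member would be an ExtensibleNode so
the registered read/out/copy/equal callbacks apply:

```c
#include <stdbool.h>

/* Hypothetical extension-defined node kept in fdw_private, replacing a
 * positionally-ordered List.  Each value gets a named, typed slot
 * instead of a position in a list that must be packed and unpacked in
 * exactly the right order. */
typedef struct PgFdwScanPrivate
{
    /* ExtensibleNode ext;  -- omitted in this stand-alone sketch */
    const char *select_sql;         /* text of the remote SELECT      */
    int         fetch_size;         /* rows fetched per round trip    */
    bool        use_remote_estimate;
} PgFdwScanPrivate;
```

Serialization then becomes a per-field concern of the node callbacks rather
than a fragile positional contract scattered across the planner and executor.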

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


pgsql-v9.6-custom-private.v5.demo.patch
Description: pgsql-v9.6-custom-private.v5.demo.patch


pgsql-v9.6-custom-private.v5.patch
Description: pgsql-v9.6-custom-private.v5.patch



Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-07 Thread Kouhei Kaigai
> On Wed, Feb 3, 2016 at 11:57 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > At this moment, I tried to write up description at nodes/nodes.h.
> > The amount of description is about 100lines. It is on a borderline
> > whether we split off this chunk into another header file, in my sense.
> >
> >
> > On the other hands, I noticed that we cannot omit checks for individual
> > callbacks on Custom node type, ExtensibleNodeMethods is embedded in
> > the CustomMethods structure, thus we may have Custom node with
> > no extensible feature.
> > This manner is beneficial because extension does not need to register
> > the library and symbol name for serialization. So, CustomScan related
> > code still checks existence of individual callbacks.
> 
> I was looking over this patch yesterday, and something was bugging me
> about it, but I couldn't quite put my finger on what it was.  But now
> I think I know.
> 
> I think of an extensible node type facility as something that ought to
> be defined to allow a user to create new types of nodes.  But this is
> not that.  What this does is allows you to have a CustomScan or
> ForeignScan node - that is, the existing node type - which is actually
> larger than a normal node of that type and has some extra data that is
> part of it.  I'm having a really hard time being comfortable with that
> concept.  Somehow, it seems like the node type should determine the
> size of the node.  I can stretch my brain to the point of being able
> to say, well, maybe if the node tag is T_ExtensibleNode, then you can
> look at char *extnodename to figure out what type of node it is
> really, and then from there get the size.  But what this does is:
> every time you see a T_CustomScan or T_ForeignScan node, it might not
> really be that kind of node but something else else, and tomorrow
> there might be another half-dozen node types with a similar property.
> And every one of those node types will have char *extnodename in a
> different place in the structure, so a hypothetical piece of code that
> wanted to find the extension methods for a node, or the size of a
> node, would need a switch that knows about all of those node types.
> It feels very awkward.
> 
> So I have a slightly different proposal.  Let's forget about letting
> T_CustomScan or T_ForeignScan or any other built-in node type vary in
> size.  Instead, let's add T_ExtensibleNode which acts as a completely
> user-defined node.  It has read/out/copy/equalfuncs support along the
> lines of what this patch implements, and that's it.  It's not a scan
> or a path or anything like that: it's just an opaque data container
> that the system can input, output, copy, and test for equality, and
> nothing else.  Isn't that useless?  Not at all.  If you're creating an
> FDW or custom scan provider and want to store some extra data, you can
> create a new type of extensible node and stick it in the fdw_private
> or custom_private field!  The data won't be part of the CustomScan or
> ForeignScan structure itself, as in this patch, but who cares? The
> only objection I can see is that you still need several pointer
> deferences to get to the data since fdw_private is a List, but if
> that's actually a real performance problem we could probably fix it by
> just changing fdw_private to a Node instead.  You'd still need one
> pointer dereference, but that doesn't seem too bad.
> 
> Thoughts?
>
The new callbacks of T_ExtensibleNode will eliminate the need for the
form/deform process for private values, as in:
  https://github.com/pg-strom/devel/blob/master/src/gpujoin.c#L114

It transforms a bunch of CustomScan internal data (similar to the
extended fields in T_ExtensibleNode) to/from forms the node functions
understand, for copy, input and output support.
I think this implies your proposition is workable.

I'd basically like to follow this proposition.
On the other hand, there are two points I want to pay attention to.

1. At this moment, an extension is allowed to define a larger structure
that embeds CustomPath or CustomScanState. How do we treat this coding
style in this case? In particular, CustomScanState has no private
pointer to dereference, because it assumes an extension defines a larger
structure. Of course, we don't need to care about node operations on
Path and PlanState nodes right now.

2. I intended to remove the LibraryName and SymbolName fields from the
CustomScanMethods structure through integration with the extensible node
type. We had to give this pair of identifiers because a custom scan
provider has no registration point at this moment. A small concern is
that an extension has to assume its own particular filename.
But that should probably be a separate discussion. My preference is
preliminary registration of custom scan providers by name, as well as
extensible nodes.

Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-07 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Monday, February 08, 2016 1:52 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is
> on the shared_buffer?
> 
> On Thu, Feb 4, 2016 at 11:34 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > I can agree that smgr hooks shall be primarily designed to make storage
> > systems pluggable, even if we can use this hooks for suspend & resume of
> > write i/o stuff.
> > In addition, "pluggable storage" is a long-standing feature, even though
> > it is not certain whether existing smgr hooks are good starting point.
> > It may be a risk if we implement a grand feature on top of the hooks
> > but out of its primary purpose.
> >
> > So, my preference is a mechanism to hook buffer write to implement this
> > feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
> > has nearly zero cost when no extension activate the feature.)
> > One downside of this approach is larger number of hook points.
> > We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
> > and FlushRelationBuffers, in addition to FlushBuffer, at least.
> 
> I don't understand what you're hoping to achieve by introducing
> pluggability at the smgr layer.  I mean, md.c is pretty much good for
> read and writing from anything that looks like a directory of files.
> Another smgr layer would make sense if we wanted to read and write via
> some kind of network protocol, or if we wanted to have some kind of
> storage substrate that did internally to itself some of the tasks for
> which we are currently relying on the filesystem - e.g. if we wanted
> to be able to use a raw device, or perhaps more plausibly if we wanted
> to reduce the number of separate files we need, or provide a substrate
> that can clip an unused extent out of the middle of a relation
> efficiently.  But I don't understand what this has to do with what
> you're trying to do here.  The subject of this thread is about whether
> you can check for the presence of a block in shared_buffers, and as
> discussed upthread, you can.  I don't quite follow how we made the
> jump from there to smgr pluggability.
>
Yes. smgr pluggability is not what I want to investigate in this thread.
It is not the purpose of the discussion, but one potential idea for the
implementation.

Through the discussion, it became clear that an extension can check for the
existence of a buffer for a particular block using the existing infrastructure.

On the other hand, it also became clear that we have to guarantee that the
OS buffer or storage block is not updated partially during the P2P DMA.
My motivation is a potential use of SSD-to-GPU P2P DMA to filter out
unnecessary rows and columns prior to loading them into CPU/RAM.
It needs to ensure that PostgreSQL does not write out buffers to OS buffers,
to avoid unexpected data corruption.

What I want to achieve is suspension of buffer writes toward a particular
(relnode, forknum, blocknum) pair for a short time, until completion of
the data processing by the GPU (or other external devices).
In addition, it is preferable for this to work regardless of the choice
of storage manager, even if we have multiple options on top of a pluggable
smgr in the future.

Data processing close to the storage requires that the OS buffer not be
updated concurrently under the P2P DMA. So, I want the features below.
1. An extension (that controls the GPU and P2P DMA) can register a particular
   (relnode, forknum, blocknum) pair as a block suspended for write.
2. Once a particular block gets suspended, smgrwrite (or its caller) shall
   be blocked until the suspended block is unregistered.
3. The extension will unregister the block when the P2P DMA from it completes;
   suspended concurrent backends then resume their write i/o.
4. On the other hand, the extension cannot register the block while some
   other concurrent backend executes smgrwrite, to avoid potential data
   corruption.

One idea was injection of a thin layer on top of the smgr mechanism to
implement the above.
However, I'm uncertain whether injecting into the entire set of smgr hooks
is a straightforward approach to achieve it.

The minimum facility I want is a way to get control at the head and tail of
smgrwrite(): to suspend the concurrent write prior to smgr_write, and to
notify the mechanism when the concurrent smgr_write has completed.

It does not need pluggability of smgr, but the entrypoints shall be located
around the smgr functions.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-04 Thread Kouhei Kaigai
> -Original Message-
> From: Jim Nasby [mailto:jim.na...@bluetreble.com]
> Sent: Friday, February 05, 2016 9:17 AM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org; Robert Haas
> Cc: Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
> >> 2. A feature to suspend i/o write-out towards a particular blocks
> >> >that are registered by other concurrent backend, unless it is not
> >> >unregistered (usually, at the end of P2P DMA).
> >> >==> to be discussed.
> 
> I think there's still a race condition here though...
> 
> A
> finds buffer not in shared buffers
> 
> B
> reads buffer in
> modifies buffer
> starts writing buffer to OS
> 
> A
> Makes call to block write, but write is already in process; thinks
> writes are now blocked
> Reads corrupted block
> Much hilarity ensues
> 
> Or maybe you were just glossing over that part for brevity.
> 
> ...
> 
> > I tried to design a draft of enhancement to realize the above i/o write-out
> > suspend/resume, with less invasive way as possible as we can.
> >
> >ASSUMPTION: I intend to implement this feature as a part of extension,
> >because this i/o suspend/resume checks are pure overhead increment
> >for the core features, unless extension which utilizes it.
> >
> > Three functions shall be added:
> >
> > extern intGetStorageMgrNumbers(void);
> > extern f_smgr GetStorageMgrHandlers(int smgr_which);
> > extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);
> >
> > As literal, GetStorageMgrNumbers() returns the number of storage manager
> > currently installed. It always return 1 right now.
> > GetStorageMgrHandlers() returns the currently configured f_smgr table to
> > the supplied smgr_which. It allows extensions to know current configuration
> > of the storage manager, even if other extension already modified it.
> > SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> > the current one.
> > If extension wants to intermediate 'smgr_write', extension will replace
> > the 'smgr_write' by own function, then call the original function, likely
> > mdwrite, from the alternative function.
> >
> > In this case, call chain shall be:
> >
> >FlushBuffer, and others...
> > +-- smgrwrite(...)
> >  +-- (extension's own function)
> >   +-- mdwrite
> 
> ISTR someone (Robert Haas?) complaining that this method of hooks is
> cumbersome to use and can be fragile if multiple hooks are being
> installed. So maybe we don't want to extend it's usage...
> 
> I'm also not sure whether this is better done with an smgr hook or a
> hook into shared buffer handling...
>
# Sorry, I overlooked the later part of your reply.

I can agree that the smgr hooks shall be primarily designed to make storage
systems pluggable, even if we can use these hooks for suspend & resume of
the write i/o stuff.
In addition, "pluggable storage" is a long-standing feature request, even
though it is not certain whether the existing smgr hooks are a good starting
point. It may be a risk if we implement a grand feature on top of the hooks
but outside of their primary purpose.

So, my preference is a mechanism to hook buffer writes to implement this
feature. (Or maybe a built-in write i/o suspend/resume facility, if it
has nearly zero cost when no extension activates the feature.)
One downside of this approach is the larger number of hook points.
We have to deploy the hook near the existing smgrwrite calls of
LocalBufferAlloc and FlushRelationBuffers, in addition to FlushBuffer,
at least.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>





Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-04 Thread Kouhei Kaigai
> On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
> >> 2. A feature to suspend i/o write-out towards a particular blocks
> >> >that are registered by other concurrent backend, unless it is not
> >> >unregistered (usually, at the end of P2P DMA).
> >> >==> to be discussed.
> 
> I think there's still a race condition here though...
> 
> A
> finds buffer not in shared buffers
> 
> B
> reads buffer in
> modifies buffer
> starts writing buffer to OS
> 
> A
> Makes call to block write, but write is already in process; thinks
> writes are now blocked
> Reads corrupted block
> Much hilarity ensues
> 
> Or maybe you were just glossing over that part for brevity.
>
Thanks, this part was not clear in my previous description.

At the time when B starts writing the buffer to the OS, the extension catches
this i/o request using a hook around smgrwrite, and the mechanism registers
the block to block P2P DMA requests during B's write operation.
(Of course, it unregisters the block at the end of smgrwrite.)
So, even if A wants to issue P2P DMA concurrently, it cannot register
the block until B's write operation completes.

In practice, this operation shall be a "try lock", because B's write
operation implies the existence of the buffer in main memory, so A does
not need to wait for B's write operation if A switches the DMA source
from SSD to main memory.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>

> ...
> 
> > I tried to design a draft of enhancement to realize the above i/o write-out
> > suspend/resume, with less invasive way as possible as we can.
> >
> >ASSUMPTION: I intend to implement this feature as a part of extension,
> >because this i/o suspend/resume checks are pure overhead increment
> >for the core features, unless extension which utilizes it.
> >
> > Three functions shall be added:
> >
> > extern intGetStorageMgrNumbers(void);
> > extern f_smgr GetStorageMgrHandlers(int smgr_which);
> > extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);
> >
> > As literal, GetStorageMgrNumbers() returns the number of storage manager
> > currently installed. It always return 1 right now.
> > GetStorageMgrHandlers() returns the currently configured f_smgr table to
> > the supplied smgr_which. It allows extensions to know current configuration
> > of the storage manager, even if other extension already modified it.
> > SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> > the current one.
> > If extension wants to intermediate 'smgr_write', extension will replace
> > the 'smgr_write' by own function, then call the original function, likely
> > mdwrite, from the alternative function.
> >
> > In this case, call chain shall be:
> >
> >FlushBuffer, and others...
> > +-- smgrwrite(...)
> >  +-- (extension's own function)
> >   +-- mdwrite
> 
> ISTR someone (Robert Haas?) complaining that this method of hooks is
> cumbersome to use and can be fragile if multiple hooks are being
> installed. So maybe we don't want to extend it's usage...
> 
> I'm also not sure whether this is better done with an smgr hook or a
> hook into shared buffer handling...
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
> 
> 
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers




Re: [HACKERS] CustomScan under the Gather node?

2016-02-03 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Thursday, February 04, 2016 2:54 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org
> Subject: ##freemail## Re: [HACKERS] CustomScan under the Gather node?
> 
> On Thu, Jan 28, 2016 at 8:14 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> >>  total   ForeignScan   diff
> >> 0 workers: 17584.319 ms   17555.904 ms  28.415 ms
> >> 1 workers: 18464.476 ms   18110.968 ms 353.508 ms
> >> 2 workers: 19042.755 ms   14580.335 ms4462.420 ms
> >> 3 workers: 19318.254 ms   12668.912 ms6649.342 ms
> >> 4 workers: 21732.910 ms   13596.788 ms8136.122 ms
> >> 5 workers: 23486.846 ms   14533.409 ms8953.437 ms
> >>
> >> This workstation has 4 CPU cores, so it is natural nworkers=3 records the
> >> peak performance on ForeignScan portion. On the other hands, nworkers>1 
> >> also
> >> recorded unignorable time consumption (probably, by Gather node?)
> >   :
> >> Further investigation will need
> >>
> > It was a bug of my file_fdw patch. ForeignScan node in the master process 
> > was
> > also kicked by the Gather node, however, it didn't have coordinate 
> > information
> > due to oversight of the initialization at InitializeDSMForeignScan callback.
> > In the result, local ForeignScan node is still executed after the completion
> > of coordinated background worker processes, and returned twice amount of 
> > rows.
> >
> > In the revised patch, results seems to me reasonable.
> >  total ForeignScan  diff
> > 0 workers: 17592.498 ms   17564.457 ms 28.041ms
> > 1 workers: 12152.998 ms   11983.485 ms169.513 ms
> > 2 workers: 10647.858 ms   10502.100 ms145.758 ms
> > 3 workers:  9635.445 ms9509.899 ms125.546 ms
> > 4 workers: 11175.456 ms   10863.293 ms312.163 ms
> > 5 workers: 12586.457 ms   12279.323 ms307.134 ms
> 
> Hmm.  Is the file_fdw part of this just a demo, or do you want to try
> to get that committed?  If so, maybe start a new thread with a more
> appropriate subject line to just talk about that.  I haven't
> scrutinized that part of the patch in any detail, but the general
> infrastructure for FDWs and custom scans to use parallelism seems to
> be in good shape, so I rewrote the documentation and committed that
> part.
>
Thanks. I expect the file_fdw part to be just for demonstration.
It does not require any special hardware to reproduce this parallel
execution, unlike GpuScan of PG-Strom.

> Do you have any idea why this isn't scaling beyond, uh, 1 worker?
> That seems like a good thing to try to figure out.
>
The hardware I ran the above query on has 4 CPU cores, so it is not
surprising that 3 workers (+ 1 master) recorded the peak performance.

In addition, the enhancement of the file_fdw part is a quick-and-dirty hack.

It picks up the next line number to be fetched from the shared memory
segment using pg_atomic_add_fetch_u32(), then reads the input file
until the worker meets the target line; unrelated lines are ignored.
Each worker parses only the lines it is responsible for, thus parallel
execution makes sense in this part. On the other hand, the total amount
of CPU cycles for the file scan will increase, because all the workers
still have to scan all the lines.

If we simply split the time consumption in the 0-worker case as follows:

  (time to scan file; TSF) + (time to parse lines; TPL)

then the total amount of work when we distribute file_fdw across N parallel
processes (including the master) is:

  N * (TSF) + (TPL)

Thus, each individual process has to perform the following amount of work:

  (TSF) + (TPL)/N

It is a typical instance of Amdahl's law when the sequential part is not
small. The above results say the TSF part is about 7.4s, and the TPL part
about 10.1s.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-03 Thread Kouhei Kaigai
> On Wed, Jan 27, 2016 at 9:36 PM, Robert Haas <robertmh...@gmail.com> wrote:
> > On Mon, Jan 25, 2016 at 8:06 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> >> Sorry for my late response. I've been unavailable to have enough
> >> time to touch code for the last 1.5 month.
> >>
> >> The attached patch is a revised one to handle private data of
> >> foregn/custom scan node more gracefully.
> >>
> >> The overall consensus upthread were:
> >> - A new ExtensibleNodeMethods structure defines a unique name
> >>   and a set of callbacks to handle node copy, serialization,
> >>   deserialization and equality checks.
> >> - (Foreign|Custom)(Path|Scan|ScanState) are first host of the
> >>   ExtensibleNodeMethods, to allow extension to define larger
> >>   structure to store its private fields.
> >> - ExtensibleNodeMethods does not support variable length
> >>   structure (like a structure with an array on the tail, use
> >>   separately allocated array).
> >> - ExtensibleNodeMethods shall be registered on _PG_init() of
> >>   extensions.
> >>
> >> The 'pgsql-v9.6-custom-private.v3.patch' is the main part of
> >> this feature. As I pointed out before, it uses dynhash instead
> >> of the self invented hash table.
> >
> > On a first read-through, I see nothing in this patch to which I would
> > want to object except for the fact that the comments and documentation
> > need some work from a native speaker of English.  It looks like what
> > we discussed, and I think it's an improvement over what we have now.
> 
> Well, looking at this a bit more, it seems like the documentation
> you've written here is really misplaced.  The patch is introducing a
> new facility that applies to both CustomScan and ForeignScan, but the
> documentation is only to do with CustomScan.  I think we need a whole
> new chapter on extensible nodes, or something.  I'm actually not
> really keen on the fact that we keep adding SGML documentation for
> this stuff; it seems like it belongs in a README in the source tree.
> We don't explain nodes in general, but now we're going to have to try
> to explain extensible nodes.  How's that going to work?
>
The details of these callbacks are not for end-users or administrators,
but only for core/extension developers. So, I love the idea of not having
such a detailed description in SGML.
How about having more detailed source code comments close to the definition
of ExtensibleNodeMethods?
I had not looked at src/backend/nodes/README before; it has only 6 updates
in its history since the Postgres95 era. I guess people may forget to update
a README file if the description is located separately from the implementation.

> I think you should avoid the call to GetExtensibleNodeMethods() in the
> case where extnodename is NULL.  On the other hand, I think that if
> extnodename is non-NULL, all four methods should be required, so that
> you don't have to check if (methods && methods->nodeRead) but just if
> (extnodename) { methods = GetExtensibleNodeMethods(extnodename);
> methods->nodeRead( ... ); }.  That seems like it would be a bit
> tidier.
>
OK, I'll fix it up. No need to have a 'missing_ok' argument here.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-03 Thread Kouhei Kaigai
> > KaiGai-san,
> >
> > On 2016/02/01 10:38, Kouhei Kaigai wrote:
> > > As an aside, background of my motivation is the slide below:
> > > http://www.slideshare.net/kaigai/sqlgpussd-english
> > > (LT slides in JPUG conference last Dec)
> > >
> > > I'm under investigation of SSD-to-GPU direct feature on top of
> > > the custom-scan interface. It intends to load a bunch of data
> > > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > > It only makes sense if the target blocks are not loaded to the
> > > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > > So, I like to have a reliable way to check the latest status of
> > > the shared buffer, to know whether a particular block is already
> > > loaded or not.
> >
> > Quite interesting stuff, thanks for sharing!
> >
> > I'm in no way expert on this but could this generally be attacked from the
> > smgr API perspective? Currently, we have only one implementation - md.c
> > (the hard-coded RelationData.smgr_which = 0). If we extended that and
> > provided end-to-end support so that there would be md.c alternatives to
> > storage operations, I guess that would open up opportunities for
> > extensions to specify smgr_which as an argument to ReadBufferExtended(),
> > provided there is already support in place to install md.c alternatives
> > (perhaps in .so). Of course, these are just musings and, perhaps does not
> > really concern the requirements of custom scan methods you have been
> > developing.
> >
> Thanks for your idea. Indeed, smgr hooks are good candidate to implement
> the feature, however, what I need is a thin intermediation layer rather
> than alternative storage engine.
> 
> It becomes clear we need two features here.
> 1. A feature to check whether a particular block is already on the shared
>buffer pool.
>It is available. BufTableLookup() under the BufMappingPartitionLock
>gives us the information we want.
> 
> 2. A feature to suspend i/o write-out towards a particular blocks
>that are registered by other concurrent backend, unless it is not
>unregistered (usually, at the end of P2P DMA).
>==> to be discussed.
> 
> When we call smgrwrite(), like FlushBuffer(), it fetches function pointer
> from the 'smgrsw' array, then calls smgr_write.
> 
>   void
>   smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
> char *buffer, bool skipFsync)
>   {
>   (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
> buffer, skipFsync);
>   }
> 
> If extension would overwrite smgrsw[] array, then call the original
> function under the control by extension, it allows to suspend the call
> of the original smgr_write until completion of P2P DMA.
> 
> It may be a minimum invasive way to implement, and portable to any
> further storage layers.
> 
> How about your thought? Even though it is a bit different from your
> original proposition.
>
I tried to design a draft of an enhancement to realize the above i/o
write-out suspend/resume, in as minimally invasive a way as possible.

  ASSUMPTION: I intend to implement this feature as part of an extension,
  because these i/o suspend/resume checks are pure overhead for the core
  features unless an extension utilizes them.

Three functions shall be added:

extern int    GetStorageMgrNumbers(void);
extern f_smgr GetStorageMgrHandlers(int smgr_which);
extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);

As the names suggest, GetStorageMgrNumbers() returns the number of storage
managers currently installed. It always returns 1 right now.
GetStorageMgrHandlers() returns the currently configured f_smgr table for
the supplied smgr_which. It allows extensions to know the current
configuration of the storage manager, even if another extension has already
modified it.
SetStorageMgrHandlers() assigns the supplied 'smgr_handlers' in place of
the current one.
If an extension wants to intermediate 'smgr_write', it will replace
'smgr_write' with its own function, then call the original function, likely
mdwrite, from the alternative function.

In this case, call chain shall be:

  FlushBuffer, and others...
   +-- smgrwrite(...)
+-- (extension's own function)
 +-- mdwrite

Once the extension's own function blocks write i/o until the P2P DMA is
completed by the concurrent process, we don't need to care about partial
updates of the OS cache or the storage device.
It is not difficult for extensions to implement a feature to track/untrack
a pair of (relFileNode, forkNum, blockNum), 

Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-03 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, February 04, 2016 11:39 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Andres Freund; Amit Kapila; pgsql-hackers
> Subject: Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan 
> support
> on readfuncs.c)
> 
> On Wed, Feb 3, 2016 at 8:00 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> >> Well, looking at this a bit more, it seems like the documentation
> >> you've written here is really misplaced.  The patch is introducing a
> >> new facility that applies to both CustomScan and ForeignScan, but the
> >> documentation is only to do with CustomScan.  I think we need a whole
> >> new chapter on extensible nodes, or something.  I'm actually not
> >> really keen on the fact that we keep adding SGML documentation for
> >> this stuff; it seems like it belongs in a README in the source tree.
> >> We don't explain nodes in general, but now we're going to have to try
> >> to explain extensible nodes.  How's that going to work?
> >>
> > The detail of these callbacks are not for end-users, administrators and
> > so on except for core/extension developers. So, I loves idea not to have
> > such a detailed description in SGML.
> > How about an idea to have more detailed source code comments close to
> > the definition of ExtensibleNodeMethods?
> > I haven't seen the src/backend/nodes/README yet, and it has only 6 updates
> > history from Postgres95 era. I guess people may forget to update README
> > file if description is separately located from the implementation.
> 
> Hmm, that might work, although that file is so old that it may be
> difficult to add to.  Another idea is: maybe we could have a header
> file for the extensible node stuff and just give it a really long
> header comment.
>
At the moment, I have tried writing up the description in nodes/nodes.h.
The amount of description is about 100 lines. In my sense, it is borderline
whether we should split this chunk off into another header file.


On the other hand, I noticed that we cannot omit checks for individual
callbacks on the Custom node type. ExtensibleNodeMethods is embedded in
the CustomMethods structure, so we may have a Custom node with no
extensible features.
This manner is beneficial because the extension does not need to register
the library and symbol name for serialization. So, the CustomScan related
code still checks for the existence of individual callbacks.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 
> 
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


pgsql-v9.6-custom-private.v4.patch
Description: pgsql-v9.6-custom-private.v4.patch



Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-02 Thread Kouhei Kaigai
> > > On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
> 
> > > To answer your direct question, I'm no expert, but I haven't seen any
> > > functions that do exactly what you want. You'd have to pull relevant
> > > bits from ReadBuffer_*. Or maybe a better method would just be to call
> > > BufTableLookup() without any locks and if you get a result > -1 just
> > > call the relevant ReadBuffer function. Sometimes you'll end up calling
> > > ReadBuffer even though the buffer isn't in shared buffers, but I would
> > > think that would be a rare occurrence.
> > >
> > Thanks, indeed, extension can call BufTableLookup(). PrefetchBuffer()
> > has a good example for this.
> >
> > If it returned a valid buf_id, we have nothing difficult; just call
> > ReadBuffer() to pin the buffer.
> 
> Isn't this what (or very similar to)
> ReadBufferExtended(RBM_ZERO_AND_LOCK) is already doing?
>
This operation actually acquires a buffer page and fills it with zeroes,
and a valid buffer page is evicted if no free buffer page remains.
I want to keep the contents of the shared buffers that are already loaded
in main memory. P2P DMA and GPU preprocessing intend to minimize main
memory consumption by rows that will be filtered out by scan qualifiers.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-01 Thread Kouhei Kaigai
> KaiGai-san,
> 
> On 2016/02/01 10:38, Kouhei Kaigai wrote:
> > As an aside, background of my motivation is the slide below:
> > http://www.slideshare.net/kaigai/sqlgpussd-english
> > (LT slides in JPUG conference last Dec)
> >
> > I'm under investigation of SSD-to-GPU direct feature on top of
> > the custom-scan interface. It intends to load a bunch of data
> > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > It only makes sense if the target blocks are not loaded to the
> > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > So, I like to have a reliable way to check the latest status of
> > the shared buffer, to know whether a particular block is already
> > loaded or not.
> 
> Quite interesting stuff, thanks for sharing!
> 
> I'm in no way expert on this but could this generally be attacked from the
> smgr API perspective? Currently, we have only one implementation - md.c
> (the hard-coded RelationData.smgr_which = 0). If we extended that and
> provided end-to-end support so that there would be md.c alternatives to
> storage operations, I guess that would open up opportunities for
> extensions to specify smgr_which as an argument to ReadBufferExtended(),
> provided there is already support in place to install md.c alternatives
> (perhaps in .so). Of course, these are just musings and, perhaps does not
> really concern the requirements of custom scan methods you have been
> developing.
>
Thanks for your idea. Indeed, the smgr hooks are a good candidate to
implement the feature; however, what I need is a thin intermediation layer
rather than an alternative storage engine.

It has become clear that we need two features here.
1. A feature to check whether a particular block is already in the shared
   buffer pool.
   It is available: BufTableLookup() under the BufMappingPartitionLock
   gives us the information we want.

2. A feature to suspend i/o write-out toward particular blocks that are
   registered by another concurrent backend, until they are unregistered
   (usually, at the end of the P2P DMA).
   ==> to be discussed.

When we call smgrwrite(), as in FlushBuffer(), it fetches the function
pointer from the 'smgrsw' array, then calls smgr_write.

  void
  smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
            char *buffer, bool skipFsync)
  {
      (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
                                                buffer, skipFsync);
  }

If an extension overwrites the smgrsw[] array, then calls the original
function under the extension's control, it can suspend the call of the
original smgr_write until completion of the P2P DMA.

It may be a minimally invasive way to implement this, and portable to any
future storage layers.

What is your thought? Even though it is a bit different from your
original proposition.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>





Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-01 Thread Kouhei Kaigai
> On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
> > I'm under investigation of SSD-to-GPU direct feature on top of
> > the custom-scan interface. It intends to load a bunch of data
> > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > It only makes sense if the target blocks are not loaded to the
> > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > So, I like to have a reliable way to check the latest status of
> > the shared buffer, to know whether a particular block is already
> > loaded or not.
> 
> That completely ignores the OS cache though... wouldn't that be a major
> issue?
>
Once we can ensure the target block is not cached in the shared buffers,
it is the job of the driver that supports P2P DMA to handle the OS page
cache. Once the driver gets a P2P DMA request from PostgreSQL, it checks the
OS page cache status and determines the DMA source: either the OS buffer or
the SSD block.

> To answer your direct question, I'm no expert, but I haven't seen any
> functions that do exactly what you want. You'd have to pull relevant
> bits from ReadBuffer_*. Or maybe a better method would just be to call
> BufTableLookup() without any locks and if you get a result > -1 just
> call the relevant ReadBuffer function. Sometimes you'll end up calling
> ReadBuffer even though the buffer isn't in shared buffers, but I would
> think that would be a rare occurrence.
>
Thanks. Indeed, an extension can call BufTableLookup(); PrefetchBuffer()
has a good example of this.

If it returns a valid buf_id, nothing is difficult; we just call
ReadBuffer() to pin the buffer.

Otherwise, when BufTableLookup() returns a negative value, it means the
(relation, forknum, blocknum) tuple does not exist in the shared buffer.
The extension then enqueues a P2P DMA request for asynchronous transfer,
and the driver processes the P2P DMA shortly afterwards.
Concurrent access can always happen, but PostgreSQL uses MVCC, so the backend
that issued the P2P DMA does not need to pay attention to new tuples that
did not exist at executor start time, even if another backend loads and
updates the same buffer just after the above BufTableLookup().

On the other hand, we have to pay attention to whether a fraction of
the buffer page has been partially written to the OS buffer or storage. That
is in the scope of the operating system, so it is not controllable by us.

One idea I can come up with is a temporary suspension of FlushBuffer() for
a particular (relation, forknum, blocknum) triple until the P2P DMA gets
completed. Even if a concurrent backend updates the buffer page after the
BufTableLookup(), this prevents the OS caches and storage from getting
dirty during the P2P DMA.

What are people's thoughts?
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>





[HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-01-31 Thread Kouhei Kaigai
Hello,

Do we have a reliable way to check whether a particular heap block
is already in the shared buffer, without modifying it?

Right now, ReadBuffer and ReadBufferExtended are the entry points of the
buffer manager for extensions. However, they always try to acquire an
available buffer (or evict a victim buffer), regardless of
the ReadBufferMode.

It is different from what I want to do:
 1. Check whether the supplied BlockNum is already loaded in the
shared buffer.
 2. If yes, the caller gets the buffer descriptor as with the usual ReadBuffer.
 3. If not, the caller gets InvalidBuffer without any modification of
the shared buffer, and without claiming a victim buffer.

It allows extensions (likely a custom scan provider) to take
different strategies for a large table's scan, according to the
latest status of individual blocks.
If we don't have such an interface, it seems to me that enhancing
ReadBuffer_common and (Local)BufferAlloc is the only way
to implement the feature.

Of course, we need to carefully investigate the definition of a 'valid'
buffer. What about a buffer with BM_IO_IN_PROGRESS set?
What about a buffer that needs storage extension (thus, the relevant
physical storage does not exist yet)? ... and so on.
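One possible answer to the 'valid' question can be sketched as a flag check. This is only a toy policy under assumptions: the flag names echo bufmgr's state bits, but the values and the usable_without_io() helper are made up here, and other policies (e.g. also rejecting not-yet-extended pages) are equally defensible.

```c
#include <assert.h>

/* Toy flag bits echoing bufmgr state flags (names borrowed, values made up). */
#define BM_VALID           (1 << 0)
#define BM_IO_IN_PROGRESS  (1 << 1)

/* One policy sketch: treat a buffer as "already loaded" only when it is
 * valid and no I/O is in flight; everything else reports InvalidBuffer. */
static int usable_without_io(unsigned flags)
{
    return (flags & BM_VALID) != 0 && (flags & BM_IO_IN_PROGRESS) == 0;
}

int main(void)
{
    assert(usable_without_io(BM_VALID));
    assert(!usable_without_io(BM_VALID | BM_IO_IN_PROGRESS));
    assert(!usable_without_io(0));              /* not loaded at all */
    return 0;
}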


As an aside, background of my motivation is the slide below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(LT slides in JPUG conference last Dec)

I'm under investigation of SSD-to-GPU direct feature on top of
the custom-scan interface. It intends to load a bunch of data
blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
loading onto CPU/RAM, to preprocess the data to be filtered out.
It only makes sense if the target blocks are not loaded into
the CPU/RAM yet, because the SSD device is essentially slower than RAM.
So, I would like to have a reliable way to check the latest status of
the shared buffer, to know whether a particular block is already
loaded or not.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 






Re: [HACKERS] CustomScan under the Gather node?

2016-01-28 Thread Kouhei Kaigai
> If I would make a proof-of-concept patch with interface itself, it
> seems to me file_fdw may be a good candidate for this enhancement.
> It is not a field for postgres_fdw.
>
The attached patch is an enhancement of the FDW/CSP interface and a PoC feature
of file_fdw to scan the source file partially. It was a smaller enhancement
than I expected.

It works as follows. This query tried to read 20M rows from a CSV file,
using 3 background worker processes.

postgres=# set max_parallel_degree = 3;
SET
postgres=# explain analyze select * from test_csv where id % 20 = 6;
  QUERY PLAN

 Gather  (cost=1000.00..194108.60 rows=94056 width=52)
 (actual time=0.570..19268.010 rows=200 loops=1)
   Number of Workers: 3
   ->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056 
width=52)
  (actual time=0.180..12744.655 rows=50 
loops=4)
 Filter: ((id % 20) = 6)
 Rows Removed by Filter: 950
 Foreign File: /tmp/testdata.csv
 Foreign File Size: 1504892535
 Planning time: 0.147 ms
 Execution time: 19330.201 ms
(9 rows)


I'm not 100% certain whether this implementation of file_fdw is reasonable
for partial reads; however, the callbacks located in the following functions
made it possible to implement parallel-aware custom logic based on the
coordination information.

> * ExecParallelEstimate
> * ExecParallelInitializeDSM
> * ExecParallelInitializeWorker

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>

> -Original Message-
> From: Kaigai Kouhei(海外 浩平)
> Sent: Thursday, January 28, 2016 9:33 AM
> To: 'Robert Haas'
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CustomScan under the Gather node?
> 
> > On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > > What enhancement will be necessary to implement similar feature of
> > > partial seq-scan using custom-scan interface?
> > >
> > > It seems to me callbacks on the three points below are needed.
> > > * ExecParallelEstimate
> > > * ExecParallelInitializeDSM
> > > * ExecParallelInitializeWorker
> > >
> > > Anything else?
> > > Does ForeignScan also need equivalent enhancement?
> >
> > For postgres_fdw, running the query from a parallel worker would
> > change the transaction semantics.  Suppose you begin a transaction,
> > UPDATE data on the foreign server, and then run a parallel query.  If
> > the leader performs the ForeignScan it will see the uncommitted
> > UPDATE, but a worker would have to make its own connection which not
> > be part of the same transaction and which would therefore not see the
> > update.  That's a problem.
> >
> Ah, yes, as long as FDW driver ensure the remote session has no
> uncommitted data, pg_export_snapshot() might provide us an opportunity,
> however, once a session writes something, FDW driver has to prohibit it.
> 
> > Also, for postgres_fdw, and many other FDWs I suspect, the assumption
> > is that most of the work is being done on the remote side, so doing
> > the work in a parallel worker doesn't seem super interesting.  Instead
> > of incurring transfer costs to move the data from remote to local, we
> > incur two sets of transfer costs: first remote to local, then worker
> > to leader.  Ouch.  I think a more promising line of inquiry is to try
> > to provide asynchronous execution when we have something like:
> >
> > Append
> > -> Foreign Scan
> > -> Foreign Scan
> >
> > ...so that we can return a row from whichever Foreign Scan receives
> > data back from the remote server first.
> >
> > So it's not impossible that an FDW author could want this, but mostly
> > probably not.  I think.
> >
> Yes, I also have same opinion. Likely, local parallelism is not
> valuable for the class of FDWs that obtains data from the remote
> server (e.g, postgres_fdw, ...), expect for the case when packing
> and unpacking cost over the network is major bottleneck.
> 
> On the other hands, it will be valuable for the class of FDW that
> performs as a wrapper to local data structure, as like current
> partial seq-scan doing. (e.g, file_fdw, ...)
> Its data source is not under the transaction control, and 'remote
> execution' of these FDWs are eventually executed on the local
> computing resources.
> 
> If I would make a proof-of-concept patch with interface itself, it
> seems to me file_fdw may be a good candidate for this enhancement.
> It is not a field for postgres_fdw.
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei <kai...@ak.jp.nec.com>



pgsql-v9.6-parallel-cspfdw.v1.patch
Description: pgsql-v9.6-parallel-cspfdw.v1.patch



Re: [HACKERS] CustomScan under the Gather node?

2016-01-28 Thread Kouhei Kaigai
> On Thu, Jan 28, 2016 at 10:50 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> >> If I would make a proof-of-concept patch with interface itself, it
> >> seems to me file_fdw may be a good candidate for this enhancement.
> >> It is not a field for postgres_fdw.
> >>
> > The attached patch is enhancement of FDW/CSP interface and PoC feature
> > of file_fdw to scan source file partially. It was smaller enhancement
> > than my expectations.
> >
> > It works as follows. This query tried to read 20M rows from a CSV file,
> > using 3 background worker processes.
> >
> > postgres=# set max_parallel_degree = 3;
> > SET
> > postgres=# explain analyze select * from test_csv where id % 20 = 6;
> >   QUERY PLAN
> >
> 
> 
> >  Gather  (cost=1000.00..194108.60 rows=94056 width=52)
> >  (actual time=0.570..19268.010 rows=200 loops=1)
> >Number of Workers: 3
> >->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056
> width=52)
> >   (actual time=0.180..12744.655 rows=50
> loops=4)
> >  Filter: ((id % 20) = 6)
> >  Rows Removed by Filter: 950
> >  Foreign File: /tmp/testdata.csv
> >  Foreign File Size: 1504892535
> >  Planning time: 0.147 ms
> >  Execution time: 19330.201 ms
> > (9 rows)
> 
> Could you try it not in parallel and then with 1, 2, 3, and 4 workers
> and post the times for all?
>
The above query has 5% selectivity on the entire CSV file.
Its execution times (total, and ForeignScan only) are below:

 total ForeignScandiff
0 workers: 17584.319 ms   17555.904 ms  28.415 ms
1 workers: 18464.476 ms   18110.968 ms 353.508 ms
2 workers: 19042.755 ms   14580.335 ms4462.420 ms
3 workers: 19318.254 ms   12668.912 ms6649.342 ms
4 workers: 21732.910 ms   13596.788 ms8136.122 ms
5 workers: 23486.846 ms   14533.409 ms8953.437 ms

This workstation has 4 CPU cores, so it is natural that nworkers=3 records the
peak performance on the ForeignScan portion. On the other hand, nworkers>1 also
recorded non-negligible time consumption (probably in the Gather node?).

An interesting observation was that lower selectivity (1% and 0%) didn't change
the result much. Something other than file_fdw consumes CPU time.

* selectivity 1%
   total   ForeignScan   diff
0 workers: 17573.572 ms   17566.875 ms  6.697 ms
1 workers: 18098.070 ms   18020.790 ms 77.280 ms
2 workers: 18676.078 ms   14600.749 ms   4075.329 ms
3 workers: 18830.597 ms   12731.459 ms   6099.138 ms
4 workers: 21015.842 ms   13590.657 ms   7425.185 ms
5 workers: 22865.496 ms   14634.342 ms   8231.154 ms

* selectivity 0% (...so Gather didn't work hard actually)
  totalForeignScan   diff
0 workers: 17551.011 ms   17550.811 ms  0.200 ms
1 workers: 18055.185 ms   18048.975 ms  6.210 ms
2 workers: 18567.660 ms   14593.974 ms   3973.686 ms
3 workers: 18649.819 ms   12671.429 ms   5978.390 ms
4 workers: 20619.184 ms   13606.715 ms   7012.469 ms
5 workers: 22557.575 ms   14594.420 ms   7963.155 ms

Further investigation is needed.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>

postgres=# explain analyze select * from test_csv where id % 100 = 100;
   QUERY PLAN
-
 Foreign Scan on test_csv  (cost=0.00..2158874.49 rows=94056 width=52) (actual 
time=17550.811..17550.811 rows=0 loops=1)
   Filter: ((id % 100) = 100)
   Rows Removed by Filter: 2000
   Foreign File: /tmp/testdata.csv
   Foreign File Size: 1504892535
 Planning time: 1.175 ms
 Execution time: 17551.011 ms
(7 rows)

postgres=# SET max_parallel_degree = 1;
SET
postgres=# explain analyze select * from test_csv where id % 100 = 100;
  QUERY PLAN
---
 Gather  (cost=1000.00..194108.60 rows=94056 width=52) (actual 
time=18054.651..18054.651 rows=0 loops=1)
   Number of Workers: 1
   ->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056 
width=52) (actual time=18048.975..18048.975 rows=0 loops=2)
 Filter: ((id % 100) = 100)
 Rows Removed by Filter: 2000
 Foreign File: /tmp/testdata.csv
 Foreign File Size: 1504892535
 Planning time: 0.461 ms

Re: [HACKERS] CustomScan under the Gather node?

2016-01-28 Thread Kouhei Kaigai
>  total ForeignScandiff
> 0 workers: 17584.319 ms   17555.904 ms  28.415 ms
> 1 workers: 18464.476 ms   18110.968 ms 353.508 ms
> 2 workers: 19042.755 ms   14580.335 ms4462.420 ms
> 3 workers: 19318.254 ms   12668.912 ms6649.342 ms
> 4 workers: 21732.910 ms   13596.788 ms8136.122 ms
> 5 workers: 23486.846 ms   14533.409 ms8953.437 ms
> 
> This workstation has 4 CPU cores, so it is natural nworkers=3 records the
> peak performance on ForeignScan portion. On the other hands, nworkers>1 also
> recorded unignorable time consumption (probably, by Gather node?)
  :
> Further investigation will need
>
It was a bug in my file_fdw patch. The ForeignScan node in the master process
was also kicked by the Gather node; however, it didn't have the coordination
information due to an oversight in the initialization at the
InitializeDSMForeignScan callback.
As a result, the local ForeignScan node was still executed after the completion
of the coordinated background worker processes, and returned twice the number
of rows.

With the revised patch, the results seem reasonable to me:
 total ForeignScan  diff
0 workers: 17592.498 ms   17564.457 ms 28.041ms
1 workers: 12152.998 ms   11983.485 ms169.513 ms
2 workers: 10647.858 ms   10502.100 ms145.758 ms
3 workers:  9635.445 ms9509.899 ms125.546 ms
4 workers: 11175.456 ms   10863.293 ms312.163 ms
5 workers: 12586.457 ms   12279.323 ms307.134 ms

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Friday, January 29, 2016 8:51 AM
> To: Robert Haas
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CustomScan under the Gather node?
> 
> > On Thu, Jan 28, 2016 at 10:50 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> 
> > wrote:
> > >> If I would make a proof-of-concept patch with interface itself, it
> > >> seems to me file_fdw may be a good candidate for this enhancement.
> > >> It is not a field for postgres_fdw.
> > >>
> > > The attached patch is enhancement of FDW/CSP interface and PoC feature
> > > of file_fdw to scan source file partially. It was smaller enhancement
> > > than my expectations.
> > >
> > > It works as follows. This query tried to read 20M rows from a CSV file,
> > > using 3 background worker processes.
> > >
> > > postgres=# set max_parallel_degree = 3;
> > > SET
> > > postgres=# explain analyze select * from test_csv where id % 20 = 6;
> > >   QUERY PLAN
> > >
> >
> 
> > 
> > >  Gather  (cost=1000.00..194108.60 rows=94056 width=52)
> > >  (actual time=0.570..19268.010 rows=200 loops=1)
> > >Number of Workers: 3
> > >->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056
> > width=52)
> > >   (actual time=0.180..12744.655
> rows=50
> > loops=4)
> > >  Filter: ((id % 20) = 6)
> > >  Rows Removed by Filter: 950
> > >  Foreign File: /tmp/testdata.csv
> > >  Foreign File Size: 1504892535
> > >  Planning time: 0.147 ms
> > >  Execution time: 19330.201 ms
> > > (9 rows)
> >
> > Could you try it not in parallel and then with 1, 2, 3, and 4 workers
> > and post the times for all?
> >
> The above query has 5% selectivity on the entire CSV file.
> Its execution time (total, only ForeignScan) are below
> 
>  total ForeignScandiff
> 0 workers: 17584.319 ms   17555.904 ms  28.415 ms
> 1 workers: 18464.476 ms   18110.968 ms 353.508 ms
> 2 workers: 19042.755 ms   14580.335 ms4462.420 ms
> 3 workers: 19318.254 ms   12668.912 ms6649.342 ms
> 4 workers: 21732.910 ms   13596.788 ms8136.122 ms
> 5 workers: 23486.846 ms   14533.409 ms8953.437 ms
> 
> This workstation has 4 CPU cores, so it is natural nworkers=3 records the
> peak performance on ForeignScan portion. On the other hands, nworkers>1 also
> recorded unignorable time consumption (probably, by Gather node?)
> 
> An interesting observation was, less selectivity (1% and 0%) didn't change the
> result so much. Something consumes CPU time other than file_fdw.
> 
> * selectivity 1%
>total   ForeignScan   diff
> 0 workers: 17573.572 ms   17566.875 ms  6.697 ms
> 1 workers: 18098.070 

Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-01-28 Thread Kouhei Kaigai
> On Mon, Jan 25, 2016 at 8:06 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > Sorry for my late response. I've been unavailable to have enough
> > time to touch code for the last 1.5 month.
> >
> > The attached patch is a revised one to handle private data of
> > foreign/custom scan node more gracefully.
> >
> > The overall consensus upthread were:
> > - A new ExtensibleNodeMethods structure defines a unique name
> >   and a set of callbacks to handle node copy, serialization,
> >   deserialization and equality checks.
> > - (Foreign|Custom)(Path|Scan|ScanState) are first host of the
> >   ExtensibleNodeMethods, to allow extension to define larger
> >   structure to store its private fields.
> > - ExtensibleNodeMethods does not support variable length
> >   structure (like a structure with an array on the tail, use
> >   separately allocated array).
> > - ExtensibleNodeMethods shall be registered on _PG_init() of
> >   extensions.
> >
> > The 'pgsql-v9.6-custom-private.v3.patch' is the main part of
> > this feature. As I pointed out before, it uses dynhash instead
> > of the self invented hash table.
> 
> On a first read-through, I see nothing in this patch to which I would
> want to object except for the fact that the comments and documentation
> need some work from a native speaker of English.  It looks like what
> we discussed, and I think it's an improvement over what we have now.
>
Thanks,

Do you think we should allow registering the same extensible node name for
different node types? For example, "GpuJoin" for any of CustomPath, CustomScan
and CustomScanState. Or should we avoid this by using a different name for each?

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-01-28 Thread Kouhei Kaigai
> On Thu, Jan 28, 2016 at 10:18 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > Do you think we shall allow to register same extensible node name for
> > different node types? Like, "GpuJoin" for any of CustomPath, CustomScan
> > and CustomScanState. Or, do we avoid this using different name for each?
> 
> I'd say a different name for each.  That's our current convention, and
> I don't see much reason to change it.
>
OK, it is not a serious problem, at least for my use cases.
A convention like "GpuJoinPath", "GpuJoin" and "GpuJoinState" is sufficient.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] CustomScan under the Gather node?

2016-01-27 Thread Kouhei Kaigai
> On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > What enhancement will be necessary to implement similar feature of
> > partial seq-scan using custom-scan interface?
> >
> > It seems to me callbacks on the three points below are needed.
> > * ExecParallelEstimate
> > * ExecParallelInitializeDSM
> > * ExecParallelInitializeWorker
> >
> > Anything else?
> > Does ForeignScan also need equivalent enhancement?
> 
> For postgres_fdw, running the query from a parallel worker would
> change the transaction semantics.  Suppose you begin a transaction,
> UPDATE data on the foreign server, and then run a parallel query.  If
> the leader performs the ForeignScan it will see the uncommitted
> UPDATE, but a worker would have to make its own connection which not
> be part of the same transaction and which would therefore not see the
> update.  That's a problem.
>
Ah, yes. As long as the FDW driver ensures the remote session has no
uncommitted data, pg_export_snapshot() might provide us an opportunity;
however, once a session writes something, the FDW driver has to prohibit it.

> Also, for postgres_fdw, and many other FDWs I suspect, the assumption
> is that most of the work is being done on the remote side, so doing
> the work in a parallel worker doesn't seem super interesting.  Instead
> of incurring transfer costs to move the data from remote to local, we
> incur two sets of transfer costs: first remote to local, then worker
> to leader.  Ouch.  I think a more promising line of inquiry is to try
> to provide asynchronous execution when we have something like:
> 
> Append
> -> Foreign Scan
> -> Foreign Scan
> 
> ...so that we can return a row from whichever Foreign Scan receives
> data back from the remote server first.
> 
> So it's not impossible that an FDW author could want this, but mostly
> probably not.  I think.
>
Yes, I have the same opinion. Likely, local parallelism is not
valuable for the class of FDWs that obtain data from a remote
server (e.g., postgres_fdw, ...), except for the case when the packing
and unpacking cost over the network is the major bottleneck.

On the other hand, it will be valuable for the class of FDWs that
perform as a wrapper around a local data structure, as the current
partial seq-scan does (e.g., file_fdw, ...).
Their data source is not under transaction control, and the 'remote
execution' of these FDWs is eventually performed on local
computing resources.

If I would make a proof-of-concept patch with interface itself, it
seems to me file_fdw may be a good candidate for this enhancement.
It is not a field for postgres_fdw.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




Re: [HACKERS] CustomScan under the Gather node?

2016-01-26 Thread Kouhei Kaigai
> -Original Message-
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> Sent: Wednesday, January 27, 2016 2:30 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org
> Subject: ##freemail## Re: [HACKERS] CustomScan under the Gather node?
> 
> On Tue, Jan 26, 2016 at 12:00 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> >
> > Hello,
> >
> > What enhancement will be necessary to implement similar feature of
> > partial seq-scan using custom-scan interface?
> >
> > It seems to me callbacks on the three points below are needed.
> > * ExecParallelEstimate
> > * ExecParallelInitializeDSM
> > * ExecParallelInitializeWorker
> >
> > Anything else?
> 
> I don't think so.
> 
> > Does ForeignScan also need equivalent enhancement?
> 
> I think this depends on the way ForeignScan is supposed to be
> parallelized, basically if it needs to coordinate any information
> with other set of workers, then it will require such an enhancement.
>
After yesterday's post, I recalled a possible scenario around an FDW
that manages its own private storage, like cstore_fdw.

A ForeignScan node performing on a columnar store (for example) will
probably need coordination information, just as partial seq-scan does.
It is a case very similar to the implementation on local storage.

On the other hand, if we try to parallelize postgres_fdw (or others)
with background workers, I doubt whether we need this coordination
information on the local side. The remote query will have an additional
qualifier to skip blocks already fetched for this purpose.
At least, it does not need any special enhancement.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>




[HACKERS] CustomScan under the Gather node?

2016-01-25 Thread Kouhei Kaigai
Hello,

What enhancement will be necessary to implement similar feature of
partial seq-scan using custom-scan interface?

It seems to me callbacks on the three points below are needed.
* ExecParallelEstimate
* ExecParallelInitializeDSM
* ExecParallelInitializeWorker

Anything else?
Does ForeignScan also need equivalent enhancement?



Background of my motivation is the slides below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(LT slides in JPUG conference last Dec)

I'm investigating an SSD-to-GPU direct feature on top of
the custom-scan interface. It intends to load a bunch of data
blocks on NVMe-SSD to GPU RAM using peer-to-peer DMA, prior to
data loading onto CPU/RAM. (Probably, only all-visible blocks
shall be loaded, as in an index-only scan.)
Once we load the data blocks onto GPU RAM, we can reduce the rows
to be filtered out later, before they consume CPU RAM.
An expected major bottleneck is the CPU thread that issues the
peer-to-peer DMA requests to the device, rather than the GPU tasks.
So, utilization of parallel execution is a natural thought.
However, a CustomScan node that takes an underlying PartialSeqScan
node is not sufficient, because that one loads the data blocks
onto CPU RAM first, and P2P DMA then does not make sense.

The expected "GpuSsdScan" on CustomScan will reference a shared
block index to be incremented by multiple backends, then it
enqueues a P2P DMA request (if all-visible) to the device driver.
Then it receives only the rows visible to the scan qualifiers.
It is almost equivalent to SeqScan, but wants to bypass the heap layer
to utilize the SSD-to-GPU direct data transfer path.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-01-25 Thread Kouhei Kaigai
Sorry for my late response. I have not had enough time to
touch the code for the last 1.5 months.

The attached patch is a revised one that handles the private data of
foreign/custom scan nodes more gracefully.

The overall consensus upthread was:
- A new ExtensibleNodeMethods structure defines a unique name
  and a set of callbacks to handle node copy, serialization,
  deserialization and equality checks.
- (Foreign|Custom)(Path|Scan|ScanState) are the first hosts of
  ExtensibleNodeMethods, to allow an extension to define a larger
  structure to store its private fields.
- ExtensibleNodeMethods does not support variable-length
  structures (for a structure with an array on the tail, use a
  separately allocated array).
- ExtensibleNodeMethods shall be registered in _PG_init() of
  extensions.

The 'pgsql-v9.6-custom-private.v3.patch' is the main part of
this feature. As I pointed out before, it uses dynahash instead
of a self-invented hash table.

Interfaces are defined as follows (not changed from v2):

  typedef struct ExtensibleNodeMethods
  {
 const char *extnodename;
 Sizenode_size;
 void  (*nodeCopy)(Node *newnode, const Node *oldnode);
 bool  (*nodeEqual)(const Node *a, const Node *b);
 void  (*nodeOut)(struct StringInfoData *str, const Node *node);
 void  (*nodeRead)(Node *node);
  } ExtensibleNodeMethods;
  
  extern void
  RegisterExtensibleNodeMethods(const ExtensibleNodeMethods *methods);
  
  extern const ExtensibleNodeMethods *
  GetExtensibleNodeMethods(const char *extnodename, bool missing_ok);


Also, 'extensible-node-example-on-pgstrom.patch' is a working
example on its "GpuScan" node.
The code below uses all of copy, serialization and deserialization:

gscan = (GpuScan *)stringToNode(nodeToString(copyObject(cscan)));
elog(INFO, "GpuScan: %s", nodeToString(gscan));

Then, I could confirm private fields are reproduced correctly.

In addition to this, I'd like to suggest two small improvements.

In the nodeOut callback, an extension will need _outToken() and _outBitmap();
however, these two functions are static. Entry points for extensions
are needed. (Of course, an extension can copy & paste these small functions...)

ExtensibleNodeMethods may be registered with a unique pair of its
name and the node tag it is associated with. The current code requires
that the name be globally unique; however, that is a bit inconvenient.
In the case of CustomScan, an extension needs to define three nodes: CustomPath,
CustomScan and CustomScanState; thus, the ExtensibleNodeMethods
associated with these nodes must each have an individually unique name, like
"GpuScanPath", "GpuScan" and "GpuScanState".
If extnodename only had to be unique within a particular node type, we could
apply the same name to all three of the above.

What do you think?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>

> -Original Message-
> From: Kaigai Kouhei(海外 浩平)
> Sent: Wednesday, December 02, 2015 5:52 PM
> To: 'Robert Haas'
> Cc: Andres Freund; Amit Kapila; pgsql-hackers
> Subject: Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan 
> support
> on readfuncs.c)
> 
> > On Thu, Nov 26, 2015 at 5:27 AM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > > I'm now implementing. The above design perfectly works on ForeignScan.
> > > On the other hands, I'd like to have deeper consideration for CustomScan.
> > >
> > > My recent patch adds LibraryName and SymbolName on CustomScanMethods
> > > to lookup the method table even if library is not loaded yet.
> > > However, this ExtensibleNodeMethods relies custom scan provider shall
> > > be loaded, by parallel infrastructure, prior to the deserialization.
> > > It means extension has a chance to register itself as well.
> > >
> > > My idea is, redefine CustomScanMethod as follows:
> > >
> > > typedef struct ExtensibleNodeMethods
> > > {
> > > const char *extnodename;
> > > Sizenode_size;
> > > Node *(*nodeCopy)(const Node *from);
> > > bool  (*nodeEqual)(const Node *a, const Node *b);
> > > void  (*nodeOut)(struct StringInfoData *str, const Node *node);
> > > void  (*nodeRead)(Node *node);
> > > } ExtensibleNodeMethods;
> > >
> > > typedef struct CustomScanMethods
> > > {
> > > union {
> > > const char *CustomName;
> > > ExtensibleNodeMethods  xnode;
> > > };
> > > /* Create execution state (CustomScanState) from a CustomScan plan 
> > > node
> > */
> > > Node   *(*CreateCustomScanState) (struct CustomScan *cscan);
> > > } CustomScanMethods;
