Re: Parallel bitmap index scan

2020-12-22 Thread Dilip Kumar
On Wed, 23 Dec 2020 at 4:15 AM, Tomas Vondra wrote:

> On 11/11/20 8:52 PM, Tomas Vondra wrote:
> > Hi,
> >
> > I took a look at this today, doing a bit of stress-testing, and I can
> > get it to crash because of segfaults in pagetable_create (not sure if
> > the issue is there; it might be just a symptom of an issue elsewhere).
> >
> > Attached is a shell script I use to run the stress test - it uses the
> > 'test' database, generates tables of different sizes, and then runs
> > queries with various parameter combinations. It takes a while to trigger
> > the crash, so it might depend on timing or something like that.
> >
> > I've also attached two examples of backtraces. I've also seen an infinite
> > loop in pagetable_create, but the crashes are much more common.
> >
>
> Hi Dilip,
>
> Do you plan to work on this for PG14? I haven't noticed any response in
> this thread dealing with the crashes I reported a while ago. Also, it
> doesn't seem to be added to any of the commitfests.



Hi Tomas,

Thanks for testing this.  Actually, we have noticed a significant
performance drop in many cases due to tbm_merge, so we are discussing
different approaches off-list and testing their performance.  Basically,
in the current approach all the workers first prepare their own bitmap
hash and then merge it into the common bitmap hash under a lock.  Based
on the off-list discussion with Robert, the next approach I am trying is
to insert directly into the shared bitmap hash while scanning the index
itself.  So now, instead of preparing a separate bitmap, all the workers
will insert directly into the shared bitmap hash.  I agree that getting
each page from the bitmap hash requires acquiring the lock, and this
might generate a lot of lock contention, but we want to try the POC and
check the performance.  In fact, I have already implemented the POC, and
the results aren't great.  But I am still experimenting with it to see
whether the lock can be made more granular than it is now.  I will share
my findings soon, along with the POC patch.  A rough sketch of the idea
is below.
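
(A minimal sketch, assuming a hypothetical helper and lock name;
tbm_add_tuples() is the existing tidbitmap API, which a real POC would
need to adapt for a shared page table.)

#include "postgres.h"
#include "nodes/tidbitmap.h"
#include "storage/itemptr.h"
#include "storage/lwlock.h"

/*
 * Hypothetical helper: each worker adds the TIDs found during its slice
 * of the parallel index scan straight into the one shared TIDBitmap,
 * instead of building a private bitmap and merging it later.  The LWLock
 * serializes inserts into the shared hash; this per-insert locking is
 * exactly the contention point mentioned above.
 */
static void
shared_tbm_add_tuple(TIDBitmap *shared_tbm, LWLock *pagetable_lock,
                     ItemPointer tid, bool recheck)
{
    LWLockAcquire(pagetable_lock, LW_EXCLUSIVE);
    tbm_add_tuples(shared_tbm, tid, 1, recheck);
    LWLockRelease(pagetable_lock);
}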

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Parallel bitmap index scan

2020-12-22 Thread Tomas Vondra
On 11/11/20 8:52 PM, Tomas Vondra wrote:
> Hi,
> 
> I took a look at this today, doing a bit of stress-testing, and I can
> get it to crash because of segfaults in pagetable_create (not sure if
> the issue is there; it might be just a symptom of an issue elsewhere).
> 
> Attached is a shell script I use to run the stress test - it uses the
> 'test' database, generates tables of different sizes, and then runs
> queries with various parameter combinations. It takes a while to trigger
> the crash, so it might depend on timing or something like that.
> 
> I've also attached two examples of backtraces. I've also seen an infinite
> loop in pagetable_create, but the crashes are much more common.
> 

Hi Dilip,

Do you plan to work on this for PG14? I haven't noticed any response in
this thread dealing with the crashes I reported a while ago. Also, it
doesn't seem to be added to any of the commitfests.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel bitmap index scan

2020-11-11 Thread Tomas Vondra
Hi,

I took a look at this today, doing a bit of stress-testing, and I can
get it to crash because of segfaults in pagetable_create (not sure if
the issue is there; it might be just a symptom of an issue elsewhere).

Attached is a shell script I use to run the stress test - it uses the
'test' database, generates tables of different sizes, and then runs
queries with various parameter combinations. It takes a while to trigger
the crash, so it might depend on timing or something like that.

I've also attached two examples of backtraces. I've also seen an infinite
loop in pagetable_create, but the crashes are much more common.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Core was generated by `postgres: postgres test [local] SELECT'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  MemoryContextAllocZero (context=context@entry=0x561c0598cf40, size=size@entry=48) at mcxt.c:852
852 ret = context->methods->alloc(context, size);
(gdb) bt
#0  MemoryContextAllocZero (context=context@entry=0x561c0598cf40, size=size@entry=48) at mcxt.c:852
#1  0x561c0467b95c in pagetable_create (nelements=128, private_data=0x7f155f7ef000, ctx=0x561c0598cf40) at ../../../src/include/lib/simplehash.h:457
#2  tbm_create_pagetable (tbm=tbm@entry=0x7f155f7ef000) at tidbitmap.c:296
#3  0x561c0467bae3 in tbm_get_pageentry (tbm=tbm@entry=0x7f155f7ef000, pageno=3779) at tidbitmap.c:1303
#4  0x561c0467bed5 in tbm_union_page (a=a@entry=0x7f155f7ef000, bpage=0x7f155f7e21b8) at tidbitmap.c:514
#5  0x561c0467c5a0 in tbm_union (b=0x561c059cd468, a=0x7f155f7ef000) at tidbitmap.c:474
#6  tbm_union (a=0x7f155f7ef000, b=0x561c059cd468) at tidbitmap.c:457
#7  0x561c0467cb77 in tbm_merge (tbm=0x561c059cd468, dp_tbm=, dp_pagetable=0x7f155f8d4418) at tidbitmap.c:822
#8  0x561c0461c80f in BitmapHeapNext (node=node@entry=0x561c05996400) at nodeBitmapHeapscan.c:228
#9  0x561c0460f611 in ExecScanFetch (recheckMtd=0x561c0461bf10 , accessMtd=0x561c0461bfa0 , node=0x561c05996400) at execScan.c:133
#10 ExecScan (node=0x561c05996400, accessMtd=0x561c0461bfa0 , recheckMtd=0x561c0461bf10 ) at execScan.c:182
#11 0x561c04615d31 in ExecProcNode (node=0x561c05996400) at ../../../src/include/executor/executor.h:247
#12 fetch_input_tuple (aggstate=aggstate@entry=0x561c05995d98) at nodeAgg.c:589
#13 0x561c046188c8 in agg_retrieve_direct (aggstate=) at nodeAgg.c:2356
#14 ExecAgg (pstate=) at nodeAgg.c:2171
#15 0x561c0461f487 in ExecProcNode (node=0x561c05995d98) at ../../../src/include/executor/executor.h:247
#16 gather_getnext (gatherstate=0x561c05995bf8) at nodeGather.c:295
#17 ExecGather (pstate=0x561c05995bf8) at nodeGather.c:227
#18 0x561c04615d31 in ExecProcNode (node=0x561c05995bf8) at ../../../src/include/executor/executor.h:247
#19 fetch_input_tuple (aggstate=aggstate@entry=0x561c059955d0) at nodeAgg.c:589
#20 0x561c046188c8 in agg_retrieve_direct (aggstate=) at nodeAgg.c:2356
#21 ExecAgg (pstate=) at nodeAgg.c:2171
#22 0x561c04606b4b in ExecProcNode (node=0x561c059955d0) at ../../../src/include/executor/executor.h:247
#23 ExecutePlan (execute_once=, dest=0x561c059c6550, direction=, numberTuples=0, sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x561c059955d0, estate=0x561c05995378) at execMain.c:1542
#24 standard_ExecutorRun (queryDesc=0x561c058fdac8, direction=, count=0, execute_once=) at execMain.c:364
#25 0x561c0475f39c in PortalRunSelect (portal=0x561c0593f758, forward=, count=0, dest=) at pquery.c:912
#26 0x561c04760546 in PortalRun (portal=portal@entry=0x561c0593f758, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x561c059c6550, altdest=altdest@entry=0x561c059c6550, qc=0x7ffd52084c80) at pquery.c:756
#27 0x561c0475c33c in exec_simple_query (query_string=0x561c058dbce8 "select count(b)from t where a between 3387 and 4027;") at postgres.c:1239
#28 0x561c0475dee0 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffd520850e0, dbname=, username=) at postgres.c:4305
#29 0x561c046e709c in BackendRun (port=, port=) at postmaster.c:4488
#30 BackendStartup (port=) at postmaster.c:4210
#31 ServerLoop () at postmaster.c:1727
#32 0x561c046e7f3f in PostmasterMain (argc=, argv=0x561c058d6550) at postmaster.c:1400
#33 0x561c0448d970 in main (argc=3, argv=0x561c058d6550) at main.c:209


Core was generated by `postgres: postgres test [local] EXPLAIN'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  MemoryContextAllocZero (context=context@entry=0x561c0598ed90, size=size@entry=48) at mcxt.c:852
852 ret = context->methods->alloc(context, size);
(gdb) bt
#0  MemoryContextAllocZero (context=context@entry=0x561c0598ed90, size=size@entry=48) at mcxt.c:852
#1  0x561c0467b95c in 

Re: Parallel bitmap index scan

2020-09-21 Thread Dilip Kumar
On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar  wrote:
>
> I would like to propose a patch enabling parallelism for the
> bitmap index scan path.
>
> Background:
> Currently, we support only a parallel bitmap heap scan path.  Therein,
> the underlying bitmap index scan is done by a single worker called the
> leader.  The leader creates a bitmap in shared memory, and once the
> bitmap is ready it creates a shared iterator; after that, all the
> workers process the shared iterator and scan the heap in parallel.
> While analyzing TPCH plans we have observed that some of the
> queries spend significant time in preparing the bitmap.  So the
> idea of this patch is to use a parallel index scan for preparing the
> underlying bitmap in parallel.
>
> Design:
> If the underlying index AM supports a parallel path (currently only
> BTREE supports it), then we will create a parallel bitmap heap scan
> path on top of a parallel bitmap index scan path.  The idea of this
> patch is that each worker will do a parallel index scan and generate
> its part of the bitmap.  We will create a barrier so that we cannot
> start preparing the shared iterator until all the workers are ready
> with their bitmaps.  The first worker that is ready with its bitmap
> will keep a copy of its TBM and the page table in shared memory, and
> all subsequent workers will merge their TBMs into the shared TBM.
> Once all the TBMs are merged we will have one common shared TBM, and
> after that stage the workers can continue.  The remaining part is the
> same as today: one worker will scan the shared TBM and prepare the
> shared iterator, and once it is ready all the workers will jointly
> scan the heap in parallel using the shared iterator.
>
> BitmapHeapNext
> {
>     ...
>     BarrierAttach();
>     tbm = MultiExecProcNode();
>     tbm_merge(tbm);              /* merge into the common TBM using tbm_union */
>     BarrierArriveAndWait();
>
>     /* only one worker comes out of this */
>     if (BitmapShouldInitializeSharedState(pstate))
>     {
>         tbm_prepare_shared_iterate();
>         BitmapDoneInitializingSharedState();    /* wake up the others */
>     }
>     tbm_attach_shared_iterate();  /* all workers attach to the shared iterator */
>     ...
> }
>
> Performance:  With scale factor 10, I could see that Q6 spends
> significant time in the bitmap index scan, so I have measured
> performance with that query, and I can see that the bitmap index scan
> node is 3x faster with 3 workers, whereas the overall plan got ~40%
> faster.
>
> TPCH: S.F. 10, work_mem=512MB  shared_buffers: 1GB
>
> HEAD:
>   Limit  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.121..5260.122 rows=1 loops=1)
>    ->  Finalize Aggregate  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.119..5260.119 rows=1 loops=1)
>      ->  Gather  (cost=1559776.69..1559777.00 rows=3 width=32) (actual time=5257.251..5289.595 rows=4 loops=1)
>            Workers Planned: 3
>            Workers Launched: 3
>            ->  Partial Aggregate  (cost=1558776.69..1558776.70 rows=1 width=32) (actual time=5247.714..5247.714 rows=1 loops=4)
>                  ->  Parallel Bitmap Heap Scan on lineitem  (cost=300603.01..1556898.89 rows=375560 width=12) (actual time=3475.944..5037.484 rows=285808 loops=4)
>                        Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
>                        Heap Blocks: exact=205250
>                        ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=3169.855..3169.855 rows=1143234 loops=1)
>                              Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
>  Planning Time: 0.659 ms
>  Execution Time: 5289.787 ms
> (13 rows)
>
>
> PATCH:
>
>   Limit  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=.572...572 rows=1 loops=1)
>    ->  Finalize Aggregate  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=.569...569 rows=1 loops=1)
>      ->  Gather  (cost=1559579.52..1559579.83 rows=3 width=32) (actual time=3328.619..3365.227 rows=4 loops=1)
>            Workers Planned: 3
>            Workers Launched: 3
>            ->  Partial Aggregate  (cost=1558579.52..1558579.53 rows=1 width=32)

Re: Parallel bitmap index scan

2020-08-17 Thread Dilip Kumar
On Mon, 17 Aug 2020 at 7:42 PM, Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote:

> On Sun, Jul 26, 2020 at 6:43 PM Dilip Kumar  wrote:
> >
> > I would like to propose a patch enabling parallelism for the
> > bitmap index scan path.
> >
> > Background:
> > Currently, we support only a parallel bitmap heap scan path.  Therein,
> > the underlying bitmap index scan is done by a single worker called the
> > leader.  The leader creates a bitmap in shared memory, and once the
> > bitmap is ready it creates a shared iterator; after that, all the
> > workers process the shared iterator and scan the heap in parallel.
> > While analyzing TPCH plans we have observed that some of the
> > queries spend significant time in preparing the bitmap.  So the
> > idea of this patch is to use a parallel index scan for preparing the
> > underlying bitmap in parallel.
> >
> > Design:
> > If the underlying index AM supports a parallel path (currently only
> > BTREE supports it), then we will create a parallel bitmap heap scan
> > path on top of a parallel bitmap index scan path.  The idea of this
> > patch is that each worker will do a parallel index scan and generate
> > its part of the bitmap.  We will create a barrier so that we cannot
> > start preparing the shared iterator until all the workers are ready
> > with their bitmaps.  The first worker that is ready with its bitmap
> > will keep a copy of its TBM and the page table in shared memory, and
> > all subsequent workers will merge their TBMs into the shared TBM.
> > Once all the TBMs are merged we will have one common shared TBM, and
> > after that stage the workers can continue.  The remaining part is the
> > same as today: one worker will scan the shared TBM and prepare the
> > shared iterator, and once it is ready all the workers will jointly
> > scan the heap in parallel using the shared iterator.
>
> Though I have not looked at the patch or code for the existing
> parallel bitmap heap scan, one point keeps bugging me. I may
> be utterly wrong, or my question may be silly; anyway, I would like
> to ask here:
>
> From the above design: each parallel worker creates partial bitmaps
> for the index data that they looked at. Why should they merge these
> bitmaps into a single bitmap in shared memory? Why can't each parallel
> worker do a bitmap heap scan using the partial bitmap it built
> during its bitmap index scan and emit qualified tuples/rows so that
> the gather node can collect them? There may not even be lock
> contention, as the bitmap heap scan takes read locks on the heap
> pages/tuples.


The main reason is that there could be lossy pages in a bitmap, and if
that is the case then there will be duplicate data.  Maybe if there were
no lossy entries in any of the bitmaps we might do as you describe, but I
still think it is very much possible that different bitmaps have many heap
pages in common, because each bitmap is prepared from an index scan.  And
in such cases we would be doing duplicate I/O.
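
(To make the duplicate-I/O point concrete, here is a minimal sketch using
the existing tidbitmap API; the page number and the scenario are invented
for illustration.)

#include "postgres.h"
#include "miscadmin.h"          /* work_mem */
#include "nodes/tidbitmap.h"

static void
overlap_example(void)
{
    /* Two workers' partial bitmaps that cover the same heap page. */
    TIDBitmap  *tbm_a = tbm_create(work_mem * 1024L, NULL);
    TIDBitmap  *tbm_b = tbm_create(work_mem * 1024L, NULL);

    tbm_add_page(tbm_a, 100);   /* worker A: page 100 went lossy */
    tbm_add_page(tbm_b, 100);   /* worker B also has entries on page 100 */

    /*
     * Scanning tbm_a and tbm_b independently would read heap page 100 twice
     * and could emit the same tuples twice.  tbm_union() collapses the two
     * entries into one, which is why the partial bitmaps are merged before
     * the shared heap scan starts.
     */
    tbm_union(tbm_a, tbm_b);    /* afterwards only tbm_a needs to be scanned */

    tbm_free(tbm_b);
    tbm_free(tbm_a);
}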

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel bitmap index scan

2020-08-17 Thread Bharath Rupireddy
On Sun, Jul 26, 2020 at 6:43 PM Dilip Kumar  wrote:
>
> I would like to propose a patch enabling parallelism for the
> bitmap index scan path.
>
> Background:
> Currently, we support only a parallel bitmap heap scan path.  Therein,
> the underlying bitmap index scan is done by a single worker called the
> leader.  The leader creates a bitmap in shared memory, and once the
> bitmap is ready it creates a shared iterator; after that, all the
> workers process the shared iterator and scan the heap in parallel.
> While analyzing TPCH plans we have observed that some of the
> queries spend significant time in preparing the bitmap.  So the
> idea of this patch is to use a parallel index scan for preparing the
> underlying bitmap in parallel.
>
> Design:
> If the underlying index AM supports a parallel path (currently only
> BTREE supports it), then we will create a parallel bitmap heap scan
> path on top of a parallel bitmap index scan path.  The idea of this
> patch is that each worker will do a parallel index scan and generate
> its part of the bitmap.  We will create a barrier so that we cannot
> start preparing the shared iterator until all the workers are ready
> with their bitmaps.  The first worker that is ready with its bitmap
> will keep a copy of its TBM and the page table in shared memory, and
> all subsequent workers will merge their TBMs into the shared TBM.
> Once all the TBMs are merged we will have one common shared TBM, and
> after that stage the workers can continue.  The remaining part is the
> same as today: one worker will scan the shared TBM and prepare the
> shared iterator, and once it is ready all the workers will jointly
> scan the heap in parallel using the shared iterator.
>

Though I have not looked at the patch or code for the existing
parallel bitmap heap scan, one point keeps bugging me. I may
be utterly wrong, or my question may be silly; anyway, I would like
to ask here:

From the above design: each parallel worker creates partial bitmaps
for the index data that they looked at. Why should they merge these
bitmaps into a single bitmap in shared memory? Why can't each parallel
worker do a bitmap heap scan using the partial bitmap it built
during its bitmap index scan and emit qualified tuples/rows so that
the gather node can collect them? There may not even be lock
contention, as the bitmap heap scan takes read locks on the heap
pages/tuples.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel bitmap index scan

2020-07-28 Thread Dilip Kumar
On Mon, Jul 27, 2020 at 3:48 AM Thomas Munro  wrote:
>
> On Mon, Jul 27, 2020 at 1:58 AM Dilip Kumar  wrote:
> > On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar  wrote:
> > >
> > > I would like to propose a patch enabling parallelism for the
> > > bitmap index scan path.
>
> Workers Planned: 4
> ->  Parallel Bitmap Heap Scan on tenk1
>   Recheck Cond: (hundred > 1)
> - ->  Bitmap Index Scan on tenk1_hundred
> + ->  Parallel Bitmap Index Scan on tenk1_hundred
> Index Cond: (hundred > 1)
>
> +1, this is a very good feature to have.
>
> +   /* Merge bitmap to a common shared bitmap */
> +   SpinLockAcquire(&pstate->mutex);
> +   tbm_merge(tbm, &pstate->tbm_shared, &pstate->pt_shared);
> +   SpinLockRelease(&pstate->mutex);
>
> This doesn't look like a job for a spinlock.
>
> You have a barrier so that you can wait until all workers have
> finished merging their partial bitmap into the complete bitmap, which
> makes perfect sense.  You also use that spinlock (probably should be
> LWLock) to serialise the bitmap merging work...  Hmm, I suppose there
> would be an alternative design which also uses the barrier to
> serialise the merge, and has the merging done entirely by one process,
> like this:
>
>   bool chosen_to_merge = false;
>
>   /* Attach to the barrier, and see what phase we're up to. */
>   switch (BarrierAttach())
>   {
>   case PBH_BUILDING:
> ... build my partial bitmap in shmem ...
> chosen_to_merge = BarrierArriveAndWait();
> /* Fall through */
>
>   case PBH_MERGING:
> if (chosen_to_merge)
>   ... perform merge of all partial results into final shmem bitmap ...
> BarrierArriveAndWait();
> /* Fall through */
>
>   case PBH_SCANNING:
> /* We attached too late to help build the bitmap.  */
> BarrierDetach();
> break;
>   }
>
> Just an idea, not sure if it's a good one.  I find it a little easier
> to reason about the behaviour of late-attaching workers when the
> phases are explicitly named and handled with code like that (it's not
> immediately clear to me whether your coding handles late attachers
> correctly, which seems to be one of the main problems to think about
> with "dynamic party" parallelism...).

Actually, for merging, I could not use the strategy you suggested,
because in this case all the workers prepare their TBMs and merge them
into the shared TBM.  Basically, we don't need to choose a leader for
that: all the workers need to merge their TBMs into the shared location,
but one at a time; also, we don't need to wait for all the workers to
prepare their TBMs before they start merging.  However, once the merge is
over we do need to wait for all the workers to finish merging, and after
that only one worker is allowed to prepare the shared iterator.  For
that, I have used your idea of the barrier phase.  A sketch of the
resulting flow is below.
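
(A hedged sketch, not the patch itself: the field names merge_lock,
tbm_shared, pt_shared and barrier on the shared state are assumptions;
tbm_merge() is this patch's new function, and the rest is the existing
nodeBitmapHeapscan.c machinery.)

/* In BitmapHeapNext(), for a parallel bitmap index scan: */
tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));

/* Merge as soon as our own TBM is ready; the lock serializes the merges. */
LWLockAcquire(&pstate->merge_lock, LW_EXCLUSIVE);
tbm_merge(tbm, pstate->tbm_shared, pstate->pt_shared);
LWLockRelease(&pstate->merge_lock);

/* Wait here until every attached worker has finished its merge. */
BarrierArriveAndWait(&pstate->barrier, 0 /* wait event */);

/* Exactly one worker prepares the shared iterator; the others wait. */
if (BitmapShouldInitializeSharedState(pstate))
{
    pstate->tbmiterator = tbm_prepare_shared_iterate(pstate->tbm_shared);
    BitmapDoneInitializingSharedState(pstate);
}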

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v3-0001-POC-Parallel-Bitmap-Index-Scan.patch
Description: Binary data


Re: Parallel bitmap index scan

2020-07-26 Thread Dilip Kumar
On Mon, 27 Jul 2020 at 3:48 AM, Thomas Munro  wrote:

> On Mon, Jul 27, 2020 at 1:58 AM Dilip Kumar  wrote:
> > On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar  wrote:
> > >
> > > I would like to propose a patch enabling parallelism for the
> > > bitmap index scan path.
>
> Workers Planned: 4
> ->  Parallel Bitmap Heap Scan on tenk1
>   Recheck Cond: (hundred > 1)
> - ->  Bitmap Index Scan on tenk1_hundred
> + ->  Parallel Bitmap Index Scan on tenk1_hundred
> Index Cond: (hundred > 1)
>
> +1, this is a very good feature to have.
>
> +   /* Merge bitmap to a common shared bitmap */
> +   SpinLockAcquire(&pstate->mutex);
> +   tbm_merge(tbm, &pstate->tbm_shared, &pstate->pt_shared);
> +   SpinLockRelease(&pstate->mutex);
>
> This doesn't look like a job for a spinlock.


Yes I agree with that.

You have a barrier so that you can wait until all workers have
> finished merging their partial bitmap into the complete bitmap, which
> makes perfect sense.  You also use that spinlock (probably should be
> LWLock) to serialise the bitmap merging work...  Hmm, I suppose there
> would be an alternative design which also uses the barrier to
> serialise the merge, and has the merging done entirely by one process,
> like this:
>
>   bool chosen_to_merge = false;
>
>   /* Attach to the barrier, and see what phase we're up to. */
>   switch (BarrierAttach())
>   {
>   case PBH_BUILDING:
> ... build my partial bitmap in shmem ...
> chosen_to_merge = BarrierArriveAndWait();
> /* Fall through */
>
>   case PBH_MERGING:
> if (chosen_to_merge)
>   ... perform merge of all partial results into final shmem bitmap ...
> BarrierArriveAndWait();
> /* Fall through */
>
>   case PBH_SCANNING:
> /* We attached too late to help build the bitmap.  */
> BarrierDetach();
> break;
>   }
>
> Just an idea, not sure if it's a good one.  I find it a little easier
> to reason about the behaviour of late-attaching workers when the
> phases are explicitly named and handled with code like that (it's not
> immediately clear to me whether your coding handles late attachers
> correctly, which seems to be one of the main problems to think about
> with "dynamic party" parallelism...).


Yeah, this seems a better idea.  I am handling the late-attacher case,
but the idea of using the barrier phase looks quite clean.  I will change
it this way.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel bitmap index scan

2020-07-26 Thread Thomas Munro
On Mon, Jul 27, 2020 at 1:58 AM Dilip Kumar  wrote:
> On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar  wrote:
> >
> > I would like to propose a patch enabling parallelism for the
> > bitmap index scan path.

Workers Planned: 4
->  Parallel Bitmap Heap Scan on tenk1
  Recheck Cond: (hundred > 1)
- ->  Bitmap Index Scan on tenk1_hundred
+         ->  Parallel Bitmap Index Scan on tenk1_hundred
Index Cond: (hundred > 1)

+1, this is a very good feature to have.

+   /* Merge bitmap to a common shared bitmap */
+   SpinLockAcquire(&pstate->mutex);
+   tbm_merge(tbm, &pstate->tbm_shared, &pstate->pt_shared);
+   SpinLockRelease(&pstate->mutex);

This doesn't look like a job for a spinlock.

You have a barrier so that you can wait until all workers have
finished merging their partial bitmap into the complete bitmap, which
makes perfect sense.  You also use that spinlock (probably should be
LWLock) to serialise the bitmap merging work...  Hmm, I suppose there
would be an alternative design which also uses the barrier to
serialise the merge, and has the merging done entirely by one process,
like this:

  bool chosen_to_merge = false;

  /* Attach to the barrier, and see what phase we're up to. */
  switch (BarrierAttach())
  {
  case PBH_BUILDING:
... build my partial bitmap in shmem ...
chosen_to_merge = BarrierArriveAndWait();
/* Fall through */

  case PBH_MERGING:
if (chosen_to_merge)
  ... perform merge of all partial results into final shmem bitmap ...
BarrierArriveAndWait();
/* Fall through */

  case PBH_SCANNING:
/* We attached too late to help build the bitmap.  */
BarrierDetach();
break;
  }

Just an idea, not sure if it's a good one.  I find it a little easier
to reason about the behaviour of late-attaching workers when the
phases are explicitly named and handled with code like that (it's not
immediately clear to me whether your coding handles late attachers
correctly, which seems to be one of the main problems to think about
with "dynamic party" parallelism...).
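
(For readers following along: the sketch above maps directly onto the
Barrier API in src/include/storage/barrier.h.  Below is a self-contained
rendering; BarrierInit/BarrierAttach/BarrierArriveAndWait/BarrierDetach
are the real functions, while the phase constants and the struct are
invented for illustration.)

#include "postgres.h"
#include "storage/barrier.h"

/* Hypothetical phase numbers; BarrierInit() starts at phase 0. */
#define PBH_BUILDING 0
#define PBH_MERGING  1
#define PBH_SCANNING 2

typedef struct ParallelBitmapHeapInfo   /* made-up shared-memory struct */
{
    Barrier barrier;        /* BarrierInit(&barrier, 0) at DSM setup */
    /* ... shared bitmap state would live here ... */
} ParallelBitmapHeapInfo;

static void
build_shared_bitmap(ParallelBitmapHeapInfo *pbminfo)
{
    bool chosen_to_merge = false;

    /* BarrierAttach() returns the phase we join in. */
    switch (BarrierAttach(&pbminfo->barrier))
    {
        case PBH_BUILDING:
            /* ... build my partial bitmap in shmem ... */

            /* BarrierArriveAndWait() returns true in exactly one worker,
             * which elects the merger without extra coordination. */
            chosen_to_merge = BarrierArriveAndWait(&pbminfo->barrier, 0);
            /* FALLTHROUGH */

        case PBH_MERGING:
            if (chosen_to_merge)
            {
                /* ... merge partial results into the final bitmap ... */
            }
            BarrierArriveAndWait(&pbminfo->barrier, 0);
            /* FALLTHROUGH */

        case PBH_SCANNING:
            /* Late attachers land here and must not touch the bitmap. */
            break;
    }

    /* The barrier's job is done once the bitmap is complete. */
    BarrierDetach(&pbminfo->barrier);
}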




Re: Parallel bitmap index scan

2020-07-26 Thread Dilip Kumar
On Sun, Jul 26, 2020 at 6:42 PM Dilip Kumar  wrote:
>
> I would like to propose a patch enabling parallelism for the
> bitmap index scan path.
>
> Background:
> Currently, we support only a parallel bitmap heap scan path.  Therein,
> the underlying bitmap index scan is done by a single worker called the
> leader.  The leader creates a bitmap in shared memory, and once the
> bitmap is ready it creates a shared iterator; after that, all the
> workers process the shared iterator and scan the heap in parallel.
> While analyzing TPCH plans we have observed that some of the
> queries spend significant time in preparing the bitmap.  So the
> idea of this patch is to use a parallel index scan for preparing the
> underlying bitmap in parallel.
>
> Design:
> If the underlying index AM supports a parallel path (currently only
> BTREE supports it), then we will create a parallel bitmap heap scan
> path on top of a parallel bitmap index scan path.  The idea of this
> patch is that each worker will do a parallel index scan and generate
> its part of the bitmap.  We will create a barrier so that we cannot
> start preparing the shared iterator until all the workers are ready
> with their bitmaps.  The first worker that is ready with its bitmap
> will keep a copy of its TBM and the page table in shared memory, and
> all subsequent workers will merge their TBMs into the shared TBM.
> Once all the TBMs are merged we will have one common shared TBM, and
> after that stage the workers can continue.  The remaining part is the
> same as today: one worker will scan the shared TBM and prepare the
> shared iterator, and once it is ready all the workers will jointly
> scan the heap in parallel using the shared iterator.
>
> BitmapHeapNext
> {
>     ...
>     BarrierAttach();
>     tbm = MultiExecProcNode();
>     tbm_merge(tbm);              /* merge into the common TBM using tbm_union */
>     BarrierArriveAndWait();
>
>     /* only one worker comes out of this */
>     if (BitmapShouldInitializeSharedState(pstate))
>     {
>         tbm_prepare_shared_iterate();
>         BitmapDoneInitializingSharedState();    /* wake up the others */
>     }
>     tbm_attach_shared_iterate();  /* all workers attach to the shared iterator */
>     ...
> }
>
> Performance:  With scale factor 10, I could see that Q6 spends
> significant time in the bitmap index scan, so I have measured
> performance with that query, and I can see that the bitmap index scan
> node is 3x faster with 3 workers, whereas the overall plan got ~40%
> faster.
>
> TPCH: S.F. 10, work_mem=512MB  shared_buffers: 1GB
>
> HEAD:
>   Limit  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.121..5260.122 rows=1 loops=1)
>    ->  Finalize Aggregate  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.119..5260.119 rows=1 loops=1)
>      ->  Gather  (cost=1559776.69..1559777.00 rows=3 width=32) (actual time=5257.251..5289.595 rows=4 loops=1)
>            Workers Planned: 3
>            Workers Launched: 3
>            ->  Partial Aggregate  (cost=1558776.69..1558776.70 rows=1 width=32) (actual time=5247.714..5247.714 rows=1 loops=4)
>                  ->  Parallel Bitmap Heap Scan on lineitem  (cost=300603.01..1556898.89 rows=375560 width=12) (actual time=3475.944..5037.484 rows=285808 loops=4)
>                        Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
>                        Heap Blocks: exact=205250
>                        ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=3169.855..3169.855 rows=1143234 loops=1)
>                              Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
>  Planning Time: 0.659 ms
>  Execution Time: 5289.787 ms
> (13 rows)
>
>
> PATCH:
>
>   Limit  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=.572...572 rows=1 loops=1)
>    ->  Finalize Aggregate  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=.569...569 rows=1 loops=1)
>      ->  Gather  (cost=1559579.52..1559579.83 rows=3 width=32) (actual time=3328.619..3365.227 rows=4 loops=1)
>            Workers Planned: 3
>            Workers Launched: 3
>            ->  Partial Aggregate  (cost=1558579.52..1558579.53 rows=1 width=32)

Parallel bitmap index scan

2020-07-26 Thread Dilip Kumar
I would like to propose a patch enabling parallelism for the bitmap
index scan path.

Background:
Currently, we support only a parallel bitmap heap scan path.  Therein,
the underlying bitmap index scan is done by a single worker called the
leader.  The leader creates a bitmap in shared memory, and once the
bitmap is ready it creates a shared iterator; after that, all the
workers process the shared iterator and scan the heap in parallel.
While analyzing TPCH plans we have observed that some of the
queries spend significant time in preparing the bitmap.  So the
idea of this patch is to use a parallel index scan for preparing the
underlying bitmap in parallel.

Design:
If the underlying index AM supports a parallel path (currently only
BTREE supports it), then we will create a parallel bitmap heap scan
path on top of a parallel bitmap index scan path.  The idea of this
patch is that each worker will do a parallel index scan and generate
its part of the bitmap.  We will create a barrier so that we cannot
start preparing the shared iterator until all the workers are ready
with their bitmaps.  The first worker that is ready with its bitmap
will keep a copy of its TBM and the page table in shared memory, and
all subsequent workers will merge their TBMs into the shared TBM.
Once all the TBMs are merged we will have one common shared TBM, and
after that stage the workers can continue.  The remaining part is the
same as today: one worker will scan the shared TBM and prepare the
shared iterator, and once it is ready all the workers will jointly
scan the heap in parallel using the shared iterator.

BitmapHeapNext
{
    ...
    BarrierAttach();
    tbm = MultiExecProcNode();
    tbm_merge(tbm);              /* merge into the common TBM using tbm_union */
    BarrierArriveAndWait();

    /* only one worker comes out of this */
    if (BitmapShouldInitializeSharedState(pstate))
    {
        tbm_prepare_shared_iterate();
        BitmapDoneInitializingSharedState();    /* wake up the others */
    }
    tbm_attach_shared_iterate();  /* all workers attach to the shared iterator */
    ...
}

Performance:  With scale factor 10, I could see that Q6 spends
significant time in the bitmap index scan, so I have measured
performance with that query, and I can see that the bitmap index scan
node is 3x faster with 3 workers, whereas the overall plan got ~40%
faster.

TPCH: S.F. 10, work_mem=512MB  shared_buffers: 1GB

HEAD:
  Limit  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.121..5260.122 rows=1 loops=1)
   ->  Finalize Aggregate  (cost=1559777.02..1559777.03 rows=1 width=32) (actual time=5260.119..5260.119 rows=1 loops=1)
     ->  Gather  (cost=1559776.69..1559777.00 rows=3 width=32) (actual time=5257.251..5289.595 rows=4 loops=1)
           Workers Planned: 3
           Workers Launched: 3
           ->  Partial Aggregate  (cost=1558776.69..1558776.70 rows=1 width=32) (actual time=5247.714..5247.714 rows=1 loops=4)
                 ->  Parallel Bitmap Heap Scan on lineitem  (cost=300603.01..1556898.89 rows=375560 width=12) (actual time=3475.944..5037.484 rows=285808 loops=4)
                       Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
                       Heap Blocks: exact=205250
                       ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..300311.95 rows=1164235 width=0) (actual time=3169.855..3169.855 rows=1143234 loops=1)
                             Index Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
 Planning Time: 0.659 ms
 Execution Time: 5289.787 ms
(13 rows)


PATCH:

  Limit  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=.572...572 rows=1 loops=1)
   ->  Finalize Aggregate  (cost=1559579.85..1559579.86 rows=1 width=32) (actual time=.569...569 rows=1 loops=1)
     ->  Gather  (cost=1559579.52..1559579.83 rows=3 width=32) (actual time=3328.619..3365.227 rows=4 loops=1)
           Workers Planned: 3
           Workers Launched: 3
           ->  Partial Aggregate  (cost=1558579.52..1558579.53 rows=1 width=32) (actual time=3307.805..3307.805 rows=1 loops=4)
                 ->  Parallel Bitmap Heap Scan on lineitem  (cost=300405.84..1556701.72 rows=375560 width=12) (actual time=1585.726..3097.628 rows=285808 loops=4)
                       Recheck Cond: ((l_shipdate >= '1997-01-01'::date) AND (l_shipdate < '1998-01-01 00:00:00'::timestamp without time zone) AND (l_discount >= 0.02) AND (l_discount <= 0.04) AND (l_quantity < '24'::numeric))
                       Heap Blocks: exact=184293
                       ->  Parallel Bitmap Index Sc