On Wed, Nov 23, 2016 at 7:24 AM, Robert Haas <robertmh...@gmail.com> wrote:
> So, I had a brief look at this tonight.  This is not a full review,
> but just some things I noticed:
Thanks for the review.

> + * Update snpashot info in heap scan descriptor.
>
> Typo.  Also, why should we have a function for this at all?  And if we
> do have a function for this, why should it have "bm" in the name when
> it's stored in heapam.c?

We are updating the snapshot in the HeapScanDesc; that is why I used a
function and kept it in heapam.c, alongside heap_beginscan_bm. But I
agree the name can change: "bm" is not needed, since the function does
nothing specific to bitmap scans. I will change this.

> + * [PARALLEL BITMAP HEAP SCAN ALGORITHM]
> + *
> + * #1. Shared TIDBitmap creation and initialization
> + *    a) First worker to see the state as parallel bitmap info as
> + *       PBM_INITIAL become leader and set the state to PBM_INPROGRESS
> + *       All other workers see the state as PBM_INPROGRESS will wait for
> + *       leader to complete the TIDBitmap.
> + *
> + *    Leader Worker Processing:
> + *    (Leader is responsible for creating shared TIDBitmap and create
> + *    shared page and chunk array from TIDBitmap.)
> + *    1) Create TIDBitmap using DHT.
> + *    2) Begin Iterate: convert hash table into shared page and chunk
> + *       array.
> + *    3) Restore local TIDBitmap variable information into
> + *       ParallelBitmapInfo so that other worker can see those.
> + *    4) set state to PBM_FINISHED.
> + *    5) Wake up other workers.
> + *
> + *    Other Worker Processing:
> + *    1) Wait until leader create shared TIDBitmap and shared page
> + *       and chunk array.
> + *    2) Attach to shared page table, copy TIDBitmap from
> + *       ParallelBitmapInfo to local TIDBitmap, we copy this to local
> + *       TIDBitmap so that next level processing can read information
> + *       same as in non parallel case and we can avoid extra changes
> + *       in code.
> + *
> + *    # At this level TIDBitmap is ready and all workers are awake #
> + *
> + * #2. Bitmap processing (Iterate and process the pages).
> + *    . In this phase each worker will iterate over page and chunk array and
> + *      select heap pages one by one. If prefetch is enable then there will
> + *      be two iterator.
> + *    . Since multiple worker are iterating over same page and chunk array
> + *      we need to have a shared iterator, so we grab a spin lock and iterate
> + *      within a lock.
>
> The formatting of this comment is completely haphazard.

I will fix this.

> "Leader
> worker" is not a term that has any meaning.  A given backend involved
> in a parallel query is either a leader or a worker, not both.

I agree it's confusing, but we can't simply call it "leader", because
IMHO the leader is the backend that actually spawns all the workers
and gathers the results. By "leader worker" I meant the worker that
takes responsibility for building the TIDBitmap, which can be any
worker; that is why I named it "leader worker". Let me think about
what else we could call it.

> + /* reset parallel bitmap scan info, if present */
> + if (node->parallel_bitmap)
> + {
> + ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
> +
> + pbminfo->state = PBM_INITIAL;
> + pbminfo->tbmiterator.schunkbit = 0;
> + pbminfo->tbmiterator.spageptr = 0;
> + pbminfo->tbmiterator.schunkptr = 0;
> + pbminfo->prefetch_iterator.schunkbit = 0;
> + pbminfo->prefetch_iterator.spageptr = 0;
> + pbminfo->prefetch_iterator.schunkptr = 0;
> + pbminfo->prefetch_pages = 0;
> + pbminfo->prefetch_target = -1;
> + }
>
> This is obviously not going to work in the face of concurrent
> activity.  When we did Parallel Seq Scan, we didn't make any changes
> to the rescan code at all.  I think we are assuming that there's no
> way to cause a rescan of the driving table of a parallel query; if
> that's wrong, we'll need some fix, but this isn't it.  I would just
> leave this out.

In heap_rescan we have a similar change:
    if (scan->rs_parallel != NULL)
    {
        parallel_scan = scan->rs_parallel;

        SpinLockAcquire(&parallel_scan->phs_mutex);
        parallel_scan->phs_cblock = parallel_scan->phs_startblock;
        SpinLockRelease(&parallel_scan->phs_mutex);
    }

And this is not for the driving table; it's needed when the Gather is
the inner node of a join. Consider the plan below. I know it's a bad
plan, but we can produce it :)

    NestLoop Join
      Outer Node: SeqScan(t1)
      Inner Node: Gather -> Parallel SeqScan(t2)

The reason is that during ExecReScanGather we don't recreate the DSM;
instead we just reinitialize the existing DSM
(ExecParallelReinitialize).

> +static bool
> +pbms_is_leader(ParallelBitmapInfo *pbminfo)
>
> I think you should see if you can use Thomas Munro's barrier stuff for
> this instead.

Okay, I will think it over.

> + SerializeSnapshot(estate->es_snapshot, pbminfo->phs_snapshot_data);
> +
> + shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pbminfo);
> +
> + node->parallel_bitmap = pbminfo;
> + snapshot = RestoreSnapshot(pbminfo->phs_snapshot_data);
> +
> + heap_bm_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
>
> This doesn't make any sense.  You serialize the snapshot from the
> estate, then restore it, then shove it into the scan descriptor.  But
> presumably that's already the snapshot the scan descriptor is using.
> The workers need to do this, perhaps, but not the leader!

This is wrong; I will fix it.

> + dht_parameters params = {0};
>
> Not PostgreSQL style.

I will fix it.

> +#define TBM_IS_SHARED(tbm) (tbm)->shared
>
> Seems pointless.

OK.

> + bool shared; /* need to build shared tbm if set*/
>
> Style.

OK.

> + params.tranche_id = LWLockNewTrancheId();
>
> You absolutely, positively cannot burn through tranche IDs like this.
>
> + if (tbm->shared_pagetable)
> + dht_detach(tbm->shared_pagetable);
>
> Hmm, would we leak references if we errored out?

I will check on this part.
Anyway, in my next version I am making the patch independent of DHT,
so this part will be removed.

> @@ -47,7 +47,6 @@ typedef enum
>
>  static List *translate_sub_tlist(List *tlist, int relid);
>
> -
>  /*****************************************************************************
>   * MISC. PATH UTILITIES
>   *****************************************************************************/
>
> Useless whitespace change.
>
> @@ -23,7 +23,6 @@
>  #include "utils/relcache.h"
>  #include "utils/snapshot.h"
>
> -
>  /* "options" flag bits for heap_insert */
>  #define HEAP_INSERT_SKIP_WAL 0x0001
>  #define HEAP_INSERT_SKIP_FSM 0x0002
>
> Useless whitespace change.
>
>   WAIT_EVENT_MQ_RECEIVE,
>   WAIT_EVENT_MQ_SEND,
>   WAIT_EVENT_PARALLEL_FINISH,
> + WAIT_EVENT_PARALLEL_BITMAP_SCAN,
>   WAIT_EVENT_SAFE_SNAPSHOT,
>   WAIT_EVENT_SYNC_REP
>
> Missing a documentation update.

I will fix these in the next version.

> In general, the amount of change in nodeBitmapHeapScan.c seems larger
> than I would have expected.  My copy of that file has 655 lines; this
> patch adds 544 additional lines.  I think/hope that some of that can
> be simplified away.

I will work on this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers