On Wed, Nov 23, 2016 at 7:24 AM, Robert Haas <robertmh...@gmail.com> wrote:
> So, I had a brief look at this tonight.  This is not a full review,
> but just some things I noticed:
Thanks for the review.

> + * Update snpashot info in heap scan descriptor.
>
> Typo.  Also, why should we have a function for this at all?  And if we
> do have a function for this, why should it have "bm" in the name when
> it's stored in heapam.c?

We are updating the snapshot in the HeapScanDesc; that is why I used a
function and kept it in heapam.c, alongside heap_beginscan_bm. But I
agree the name can change: "bm" is not needed, since the function does
nothing specific to bitmap scans. I will change this.

> + * [PARALLEL BITMAP HEAP SCAN ALGORITHM]
> + *
> + * #1. Shared TIDBitmap creation and initialization
> + *    a) First worker to see the state as parallel bitmap info as
> + *       PBM_INITIAL become leader and set the state to PBM_INPROGRESS
> + *       All other workers see the state as PBM_INPROGRESS will wait for
> + *       leader to complete the TIDBitmap.
> + *
> + *    Leader Worker Processing:
> + *    (Leader is responsible for creating shared TIDBitmap and create
> + *    shared page and chunk array from TIDBitmap.)
> + *    1) Create TIDBitmap using DHT.
> + *    2) Begin Iterate: convert hash table into shared page and chunk
> + *       array.
> + *    3) Restore local TIDBitmap variable information into
> + *       ParallelBitmapInfo so that other worker can see those.
> + *    4) set state to PBM_FINISHED.
> + *    5) Wake up other workers.
> + *
> + *    Other Worker Processing:
> + *    1) Wait until leader create shared TIDBitmap and shared page
> + *       and chunk array.
> + *    2) Attach to shared page table, copy TIDBitmap from
> + *       ParallelBitmapInfo to local TIDBitmap, we copy this to local
> + *       TIDBitmap so that next level processing can read information
> + *       same as in non parallel case and we can avoid extra changes
> + *       in code.
> + *
> + *    # At this level TIDBitmap is ready and all workers are awake #
> + *
> + * #2. Bitmap processing (Iterate and process the pages).
> + *    . In this phase each worker will iterate over page and chunk array and
> + *      select heap pages one by one. If prefetch is enable then there will
> + *      be two iterator.
> + *    . Since multiple worker are iterating over same page and chunk array
> + *      we need to have a shared iterator, so we grab a spin lock and iterate
> + *      within a lock.
>
> The formatting of this comment is completely haphazard.

I will fix this.

> "Leader
> worker" is not a term that has any meaning.  A given backend involved
> in a parallel query is either a leader or a worker, not both.

I agree it's confusing, but we can't simply call it "leader", because
IMHO the leader is the backend that actually spawns all the workers
and gathers the results. By "leader worker" I meant the worker that
takes responsibility for building the TIDBitmap, which can be any
worker; that is why I named it "leader worker". Let me think about
what else we could call it.

> + /* reset parallel bitmap scan info, if present */
> + if (node->parallel_bitmap)
> + {
> + ParallelBitmapInfo *pbminfo = node->parallel_bitmap;
> +
> + pbminfo->state = PBM_INITIAL;
> + pbminfo->tbmiterator.schunkbit = 0;
> + pbminfo->tbmiterator.spageptr = 0;
> + pbminfo->tbmiterator.schunkptr = 0;
> + pbminfo->prefetch_iterator.schunkbit = 0;
> + pbminfo->prefetch_iterator.spageptr = 0;
> + pbminfo->prefetch_iterator.schunkptr = 0;
> + pbminfo->prefetch_pages = 0;
> + pbminfo->prefetch_target = -1;
> + }
>
> This is obviously not going to work in the face of concurrent
> activity.  When we did Parallel Seq Scan, we didn't make any changes
> to the rescan code at all.  I think we are assuming that there's no
> way to cause a rescan of the driving table of a parallel query; if
> that's wrong, we'll need some fix, but this isn't it.  I would just
> leave this out.

In heap_rescan we have a similar change:
    if (scan->rs_parallel != NULL)
    {
        parallel_scan = scan->rs_parallel;

        SpinLockAcquire(&parallel_scan->phs_mutex);
        parallel_scan->phs_cblock = parallel_scan->phs_startblock;
        SpinLockRelease(&parallel_scan->phs_mutex);
    }

And this is not for the driving table; it's needed when the Gather is
the inner node of a join. Consider the plan below. I know it's a bad
plan, but we can produce it :)

    NestLoop Join
      Outer Node: SeqScan(t1)
      Inner Node: Gather -> Parallel SeqScan(t2)

The reason is that during ExecReScanGather we don't recreate the DSM;
instead we just reinitialize the existing DSM
(ExecParallelReinitialize).

> +static bool
> +pbms_is_leader(ParallelBitmapInfo *pbminfo)
>
> I think you should see if you can use Thomas Munro's barrier stuff for
> this instead.

Okay, I will think it over.

> + SerializeSnapshot(estate->es_snapshot, pbminfo->phs_snapshot_data);
> +
> + shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pbminfo);
> +
> + node->parallel_bitmap = pbminfo;
> + snapshot = RestoreSnapshot(pbminfo->phs_snapshot_data);
> +
> + heap_bm_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
>
> This doesn't make any sense.  You serialize the snapshot from the
> estate, then restore it, then shove it into the scan descriptor.  But
> presumably that's already the snapshot the scan descriptor is using.
> The workers need to do this, perhaps, but not the leader!

This is wrong; I will fix it.

> + dht_parameters params = {0};
>
> Not PostgreSQL style.

I will fix it.

> +#define TBM_IS_SHARED(tbm) (tbm)->shared
>
> Seems pointless.

OK.

> + bool shared; /* need to build shared tbm if set*/
>
> Style.

OK.

> + params.tranche_id = LWLockNewTrancheId();
>
> You absolutely, positively cannot burn through tranche IDs like this.
>
> + if (tbm->shared_pagetable)
> + dht_detach(tbm->shared_pagetable);
>
> Hmm, would we leak references if we errored out?

I will check on this part.
Anyway, in my next version I am making the patch independent of DHT,
so this part will be removed.

> @@ -47,7 +47,6 @@ typedef enum
>
>  static List *translate_sub_tlist(List *tlist, int relid);
>
> -
>  /*****************************************************************************
>   * MISC. PATH UTILITIES
>   *****************************************************************************/
>
> Useless whitespace change.
>
> @@ -23,7 +23,6 @@
>  #include "utils/relcache.h"
>  #include "utils/snapshot.h"
>
> -
>  /* "options" flag bits for heap_insert */
>  #define HEAP_INSERT_SKIP_WAL 0x0001
>  #define HEAP_INSERT_SKIP_FSM 0x0002
>
> Useless whitespace change.
>
>   WAIT_EVENT_MQ_RECEIVE,
>   WAIT_EVENT_MQ_SEND,
>   WAIT_EVENT_PARALLEL_FINISH,
> + WAIT_EVENT_PARALLEL_BITMAP_SCAN,
>   WAIT_EVENT_SAFE_SNAPSHOT,
>   WAIT_EVENT_SYNC_REP
>
> Missing a documentation update.

I will fix these in the next version.

> In general, the amount of change in nodeBitmapHeapScan.c seems larger
> than I would have expected.  My copy of that file has 655 lines; this
> patch adds 544 additional lines.  I think/hope that some of that can
> be simplified away.

I will work on this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers