Hi, hackers.

I posted a hash index patch in a previous thread:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00794.php
I apologize for the poor readability of that patch, and thank you all for your comments.

Here is a new patch that fixes some bugs in the previous one. I'm posting it to get feedback and further suggestions; any comments are welcome.

Changes since v1:
- fix a crash in _h_spool when testing a big data set
- adjust the target fill-factor calculation in _hash_metapinit
- remove the HASHVALUE_ONLY macro
- replace _create_hash_desc with _get_hash_desc, which returns a hard-coded hash index tuple descriptor
- replace index_getattr with _hash_get_datum to fetch the hash key datum, avoiding repeated calls to _get_hash_desc and index_getattr
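To make the core idea concrete, here is a toy, standalone sketch in plain C (this is not PostgreSQL code, and every name in it is made up for illustration): each index entry stores only the 32-bit hash code of the key, entries on a page are kept sorted by hash code so a lookup can binary-search instead of scanning the whole page, and any match is only a candidate that still has to be rechecked against the real key in the heap.

/*
 * Standalone sketch of the idea behind the patch -- store only a uint32
 * hash code per entry, keep each "page" sorted by hash code, and treat
 * lookups as lossy (the caller must recheck the original key).
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* toy hash function standing in for the index opclass hash proc */
static uint32_t toy_hash(const char *key)
{
    uint32_t h = 2166136261u;               /* FNV-1a */
    for (; *key; key++)
        h = (h ^ (unsigned char) *key) * 16777619u;
    return h;
}

/* a "page" holding only hash codes, kept sorted ascending */
typedef struct
{
    uint32_t hashes[64];
    int      nitems;
} ToyPage;

/* binary search: first offset whose hash code is >= the probe value */
static int toy_binsearch(const ToyPage *page, uint32_t value)
{
    int lower = 0, upper = page->nitems;

    while (upper > lower)
    {
        int mid = (lower + upper) / 2;

        if (page->hashes[mid] < value)
            lower = mid + 1;
        else
            upper = mid;
    }
    return upper;               /* may equal nitems if value is largest */
}

/* insert a key's hash code at its sorted position */
static void toy_insert(ToyPage *page, const char *key)
{
    uint32_t h = toy_hash(key);
    int      pos = toy_binsearch(page, h);

    memmove(&page->hashes[pos + 1], &page->hashes[pos],
            (page->nitems - pos) * sizeof(uint32_t));
    page->hashes[pos] = h;
    page->nitems++;
}

int main(void)
{
    ToyPage     page = {{0}, 0};
    const char *words[] = {"apple", "banana", "cherry"};

    for (int i = 0; i < 3; i++)
        toy_insert(&page, words[i]);

    /* probe: every entry with a matching hash code is only a candidate */
    uint32_t probe = toy_hash("banana");
    for (int pos = toy_binsearch(&page, probe);
         pos < page.nitems && page.hashes[pos] == probe;
         pos++)
        printf("candidate at offset %d, recheck the heap tuple\n", pos);
    return 0;
}

That recheck step is why the patch sets xs_recheck = true in hashgettuple(): equal hash codes no longer guarantee equal keys.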
Here is what I intend to do next.

Todo:
- gather statistics on block access I/O
- write unit tests using pgunitest covering the cases Josh Berkus suggested in this thread
  ( http://archives.postgresql.org/pgsql-hackers/2008-05/msg00535.php ):
  * bulk load, both COPY and INSERT
  * single-row updates, inserts and deletes
  * batch update by key
  * batch update by other index
  * batch delete by key
  * batch delete by other index
  * concurrent index updates (64 connections inserting/deleting concurrently)

I ran the simple test mentioned here
( http://archives.postgresql.org/pgsql-hackers/2007-09/msg00208.php ),
using a word list of 3628800 unique words; the table size is 139MB.
I'll run tests on a bigger data set later.

Index        Build time        Index size
-----------  ----------------  ----------
btree          51961.123 ms         93MB
hash          411069.264 ms       2048MB
hash-patch     36288.931 ms        128MB

dict=# SELECT * from hash-dict where word = '0234567891' ;
    word
------------
 0234567891
(1 row)

Time: 33.960 ms

dict=# SELECT * from btree-dict where word = '0234567891' ;
    word
------------
 0234567891
(1 row)

Time: 1.662 ms

dict=# SELECT * from hash2-dict where word = '0234567891' ;
    word
------------
 0234567891
(1 row)

Time: 1.457 ms

Finally, there is a problem I've run into: I'm confused by the function
_hash_checkqual. IMHO, the index tuple stores only one column here, so
key->sk_attno should always be 1, and scanKeySize should always be 1
since we don't support multi-column hash indexes yet. Am I
misunderstanding something?

/*
 *  _hash_checkqual -- does the index tuple satisfy the scan conditions?
 */
bool
_hash_checkqual(IndexScanDesc scan, IndexTuple itup)
{
    TupleDesc   tupdesc = RelationGetDescr(scan->indexRelation);
    ScanKey     key = scan->keyData;
    int         scanKeySize = scan->numberOfKeys;

    IncrIndexProcessed();

    while (scanKeySize > 0)
    {
        Datum       datum;
        bool        isNull;
        Datum       test;

        datum = index_getattr(itup,
                              key->sk_attno,
                              tupdesc,
                              &isNull);

        /* assume sk_func is strict */
        if (isNull)
            return false;
        if (key->sk_flags & SK_ISNULL)
            return false;

        test = FunctionCall2(&key->sk_func, datum, key->sk_argument);

        if (!DatumGetBool(test))
            return false;

        key++;
        scanKeySize--;
    }

    return true;
}

Hope to hear from you.

-- 
Best Regards,
Xiao Meng

DKERC, Harbin Institute of Technology, China
Gtalk: [EMAIL PROTECTED]
MSN: [EMAIL PROTECTED]
http://xiaomeng.yo2.cn
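P.S. To make the question about _hash_checkqual concrete, here is roughly what I think the function could reduce to, relying on the facts that a hash index has exactly one key column and that, with this patch, the index tuple stores nothing but the uint32 hash code. This is only a sketch against my patched tree (it uses _hash_get_datum and hashso_sk_hash from the patch) and is not part of the attached diff:

/*
 * Sketch only: a single-key, hash-code-only version of _hash_checkqual.
 * Assumes the patched tuple layout; not part of the attached patch.
 */
#include "postgres.h"

#include "access/hash.h"
#include "executor/execdebug.h"

bool
_hash_checkqual(IndexScanDesc scan, IndexTuple itup)
{
    HashScanOpaque so = (HashScanOpaque) scan->opaque;

    /* only one index column is supported, so at most one scan key */
    Assert(scan->numberOfKeys == 1);
    Assert(scan->keyData[0].sk_attno == 1);

    IncrIndexProcessed();

    /* NULL scan keys never match; hash indexes don't store NULLs anyway */
    if (scan->keyData[0].sk_flags & SK_ISNULL)
        return false;

    /*
     * Compare the stored hash code against the hash of the scan key that
     * _hash_first() computed once.  A match is only a candidate: the
     * executor rechecks the real key against the heap tuple, because
     * hashgettuple() now sets xs_recheck = true.
     */
    return so->hashso_sk_hash == DatumGetUInt32(_hash_get_datum(itup));
}

If that reasoning is right, the while loop over scan keys and the sk_attno lookup in the current code are dead weight for hash indexes.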
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6a5c000..140142d 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -129,8 +129,8 @@ hashbuildCallback(Relation index,
     IndexTuple  itup;
 
     /* form an index tuple and point it at the heap tuple */
-    itup = index_form_tuple(RelationGetDescr(index), values, isnull);
-    itup->t_tid = htup->t_self;
+    itup = _hash_form_tuple(index, values, isnull);
+    itup->t_tid = htup->t_self;
 
     /* Hash indexes don't index nulls, see notes in hashinsert */
     if (IndexTupleHasNulls(itup))
@@ -153,8 +153,8 @@ hashbuildCallback(Relation index,
 /*
  *  hashinsert() -- insert an index tuple into a hash table.
  *
- *  Hash on the index tuple's key, find the appropriate location
- *  for the new tuple, and put it there.
+ *  Hash on the heap tuple's key, form an index tuple with the hash code,
+ *  find the appropriate location for the new tuple, and put it there.
  */
 Datum
 hashinsert(PG_FUNCTION_ARGS)
@@ -171,8 +171,8 @@ hashinsert(PG_FUNCTION_ARGS)
     IndexTuple  itup;
 
     /* generate an index tuple */
-    itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
-    itup->t_tid = *ht_ctid;
+    itup = _hash_form_tuple(rel, values, isnull);
+    itup->t_tid = *ht_ctid;
 
     /*
      * If the single index key is null, we don't insert it into the index.
@@ -211,8 +211,8 @@ hashgettuple(PG_FUNCTION_ARGS)
     OffsetNumber offnum;
     bool        res;
 
-    /* Hash indexes are never lossy (at the moment anyway) */
-    scan->xs_recheck = false;
+    /* Hash indexes may be lossy, since we store only the hash code */
+    scan->xs_recheck = true;
 
     /*
      * We hold pin but not lock on current buffer while outside the hash AM.
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 3eb226a..798c6e7 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -44,17 +44,15 @@ _hash_doinsert(Relation rel, IndexTuple itup)
     uint32      hashkey;
     Bucket      bucket;
     Datum       datum;
-    bool        isnull;
 
     /*
-     * Compute the hash key for the item.  We do this first so as not to need
-     * to hold any locks while running the hash function.
+     * Get the hash key for the item.
      */
     if (rel->rd_rel->relnatts != 1)
         elog(ERROR, "hash indexes support only one index key");
-    datum = index_getattr(itup, 1, RelationGetDescr(rel), &isnull);
-    Assert(!isnull);
-    hashkey = _hash_datum2hashkey(rel, datum);
+
+    datum = _hash_get_datum(itup);
+    hashkey = DatumGetUInt32(datum);
 
     /* compute item size too */
     itemsz = IndexTupleDSize(*itup);
@@ -195,13 +193,18 @@ _hash_pgaddtup(Relation rel,
                Size itemsize,
                IndexTuple itup)
 {
-    OffsetNumber itup_off;
-    Page        page;
+    OffsetNumber itup_off;
+    Page        page;
+    Datum       datum;
+    uint32      hashkey;
 
     _hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
     page = BufferGetPage(buf);
 
-    itup_off = OffsetNumberNext(PageGetMaxOffsetNumber(page));
+    datum = _hash_get_datum(itup);
+    hashkey = DatumGetUInt32(datum);
+    itup_off = _hash_binsearch(page, hashkey);
+
     if (PageAddItem(page, (Item) itup, itemsize, itup_off, false, false)
         == InvalidOffsetNumber)
         elog(ERROR, "failed to add index item to \"%s\"",
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index b0b5874..770753a 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -348,11 +348,9 @@ _hash_metapinit(Relation rel, double num_tuples)
      * Determine the target fill factor (in tuples per bucket) for this index.
      * The idea is to make the fill factor correspond to pages about as full
      * as the user-settable fillfactor parameter says.  We can compute it
-     * exactly if the index datatype is fixed-width, but for var-width there's
-     * some guessing involved.
+     * exactly, since the index datatype (i.e. the uint32 hash code) is fixed-width.
      */
-    data_width = get_typavgwidth(RelationGetDescr(rel)->attrs[0]->atttypid,
-                                 RelationGetDescr(rel)->attrs[0]->atttypmod);
+    data_width = 4;
     item_width = MAXALIGN(sizeof(IndexTupleData)) + MAXALIGN(data_width) +
         sizeof(ItemIdData);     /* include the line pointer */
     ffactor = RelationGetTargetPageUsage(rel, HASH_DEFAULT_FILLFACTOR) / item_width;
@@ -785,7 +783,7 @@ _hash_splitbucket(Relation rel,
     OffsetNumber omaxoffnum;
     Page        opage;
     Page        npage;
-    TupleDesc   itupdesc = RelationGetDescr(rel);
+
 
     /*
      * It should be okay to simultaneously write-lock pages from each bucket,
@@ -848,14 +846,11 @@ _hash_splitbucket(Relation rel,
         /*
          * Re-hash the tuple to determine which bucket it now belongs in.
          *
-         * It is annoying to call the hash function while holding locks, but
-         * releasing and relocking the page for each tuple is unappealing too.
+         * We needn't call the hash function, since the hash code is stored in the index tuple.
          */
         itup = (IndexTuple) PageGetItem(opage, PageGetItemId(opage, ooffnum));
-        datum = index_getattr(itup, 1, itupdesc, &null);
-        Assert(!null);
-
-        bucket = _hash_hashkey2bucket(_hash_datum2hashkey(rel, datum),
+        datum = _hash_get_datum(itup);
+        bucket = _hash_hashkey2bucket(DatumGetUInt32(datum),
                                       maxbucket, highmask, lowmask);
 
         if (bucket == nbucket)
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 258526b..e11de76 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -178,6 +178,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
         hashkey = _hash_datum2hashkey_type(rel, cur->sk_argument,
                                            cur->sk_subtype);
 
+    so->hashso_sk_hash = hashkey;
     /*
      * Acquire shared split lock so we can compute the target bucket safely
      * (see README).
@@ -289,82 +290,51 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
      * continue to step through tuples until: 1) we get to the end of the
      * bucket chain or 2) we find a valid tuple.
      */
-    do
-    {
-        switch (dir)
+    for (;;)
+    {
+        if (offnum == InvalidOffsetNumber)
+        {
+            /*
+             * This is the first time we're scanning this particular
+             * page of the bucket, so jump to the right spot via
+             * binary search.
+             */
+            offnum = _hash_binsearch(page, so->hashso_sk_hash);
+        }
+        else
         {
-            case ForwardScanDirection:
-                if (offnum != InvalidOffsetNumber)
-                    offnum = OffsetNumberNext(offnum);  /* move forward */
-                else
-                    offnum = FirstOffsetNumber; /* new page */
-
-                while (offnum > maxoff)
-                {
-                    /*
-                     * either this page is empty (maxoff ==
-                     * InvalidOffsetNumber) or we ran off the end.
-                     */
-                    _hash_readnext(rel, &buf, &page, &opaque);
-                    if (BufferIsValid(buf))
-                    {
-                        maxoff = PageGetMaxOffsetNumber(page);
-                        offnum = FirstOffsetNumber;
-                    }
-                    else
-                    {
-                        /* end of bucket */
-                        maxoff = offnum = InvalidOffsetNumber;
-                        break;  /* exit while */
-                    }
-                }
-                break;
-
-            case BackwardScanDirection:
-                if (offnum != InvalidOffsetNumber)
-                    offnum = OffsetNumberPrev(offnum);  /* move back */
-                else
-                    offnum = maxoff;    /* new page */
-
-                while (offnum < FirstOffsetNumber)
-                {
-                    /*
-                     * either this page is empty (offnum ==
-                     * InvalidOffsetNumber) or we ran off the end.
-                     */
-                    _hash_readprev(rel, &buf, &page, &opaque);
-                    if (BufferIsValid(buf))
-                        maxoff = offnum = PageGetMaxOffsetNumber(page);
-                    else
-                    {
-                        /* end of bucket */
-                        maxoff = offnum = InvalidOffsetNumber;
-                        break;  /* exit while */
-                    }
-                }
-                break;
-
-            default:
-                /* NoMovementScanDirection */
-                /* this should not be reached */
-                break;
+            /* Advance to the next tuple */
+            offnum = OffsetNumberNext(offnum);
         }
 
-        /* we ran off the end of the world without finding a match */
-        if (offnum == InvalidOffsetNumber)
+        if (offnum <= maxoff)
+            itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+        if (offnum <= maxoff && _hash_checkqual(scan, itup))
         {
-            *bufP = so->hashso_curbuf = InvalidBuffer;
-            ItemPointerSetInvalid(current);
-            return false;
+            /* Found a matching tuple */
+            *bufP = so->hashso_curbuf = buf;
+            ItemPointerSet(current, BufferGetBlockNumber(buf), offnum);
+            return true;
         }
-
-        /* get ready to check this tuple */
-        itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
-    } while (!_hash_checkqual(scan, itup));
-
-    /* if we made it to here, we've found a valid tuple */
-    blkno = BufferGetBlockNumber(buf);
-    *bufP = so->hashso_curbuf = buf;
-    ItemPointerSet(current, blkno, offnum);
-    return true;
+        else
+        {
+            /* No more matches on this page, so go on to next page */
+            if (ScanDirectionIsForward(dir))
+                _hash_readnext(rel, &buf, &page, &opaque);
+            else
+                _hash_readprev(rel, &buf, &page, &opaque);
+
+            if (BufferIsValid(buf))
+            {
+                maxoff = PageGetMaxOffsetNumber(page);
+                offnum = InvalidOffsetNumber;
+            }
+            else
+            {
+                /* Ran out of pages, so we're done */
+                *bufP = so->hashso_curbuf = InvalidBuffer;
+                ItemPointerSetInvalid(current);
+                return false;
+            }
+        }
+    }
 }
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 41e2eef..31396d1 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -20,34 +20,38 @@
 #include "executor/execdebug.h"
 #include "storage/bufmgr.h"
 #include "utils/lsyscache.h"
-
-
+#include "utils/memutils.h"
+#include "catalog/pg_type.h"
+/*
+ * hard-coded tuple descriptor; see include/access/hash.h
+ */
+static FormData_pg_attribute Desc_hash[1] = {Schema_Hash};
 /*
  *  _hash_checkqual -- does the index tuple satisfy the scan conditions?
  */
 bool
 _hash_checkqual(IndexScanDesc scan, IndexTuple itup)
 {
-    TupleDesc   tupdesc = RelationGetDescr(scan->indexRelation);
     ScanKey     key = scan->keyData;
     int         scanKeySize = scan->numberOfKeys;
+    Datum       datum;
+    HashScanOpaque so = scan->opaque;
 
     IncrIndexProcessed();
 
+    datum = _hash_get_datum(itup);
+    if (so->hashso_sk_hash != DatumGetUInt32(datum))
+        return false;
+    key++;
+    scanKeySize--;
+
     while (scanKeySize > 0)
     {
-        Datum       datum;
-        bool        isNull;
         Datum       test;
-
-        datum = index_getattr(itup,
-                              key->sk_attno,
-                              tupdesc,
-                              &isNull);
+
+        datum = _hash_get_datum(itup);
 
         /* assume sk_func is strict */
-        if (isNull)
-            return false;
         if (key->sk_flags & SK_ISNULL)
             return false;
@@ -222,3 +226,100 @@ hashoptions(PG_FUNCTION_ARGS)
         PG_RETURN_BYTEA_P(result);
     PG_RETURN_NULL();
 }
+
+/*
+ * _get_hash_desc - get the hash index tuple descriptor
+ *
+ * The hash index tuple descriptor is a hard-coded tuple descriptor with a
+ * single int32 attribute.
+ */
+TupleDesc _get_hash_desc(void)
+{
+    static TupleDesc hashdesc = NULL;
+
+    /* Already done? */
+    if (hashdesc == NULL)
+    {
+        /*
+         * Build the hard-coded tuple descriptor, the same way
+         * BuildHardcodedDescriptor() in relcache.c does.
+         */
+        MemoryContext oldcxt;
+
+        oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+
+        hashdesc = CreateTemplateTupleDesc(1, false);
+        hashdesc->tdtypeid = INT4OID;
+        hashdesc->tdtypmod = -1;
+        memcpy(hashdesc->attrs[0], &Desc_hash[0], ATTRIBUTE_TUPLE_SIZE);
+        hashdesc->attrs[0]->attcacheoff = 0;
+
+        MemoryContextSwitchTo(oldcxt);
+    }
+
+    return hashdesc;
+}
+
+/*
+ * _hash_get_datum - get the hash index tuple's first column datum
+ * (i.e. the hash code)
+ */
+Datum _hash_get_datum(IndexTuple itup)
+{
+    return fetch_att((char *) itup + IndexInfoFindDataOffset(itup->t_info),
+                     true,
+                     4);
+}
+
+/*
+ * _hash_form_tuple - form an index tuple containing only the hash code
+ */
+IndexTuple _hash_form_tuple(Relation rel, Datum *values, bool *isnull)
+{
+    TupleDesc   hashdesc;
+    IndexTuple  itup;
+    uint32      hashkey;
+
+    hashdesc = _get_hash_desc();
+    hashkey = _hash_datum2hashkey(rel, values[0]);
+    values[0] = UInt32GetDatum(hashkey);
+    itup = index_form_tuple(hashdesc, values, isnull);
+    return itup;
+}
+
+/*
+ * _hash_binsearch - return the offset number in the page where the given
+ * hash value should be located.
+ *
+ * The return value might exceed the page's max offset
+ * if the hash value is greater than any hash in the page.
+ */
+OffsetNumber
+_hash_binsearch(Page page, uint32 hash_value)
+{
+    OffsetNumber upper;
+    OffsetNumber lower;
+
+    upper = PageGetMaxOffsetNumber(page) + 1;
+    lower = FirstOffsetNumber;
+
+    while (upper > lower)
+    {
+        IndexTuple  pos;
+        OffsetNumber off;
+        uint32      hashkey;
+        Datum       datum;
+
+        off = (upper + lower) / 2;
+        Assert(OffsetNumberIsValid(off));
+
+        pos = (IndexTuple) PageGetItem(page, PageGetItemId(page, off));
+        datum = _hash_get_datum(pos);
+        hashkey = DatumGetUInt32(datum);
+        if (hashkey < hash_value)
+            lower = off + 1;
+        else
+            upper = off;
+    }
+
+    return upper;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3dda4ba..046d327 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -450,10 +450,13 @@ static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
                       Tuplesortstate *state);
 static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup);
+static void copytup_index_hash(Tuplesortstate *state, SortTuple *stup, void *tup);
 static void writetup_index(Tuplesortstate *state, int tapenum,
                SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
               int tapenum, unsigned int len);
+static void readtup_index_hash(Tuplesortstate *state, SortTuple *stup,
+              int tapenum, unsigned int len);
 static void reversedirection_index_btree(Tuplesortstate *state);
 static void reversedirection_index_hash(Tuplesortstate *state);
 static int comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -672,9 +675,9 @@ tuplesort_begin_index_hash(Relation indexRel,
     state->nKeys = 1;           /* Only one sort column, the hash code */
 
     state->comparetup = comparetup_index_hash;
-    state->copytup = copytup_index;
+    state->copytup = copytup_index_hash;
     state->writetup = writetup_index;
-    state->readtup = readtup_index;
+    state->readtup = readtup_index_hash;
     state->reversedirection = reversedirection_index_hash;
 
     state->indexRel = indexRel;
@@ -2805,9 +2808,7 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
                       Tuplesortstate *state)
 {
     /*
-     * It's slightly annoying to redo the hash function each time, although
-     * most hash functions ought to be cheap.  Is it worth having a variant
-     * tuple storage format so we can store the hash code?
+     * We needn't redo the hash function each time, since the hash code is already stored.
      */
     uint32      hash1;
     uint32      hash2;
@@ -2819,10 +2820,10 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
 
     /* Compute hash codes and mask off bits we don't want to sort by */
     Assert(!a->isnull1);
-    hash1 = DatumGetUInt32(FunctionCall1(state->hash_proc, a->datum1))
+    hash1 = DatumGetUInt32(a->datum1)
         & state->hash_mask;
     Assert(!b->isnull1);
-    hash2 = DatumGetUInt32(FunctionCall1(state->hash_proc, b->datum1))
+    hash2 = DatumGetUInt32(b->datum1)
         & state->hash_mask;
 
     if (hash1 > hash2)
@@ -2877,6 +2878,22 @@ copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
 }
 
 static void
+copytup_index_hash(Tuplesortstate *state, SortTuple *stup, void *tup)
+{
+    IndexTuple  tuple = (IndexTuple) tup;
+    unsigned int tuplen = IndexTupleSize(tuple);
+    IndexTuple  newtuple;
+
+    /* copy the tuple into sort storage */
+    newtuple = (IndexTuple) palloc(tuplen);
+    memcpy(newtuple, tuple, tuplen);
+    USEMEM(state, GetMemoryChunkSpace(newtuple));
+    stup->tuple = (void *) newtuple;
+    /* set up first-column key value (i.e. the hash code) */
+    stup->datum1 = _hash_get_datum(newtuple);
+    stup->isnull1 = false;
+}
+static void
 writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 {
     IndexTuple  tuple = (IndexTuple) stup->tuple;
@@ -2919,6 +2936,27 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 }
 
 static void
+readtup_index_hash(Tuplesortstate *state, SortTuple *stup,
+                   int tapenum, unsigned int len)
+{
+    unsigned int tuplen = len - sizeof(unsigned int);
+    IndexTuple  tuple = (IndexTuple) palloc(tuplen);
+
+    USEMEM(state, GetMemoryChunkSpace(tuple));
+    if (LogicalTapeRead(state->tapeset, tapenum, (void *) tuple,
+                        tuplen) != tuplen)
+        elog(ERROR, "unexpected end of data");
+    if (state->randomAccess)    /* need trailing length word? */
+        if (LogicalTapeRead(state->tapeset, tapenum, (void *) &tuplen,
+                            sizeof(tuplen)) != sizeof(tuplen))
+            elog(ERROR, "unexpected end of data");
+    stup->tuple = (void *) tuple;
+    /* set up first-column key value (i.e. the hash code) */
+    stup->datum1 = _hash_get_datum(tuple);
+    stup->isnull1 = false;
+}
+
+static void
 reversedirection_index_btree(Tuplesortstate *state)
 {
     ScanKey     scanKey = state->indexScanKey;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index ab0824d..74d984a 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -100,6 +100,8 @@ typedef struct HashScanOpaqueData
     /* Current and marked position of the scan */
     ItemPointerData hashso_curpos;
     ItemPointerData hashso_mrkpos;
+    /* Hash value of the scan key */
+    uint32      hashso_sk_hash;
 } HashScanOpaqueData;
 
 typedef HashScanOpaqueData *HashScanOpaque;
@@ -227,7 +229,11 @@ typedef HashMetaPageData *HashMetaPage;
  */
 #define HASHPROC        1
 
-
+/*
+ * hard-coded hash attribute descriptor
+ */
+#define Schema_Hash \
+{ 0, {"hashcode"}, 23, -1, 4, 1, 0, -1, -1, true, 'p', 'i', false, false, false, true, 0 }
 
 /* public routines */
 
 extern Datum hashbuild(PG_FUNCTION_ARGS);
@@ -330,6 +336,10 @@ extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
                        uint32 highmask, uint32 lowmask);
 extern uint32 _hash_log2(uint32 num);
 extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
+extern TupleDesc _get_hash_desc(void);
+extern Datum _hash_get_datum(IndexTuple itup);
+extern IndexTuple _hash_form_tuple(Relation rel, Datum *values, bool *isnull);
+extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 
 /* hash.c */
 extern void hash_redo(XLogRecPtr lsn, XLogRecord *record);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers