Re: BRIN indexes (was Re: [HACKERS] Minmax indexes)
On 08/09/14 13:02, Alvaro Herrera wrote:
> Here's version 18. I have renamed it: these are now BRIN indexes.
>
> I have fixed numerous race conditions and deadlocks. In particular I
> fixed this problem you noted:
>
> Heikki Linnakangas wrote:
>> Another race condition:
>>
>> If a new tuple is inserted to the range while summarization runs,
>> it's possible that the new tuple isn't included in the tuple that
>> the summarization calculated, nor does the insertion itself update
>> it.
>
> I did it mostly in the way you outlined, i.e. by way of a placeholder
> tuple that gets updated by concurrent inserters; the tuple resulting
> from the scan is then unioned with the values in the updated
> placeholder tuple. This required the introduction of one extra support
> proc for opclasses (pretty simple stuff anyhow).
>
> There should be only minor items left now, such as silencing the
>
> WARNING: concurrent insert in progress within table "sales"
>
> which is emitted by IndexBuildHeapScan (possibly thousands of times)
> when doing a summarization of a range being inserted into or otherwise
> modified. Basically the issue here is that IBHS assumes it's being run
> with ShareLock on the heap (which blocks inserts), but here we're using
> it with ShareUpdateExclusive only, which lets inserts in. There is no
> harm AFAICS because of the placeholder tuple stuff I describe above.

While debugging a VACUUM VERBOSE ANALYZE on a table being concurrently
updated/inserted into, I hit this breakpoint:

(gdb)
Breakpoint 1, errfinish (dummy=0) at elog.c:411
411		ErrorData  *edata = &errordata[errordata_stack_depth];

The complete backtrace is at http://pastebin.com/gkigSNm7

Also, I found pages with an unknown type (using default parameters for
the index creation):

 brin_page_type | array_agg
----------------+-----------
 unknown (00)   | {3,4}
 revmap         | {1}
 regular        | {2}
 meta           | {0}
(4 rows)

--
Emanuel Calvo
@3manuek
Re: BRIN indexes (was Re: [HACKERS] Minmax indexes)
On 09/08/2014 07:02 PM, Alvaro Herrera wrote:
> Here's version 18. I have renamed it: these are now BRIN indexes.
>
> I have fixed numerous race conditions and deadlocks. In particular I
> fixed this problem you noted:
>
> Heikki Linnakangas wrote:
>> Another race condition:
>>
>> If a new tuple is inserted to the range while summarization runs,
>> it's possible that the new tuple isn't included in the tuple that
>> the summarization calculated, nor does the insertion itself update
>> it.
>
> I did it mostly in the way you outlined, i.e. by way of a placeholder
> tuple that gets updated by concurrent inserters and then the tuple
> resulting from the scan is unioned with the values in the updated
> placeholder tuple. This required the introduction of one extra support
> proc for opclasses (pretty simple stuff anyhow).

Hmm. So the union support proc is only called if there is a race
condition? That makes it very difficult to test, I'm afraid.

It would make more sense to pass BrinValues to the support functions,
rather than DeformedBrTuple. An opclass's support function should never
need to access the values for other columns.

Does minmaxUnion handle NULLs correctly?

minmaxUnion pfrees the old values. Is that necessary? What memory
context does the function run in? If the code runs in a short-lived
memory context, you might as well just let them leak. If it runs in a
long-lived context, well, perhaps it shouldn't. It's nicer to write
functions that can leak freely. IIRC, GiST and GIN run the support
functions in a temporary context. In any case, it might be worth noting
explicitly in the docs which functions may leak and which may not.

If you add a new datatype, and define b-tree operators for it, what is
required to create a minmax opclass for it? Would it be possible to
generalize the functions in brin_minmax.c so that they can be reused
for any datatype (with b-tree operators) without writing any new C
code? I think we're almost there; the only thing that differs between
each data type is the opcinfo function. Let's pass the type OID as an
argument to the opcinfo function. You could then have just a single
minmax_opcinfo function, instead of the macro to generate a separate
function for each built-in datatype.

In general, this patch is in pretty good shape now, thanks!

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
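To make the union semantics under discussion concrete -- in particular
the NULL handling Heikki asks about -- here is a minimal sketch. All
names (MinmaxSummary, the cmp callback) are hypothetical stand-ins, not
the patch's actual types; only the merge rules are implied by the thread.

	#include "postgres.h"

	/* hypothetical per-column summary for one block range */
	typedef struct MinmaxSummary
	{
		bool	hasnulls;	/* range contains NULL values */
		bool	allnulls;	/* no non-NULL values summarized yet */
		Datum	min;
		Datum	max;
	} MinmaxSummary;

	/* merge summary B into summary A (e.g. placeholder into scan result) */
	static void
	minmax_union_sketch(MinmaxSummary *a, MinmaxSummary *b,
						int (*cmp) (Datum, Datum))
	{
		/* NULLs seen by either side must survive into the result */
		a->hasnulls |= b->hasnulls;

		if (b->allnulls)
			return;				/* B carries no min/max to merge */

		if (a->allnulls)
		{
			/* A had no values yet: adopt B's interval wholesale */
			a->allnulls = false;
			a->min = b->min;
			a->max = b->max;
			return;
		}

		/* both sides have values: widen A's interval to cover B's */
		if (cmp(b->min, a->min) < 0)
			a->min = b->min;
		if (cmp(b->max, a->max) > 0)
			a->max = b->max;
	}

The allnulls/hasnulls distinction is the part that is easy to get wrong:
a placeholder that saw only NULLs must not contribute a bogus min/max,
but its hasnulls flag still has to propagate.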
Re: [HACKERS] Minmax indexes
Alvaro Herrera wrote:
> So here's v16, rebased on top of 9bac66020. As far as I am concerned,
> this is the last version before I start renaming everything to BRIN and
> then commit.

FWIW in case you or others have interest, here's the diff between your
patch and v16. Also, for illustrative purposes, the diff between your
version and mine of the code that got moved to mmpageops.c, because it's
difficult to see it from the partial patch. (There's nothing to do with
that partial diff other than read it directly.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

*** a/contrib/pageinspect/mmfuncs.c
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 29,35 ****
  PG_FUNCTION_INFO_V1(minmax_page_type);
  PG_FUNCTION_INFO_V1(minmax_page_items);
  PG_FUNCTION_INFO_V1(minmax_metapage_info);
- PG_FUNCTION_INFO_V1(minmax_revmap_array_data);
  PG_FUNCTION_INFO_V1(minmax_revmap_data);

  typedef struct mm_column_state
--- 29,34 ----
*************** minmax_revmap_data(PG_FUNCTION_ARGS)
*** 388,394 ****
  	values[0] = Int64GetDatum((uint64) 0);

  	/* Extract (possibly empty) list of TIDs in this page. */
! 	for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
  	{
  		ItemPointer tid;
--- 387,393 ----
  	values[0] = Int64GetDatum((uint64) 0);

  	/* Extract (possibly empty) list of TIDs in this page. */
! 	for (i = 0; i < REVMAP_PAGE_MAXITEMS; i++)
  	{
  		ItemPointer tid;
*** a/src/backend/access/minmax/Makefile
--- b/src/backend/access/minmax/Makefile
***************
*** 12,17 ****
  subdir = src/backend/access/minmax
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o

  include $(top_srcdir)/src/backend/common.mk
--- 12,17 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = minmax.o mmpageops.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o

  include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/minmax/minmax.c
--- b/src/backend/access/minmax/minmax.c
***************
*** 15,45 ****
   */
  #include "postgres.h"

- #include "access/htup_details.h"
  #include "access/minmax.h"
  #include "access/minmax_internal.h"
  #include "access/minmax_page.h"
! #include "access/minmax_revmap.h"
! #include "access/minmax_tuple.h"
  #include "access/minmax_xlog.h"
  #include "access/reloptions.h"
  #include "access/relscan.h"
- #include "access/xlogutils.h"
  #include "catalog/index.h"
- #include "catalog/pg_operator.h"
- #include "commands/vacuum.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
- #include "storage/indexfsm.h"
- #include "storage/lmgr.h"
- #include "storage/smgr.h"
- #include "utils/datum.h"
- #include "utils/lsyscache.h"
- #include "utils/memutils.h"
  #include "utils/rel.h"
- #include "utils/syscache.h"

  /*
--- 15,33 ----
   */
  #include "postgres.h"

  #include "access/minmax.h"
  #include "access/minmax_internal.h"
  #include "access/minmax_page.h"
! #include "access/minmax_pageops.h"
  #include "access/minmax_xlog.h"
  #include "access/reloptions.h"
  #include "access/relscan.h"
  #include "catalog/index.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
  #include "utils/rel.h"

  /*
***************
*** 75,93 ****
  static MMBuildState *initialize_mm_buildstate(Relation idxRel,
  static bool terminate_mm_buildstate(MMBuildState *state);
  static void summarize_range(MMBuildState *mmstate, Relation heapRel,
  				BlockNumber heapBlk);
- static bool mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
- 			mmRevmapAccess *rmAccess, BlockNumber heapBlk,
- 			Buffer oldbuf, OffsetNumber oldoff,
- 			const MMTuple *origtup, Size origsz,
- 			const MMTuple *newtup, Size newsz,
- 			bool samepage, bool *extended);
- static void mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
- 			mmRevmapAccess *rmAccess, Buffer *buffer, BlockNumber heapblkno,
- 			MMTuple *tup, Size itemsz, bool *extended);
- static Buffer mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
- 			bool *extended);
  static void form_and_insert_tuple(MMBuildState *mmstate);
- static Size mm_page_get_freespace(Page page);

  /*
--- 63,69 ----
*************** mminsert(PG_FUNCTION_ARGS)
*** 123,128 ****
--- 99,105 ----
  	rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);

  restart:
+ 	CHECK_FOR_INTERRUPTS();
  	heapBlk = ItemPointerGetBlockNumber(heaptid);
  	/* normalize the block number to be the first block in the range */
  	heapBlk = (heapBlk / pagesPerRange) * pagesPerRange;
***************
*** 155,161 ****
  	addValue = index_getprocinfo(idxRel, keyno + 1,
  								 MINMAX_PROCNUM_ADDVALUE);
- 
  	result = FunctionCall5Coll(addValue,
  							   idxRel->rd_indcollation[keyno],
  							   PointerGetDatum(mmdesc),
--- 132,137
Re: [HACKERS] Minmax indexes
Fujii Masao wrote:
> I've not read the patch yet. But while testing the feature, I found that
>
> * Brin index cannot be created on CHAR(n) column.
>   Maybe other data types have the same problem.

Yeah, it's just a matter of adding an opclass for it -- pretty simple
stuff really, because you don't need to write any code, just add a bunch
of catalog entries and an OPCINFO line in mmsortable.c. Right now there
are opclasses for the following types:

int4
numeric
text
date
timestamp with time zone
timestamp
time with time zone
time
"char"

We can eventually extend to cover all types that have btree opclasses,
but we can do that in a separate commit. I'm also considering removing
the opclass for time with time zone, as it's a pretty useless type. I
mostly added the ones that are there as a way to test that the code
behaved reasonably in the various cases (pass-by-value vs. not, variable
width vs. fixed, different alignment requirements).

Of course, the really interesting part is adding a completely different
opclass, such as one that stores bounding boxes.

> * FILLFACTOR cannot be set in brin index.

I hadn't added this one because I didn't think there was much point
previously, but I think it might now be useful to allow same-page
updates.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
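The reason no new C code is needed ties into Heikki's review above:
once the type OID is passed in, one generic opcinfo function can serve
every sortable type. A hedged sketch follows; MinmaxOpcInfo and the
function name are hypothetical stand-ins for the patch's real opcinfo
machinery.

	#include "postgres.h"

	/* hypothetical stand-in for the patch's opcinfo struct */
	typedef struct MinmaxOpcInfo
	{
		Oid		typid;		/* type whose values this opclass summarizes */
		int		nstored;	/* datums stored per column: min and max */
	} MinmaxOpcInfo;

	static MinmaxOpcInfo *
	minmax_opcinfo_sketch(Oid typid)
	{
		MinmaxOpcInfo *info = (MinmaxOpcInfo *) palloc(sizeof(MinmaxOpcInfo));

		/*
		 * Nothing here depends on the particular type: the comparison
		 * functions are looked up later through the type's btree
		 * opclass, so supporting CHAR(n) (or any other btree-sortable
		 * type) reduces to adding catalog entries.
		 */
		info->typid = typid;
		info->nstored = 2;		/* sortable opclasses keep a min and a max */
		return info;
	}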
Re: [HACKERS] Minmax indexes
On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>>> I'm sure this still needs some cleanup, but here's the patch, based
>>> on your v14. Now that I know what this approach looks like, I still
>>> like it much better. The insert and update code is somewhat more
>>> complicated, because you have to be careful to lock the old page,
>>> new page, and revmap page in the right order. But it's not too bad,
>>> and it gets rid of all the complexity in vacuum.
>>
>> It seems there is some issue here, because pageinspect tells me the
>> index is not growing properly for some reason. minmax_revmap_data
>> gives me this array of TIDs after a bunch of insert/vacuum/delete/
>> etc:
>
> I fixed this issue, and did a lot more rework and bugfixing. Here's
> v15, based on v14-heikki2.

So, the other design change I've been advocating is to store the revmap
in the first N blocks, instead of having the two-level structure with
array pages and revmap pages. Attached is a patch for that, to be
applied after v15. When the revmap needs to be expanded, all the tuples
on it are moved elsewhere one-by-one. That adds some latency to the
unfortunate guy who needs to do that, but as the patch stands, the
revmap is only ever extended by VACUUM or CREATE INDEX, so I think
that's fine.

Like with my previous patch, the point is to demonstrate how much
simpler the code becomes this way; I'm sure there are bugs and cleanup
still necessary.

PS. Spotted one oversight in patch v15: callers of mm_doupdate must
check the return value, and retry the operation if it returns false.

- Heikki

commit ce4df0e9dbd43f7e3d4fdf3f7920301f81f17d63
Author: Heikki Linnakangas
Date:   Fri Aug 15 18:32:19 2014 +0300

    Get rid of array pages. Instead, move all tuples out of the way

diff --git a/contrib/pageinspect/mmfuncs.c b/contrib/pageinspect/mmfuncs.c
index 6cd559a..51cc9e2 100644
--- a/contrib/pageinspect/mmfuncs.c
+++ b/contrib/pageinspect/mmfuncs.c
@@ -74,9 +74,6 @@ minmax_page_type(PG_FUNCTION_ARGS)
 		case MINMAX_PAGETYPE_META:
 			type = "meta";
 			break;
-		case MINMAX_PAGETYPE_REVMAP_ARRAY:
-			type = "revmap array";
-			break;
 		case MINMAX_PAGETYPE_REVMAP:
 			type = "revmap";
 			break;
@@ -343,11 +340,9 @@ minmax_metapage_info(PG_FUNCTION_ARGS)
 	Page		page;
 	MinmaxMetaPageData *meta;
 	TupleDesc	tupdesc;
-	Datum		values[3];
-	bool		nulls[3];
-	ArrayBuildState *astate = NULL;
+	Datum		values[4];
+	bool		nulls[4];
 	HeapTuple	htup;
-	int			i;
 
 	page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_META, "metapage");
 
@@ -361,22 +356,8 @@ minmax_metapage_info(PG_FUNCTION_ARGS)
 	MemSet(nulls, 0, sizeof(nulls));
 	values[0] = CStringGetTextDatum(psprintf("0x%08X", meta->minmaxMagic));
 	values[1] = Int32GetDatum(meta->minmaxVersion);
-
-	/* Extract (possibly empty) list of revmap array page numbers. */
-	for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
-	{
-		BlockNumber blkno;
-
-		blkno = meta->revmapArrayPages[i];
-		if (blkno == InvalidBlockNumber)
-			break;		/* XXX or continue? */
-		astate = accumArrayResult(astate, Int64GetDatum((int64) blkno),
-								  false, INT8OID, CurrentMemoryContext);
-	}
-	if (astate == NULL)
-		nulls[2] = true;
-	else
-		values[2] = makeArrayResult(astate, CurrentMemoryContext);
+	values[2] = Int32GetDatum(meta->pagesPerRange);
+	values[3] = Int64GetDatum(meta->lastRevmapPage);
 
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
@@ -384,34 +365,6 @@ minmax_metapage_info(PG_FUNCTION_ARGS)
 }
 
 /*
- * Return the BlockNumber array stored in a revmap array page
- */
-Datum
-minmax_revmap_array_data(PG_FUNCTION_ARGS)
-{
-	bytea	   *raw_page = PG_GETARG_BYTEA_P(0);
-	Page		page;
-	ArrayBuildState *astate = NULL;
-	RevmapArrayContents *contents;
-	Datum		blkarr;
-	int			i;
-
-	page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP_ARRAY,
-							  "revmap array");
-
-	contents = (RevmapArrayContents *) PageGetContents(page);
-
-	for (i = 0; i < contents->rma_nblocks; i++)
-		astate = accumArrayResult(astate,
-								  Int64GetDatum((int64) contents->rma_blocks[i]),
-								  false, INT8OID, CurrentMemoryContext);
-	Assert(astate != NULL);
-
-	blkarr = makeArrayResult(astate, CurrentMemoryContext);
-	PG_RETURN_DATUM(blkarr);
-}
-
-/*
  * Return the TID array stored in a minmax revmap page
  */
 Datum
@@ -437,7 +390,7 @@ minmax_revmap_data(PG_FUNCTION_ARGS)
 	/* Extract values from the revmap page */
 	contents = (RevmapContents *) PageGetContents(page);
 	MemSet(nulls, 0, sizeof(nulls));
-	values[0] = Int64GetDatum((uint64) contents->rmr_logblk);
+	values[0] = Int64GetDatum((uint64) 0);
 
 	/* Extract (possibly empty) list of TIDs in this page. */
 	for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
diff --git a/contrib/pageinspect/pageinspect--1.2.sql b/contrib/pageinspect/pageinspect--1.2.sql
index 56c9ba8..cba90ca 100644
--- a/contrib/pageinspect/pageinspect--1.2.sql
+++ b/contrib/pageinspect/pageinspect--1.2.sql
@@ -110,7 +110,7 @@ LANGUAGE C STRICT;
 -- minmax_metapage_info()
Re: [HACKERS] Minmax indexes
On Fri, Aug 15, 2014 at 8:02 AM, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>
>>> I'm sure this still needs some cleanup, but here's the patch, based
>>> on your v14. Now that I know what this approach looks like, I still
>>> like it much better. The insert and update code is somewhat more
>>> complicated, because you have to be careful to lock the old page,
>>> new page, and revmap page in the right order. But it's not too bad,
>>> and it gets rid of all the complexity in vacuum.
>>
>> It seems there is some issue here, because pageinspect tells me the
>> index is not growing properly for some reason. minmax_revmap_data gives
>> me this array of TIDs after a bunch of insert/vacuum/delete/ etc:
>
> I fixed this issue, and did a lot more rework and bugfixing. Here's
> v15, based on v14-heikki2.

I've not read the patch yet. But while testing the feature, I found that

* Brin index cannot be created on CHAR(n) column.
  Maybe other data types have the same problem.

* FILLFACTOR cannot be set in brin index.

Are these intentional?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 08/15/2014 10:26 AM, Heikki Linnakangas wrote:
> On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
>> Alvaro Herrera wrote:
>>> Heikki Linnakangas wrote:
>>>> I'm sure this still needs some cleanup, but here's the patch, based
>>>> on your v14. Now that I know what this approach looks like, I still
>>>> like it much better. The insert and update code is somewhat more
>>>> complicated, because you have to be careful to lock the old page,
>>>> new page, and revmap page in the right order. But it's not too bad,
>>>> and it gets rid of all the complexity in vacuum.
>>>
>>> It seems there is some issue here, because pageinspect tells me the
>>> index is not growing properly for some reason. minmax_revmap_data
>>> gives me this array of TIDs after a bunch of insert/vacuum/delete/
>>> etc:
>>
>> I fixed this issue, and did a lot more rework and bugfixing. Here's
>> v15, based on v14-heikki2.
>
> Thanks! I think remaining issues are mostly minimal (pageinspect
> should output block number alongside each tuple, now that we have it,
> for example.)
>
> There's this one issue I left in my patch version that I think we
> should do something about:
>
> +	/*
> +	 * No luck. Assume that the revmap was updated concurrently.
> +	 *
> +	 * XXX: it would be nice to add some kind of a sanity check here to
> +	 * avoid looping infinitely, if the revmap points to wrong tuple for
> +	 * some reason.
> +	 */
>
> This happens when we follow the revmap to a tuple, but find that the
> tuple points to a different block than what the revmap claimed.
> Currently, we just assume that it's because the tuple was updated
> concurrently, but while hacking, I frequently had a broken index where
> the revmap pointed to bogus tuples or the tuples had a missing/wrong
> block number on them, and ran into an infinite loop here. It's clearly
> a case of a corrupt index and shouldn't happen, but I would imagine
> that it's a fairly typical way this would fail in production too,
> because of hardware issues or bugs. So I think we need to work a bit
> harder to stop the looping and throw an error instead.
>
> Perhaps something as simple as keeping a loop counter and giving up
> after 1000 attempts would be good enough. The window between releasing
> the lock on the revmap, and acquiring the lock on the page containing
> the MMTuple is very narrow, so the chances of losing that race to a
> concurrent update more than 1-2 times in a row is vanishingly small.

Reading the patch more closely, I see that you added a check that when
we loop, we throw an error if the new item pointer in the revmap is the
same as before. In theory, it's possible that two concurrent updates
happen: one that moves the tuple we're looking for elsewhere, and
another that moves it back again. The probability of that is also
vanishingly small, so maybe that's OK. Or we could check the LSN; if
the revmap has been updated, its LSN must've changed.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
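For illustration, here is what the bounded-retry idea could look like.
This is a sketch only, not the patch's code: MMTuple,
mm_fetch_from_revmap() and MMTupleGetHeapBlock() are hypothetical
stand-ins; the 1000-attempt cap and the corruption error are the parts
taken from the discussion.

	#include "postgres.h"
	#include "utils/rel.h"

	/* illustrative stand-ins for the patch's real types and helpers */
	typedef struct MMTuple MMTuple;
	extern MMTuple *mm_fetch_from_revmap(Relation idxRel, BlockNumber heapBlk);
	extern BlockNumber MMTupleGetHeapBlock(MMTuple *tup);

	#define MAX_REVMAP_RETRIES	1000

	static MMTuple *
	fetch_mmtuple_checked(Relation idxRel, BlockNumber heapBlk)
	{
		int			retries;

		for (retries = 0; retries < MAX_REVMAP_RETRIES; retries++)
		{
			MMTuple    *tup = mm_fetch_from_revmap(idxRel, heapBlk);

			/* matching block number: we won the (narrow) race, if any */
			if (tup != NULL && MMTupleGetHeapBlock(tup) == heapBlk)
				return tup;
		}

		/* losing the race this many times in a row means a corrupt index */
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("revmap does not point to a valid tuple for block %u",
						heapBlk)));
		return NULL;			/* keep compiler quiet */
	}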
Re: [HACKERS] Minmax indexes
On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>>> I'm sure this still needs some cleanup, but here's the patch, based
>>> on your v14. Now that I know what this approach looks like, I still
>>> like it much better. The insert and update code is somewhat more
>>> complicated, because you have to be careful to lock the old page,
>>> new page, and revmap page in the right order. But it's not too bad,
>>> and it gets rid of all the complexity in vacuum.
>>
>> It seems there is some issue here, because pageinspect tells me the
>> index is not growing properly for some reason. minmax_revmap_data
>> gives me this array of TIDs after a bunch of insert/vacuum/delete/
>> etc:
>
> I fixed this issue, and did a lot more rework and bugfixing. Here's
> v15, based on v14-heikki2.

Thanks! I think remaining issues are mostly minimal (pageinspect should
output block number alongside each tuple, now that we have it, for
example.)

There's this one issue I left in my patch version that I think we
should do something about:

+	/*
+	 * No luck. Assume that the revmap was updated concurrently.
+	 *
+	 * XXX: it would be nice to add some kind of a sanity check here to
+	 * avoid looping infinitely, if the revmap points to wrong tuple for
+	 * some reason.
+	 */

This happens when we follow the revmap to a tuple, but find that the
tuple points to a different block than what the revmap claimed.
Currently, we just assume that it's because the tuple was updated
concurrently, but while hacking, I frequently had a broken index where
the revmap pointed to bogus tuples or the tuples had a missing/wrong
block number on them, and ran into an infinite loop here. It's clearly
a case of a corrupt index and shouldn't happen, but I would imagine
that it's a fairly typical way this would fail in production too,
because of hardware issues or bugs. So I think we need to work a bit
harder to stop the looping and throw an error instead.

Perhaps something as simple as keeping a loop counter and giving up
after 1000 attempts would be good enough. The window between releasing
the lock on the revmap, and acquiring the lock on the page containing
the MMTuple is very narrow, so the chances of losing that race to a
concurrent update more than 1-2 times in a row is vanishingly small.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Heikki Linnakangas wrote:
> I couldn't resist starting to hack on this, and implemented the
> scheme I've been having in mind:
>
> 1. MMTuple contains the block number of the heap page (range) that
> the tuple represents. Vacuum is no longer needed to clean up old
> tuples; when an index tuple is updated, the old tuple is deleted
> atomically with the insertion of a new tuple and updating the
> revmap, so no garbage is left behind.
>
> 2. LockTuple is gone. When following the pointer from revmap to
> MMTuple, the block number is used to check that you land on the
> right tuple. If not, the search is started over, looking at the
> revmap again.

Thanks, looks good, yeah. Did you just forget to attach the
access/rmgrdesc/minmaxdesc.c file, or did you ignore it altogether?
Anyway I hacked one up, and cleaned up some other things.

> I'm sure this still needs some cleanup, but here's the patch, based
> on your v14. Now that I know what this approach looks like, I still
> like it much better. The insert and update code is somewhat more
> complicated, because you have to be careful to lock the old page,
> new page, and revmap page in the right order. But it's not too bad,
> and it gets rid of all the complexity in vacuum.

It seems there is some issue here, because pageinspect tells me the
index is not growing properly for some reason. minmax_revmap_data gives
me this array of TIDs after a bunch of insert/vacuum/delete/ etc:

"(2,1)","(2,2)","(2,3)","(2,4)","(2,5)","(4,1)","(5,1)","(6,1)","(7,1)",
"(8,1)","(9,1)","(10,1)","(11,1)","(12,1)","(13,1)","(14,1)","(15,1)",
"(16,1)","(17,1)","(18,1)","(19,1)","(20,1)","(21,1)","(22,1)","(23,1)",
"(24,1)","(25,1)","(26,1)","(27,1)","(28,1)","(29,1)","(30,1)","(31,1)",
"(32,1)","(33,1)","(34,1)","(35,1)","(36,1)","(37,1)","(38,1)","(39,1)",
"(40,1)","(41,1)","(42,1)","(43,1)","(44,1)","(45,1)","(46,1)","(47,1)",
"(48,1)","(49,1)","(50,1)","(51,1)","(52,1)","(53,1)","(54,1)","(55,1)",
"(56,1)","(57,1)","(58,1)","(59,1)","(60,1)","(61,1)","(62,1)","(63,1)",
"(64,1)","(65,1)","(66,1)","(67,1)","(68,1)","(69,1)","(70,1)","(71,1)",
"(72,1)","(73,1)","(74,1)","(75,1)","(76,1)","(77,1)","(78,1)","(79,1)",
"(80,1)","(81,1)","(82,1)","(83,1)","(84,1)","(85,1)","(86,1)","(87,1)",
"(88,1)","(89,1)","(90,1)","(91,1)","(92,1)","(93,1)","(94,1)","(95,1)",
"(96,1)","(97,1)","(98,1)","(99,1)","(100,1)","(101,1)","(102,1)",
"(103,1)","(104,1)","(105,1)","(106,1)","(107,1)","(108,1)","(109,1)",
"(110,1)","(111,1)","(112,1)","(113,1)","(114,1)","(115,1)","(116,1)",
"(117,1)","(118,1)","(119,1)","(120,1)","(121,1)","(122,1)","(123,1)",
"(124,1)","(125,1)","(126,1)","(127,1)","(128,1)","(129,1)","(130,1)",
"(131,1)","(132,1)","(133,1)","(134,1)"

There are some who would think that getting one item per page is
suboptimal. (Maybe it's just a missing FSM update somewhere.)

I've been hacking away a bit more at this; will post an updated patch
probably tomorrow (was about to post but just found a memory stomp in
pageinspect.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
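If a missing FSM update is indeed the culprit, the fix would be along
these lines. This is a hedged guess at the missing step, not the actual
patch code; the function is illustrative, though RecordPageWithFreeSpace
and PageGetFreeSpace are the real backend APIs.

	#include "postgres.h"
	#include "storage/bufmgr.h"
	#include "storage/bufpage.h"
	#include "storage/freespace.h"
	#include "utils/rel.h"

	/*
	 * After deleting an index tuple, report the page's new free space
	 * to the FSM; otherwise later insertions never find the hole and
	 * keep extending the index, one item per page.
	 */
	static void
	record_freespace_after_delete(Relation idxRel, Buffer buf)
	{
		BlockNumber blkno = BufferGetBlockNumber(buf);
		Size		freespace = PageGetFreeSpace(BufferGetPage(buf));

		RecordPageWithFreeSpace(idxRel, blkno, freespace);
	}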
Re: [HACKERS] Minmax indexes
On Fri, Aug 8, 2014 at 6:01 AM, Heikki Linnakangas wrote:
> It's possible that two backends arrive at phase 3 at the same time,
> with different values. For example, backend A wants to update the
> minimum to contain 10, and backend B wants to update it to 5. Now, if
> backend B gets to update the tuple first, to 5, backend A will update
> the tuple to 10 when it gets the lock, which is wrong.
>
> The simplest solution would be to get the buffer lock in exclusive
> mode to begin with, so that you don't need to release it between steps
> 2 and 5. That might be a significant hit on concurrency, though, when
> most of the insertions don't in fact have to update the value. Another
> idea is to re-check the updated values after acquiring the lock in
> exclusive mode, to see if they match the previous values.

No, the simplest solution is to re-check the bounds after acquiring the
exclusive lock. So instead of doing the addValue with a share lock, do
a consistency check first, and if it's not consistent, do the addValue
with an exclusive lock.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
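A sketch of that check-then-recheck pattern follows; it is illustrative
only, with range_consistent() and add_value() as hypothetical stand-ins
for the opclass calls. The LockBuffer() interface is the real one.

	#include "postgres.h"
	#include "storage/bufmgr.h"

	/* hypothetical helpers standing in for the opclass support procs */
	extern bool range_consistent(Buffer buf, Datum newval);
	extern void add_value(Buffer buf, Datum newval);

	static void
	mm_update_summary(Buffer buf, Datum newval)
	{
		/* cheap path: most insertions don't widen the stored interval */
		LockBuffer(buf, BUFFER_LOCK_SHARE);
		if (range_consistent(buf, newval))
		{
			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
			return;
		}
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);

		/*
		 * The summary must be widened. Take the exclusive lock and test
		 * again: another backend may have widened it differently while
		 * we held no lock, and applying our stale values blindly would
		 * overwrite theirs (the 10-vs-5 scenario above).
		 */
		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
		if (!range_consistent(buf, newval))
			add_value(buf, newval);
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	}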
Re: [HACKERS] Minmax indexes
On 08/10/2014 12:42 PM, Simon Riggs wrote:
> On 8 August 2014 16:03, Heikki Linnakangas wrote:
>> I couldn't resist starting to hack on this, and implemented the
>> scheme I've been having in mind:
>>
>> 1. MMTuple contains the block number of the heap page (range) that
>> the tuple represents. Vacuum is no longer needed to clean up old
>> tuples; when an index tuple is updated, the old tuple is deleted
>> atomically with the insertion of a new tuple and updating the revmap,
>> so no garbage is left behind.
>>
>> 2. LockTuple is gone. When following the pointer from revmap to
>> MMTuple, the block number is used to check that you land on the right
>> tuple. If not, the search is started over, looking at the revmap
>> again.
>
> Part 2 sounds interesting, especially because of the reduction in CPU
> that it might allow. Part 1 doesn't sound good yet. Are they
> connected?

Yes. The optimistic locking in part 2 is based on checking that the
block number on the MMTuple matches what you're searching for, and that
there is never more than one MMTuple in the index with the same block
number.

> More importantly, can't we tweak this after commit? Delaying commit
> just means less time for other people to see, test, understand, tune
> and fix. I see you (Heikki) doing lots of incremental development,
> lots of small commits. Can't we do this one the same way?

Well, I wouldn't consider "let's redesign how locking and vacuuming
work and change the on-disk format" as incremental development ;-).
It's more like, well, redesigning the whole thing. Any testing and
tuning would certainly need to be redone after such big changes. If you
agree that these changes make sense, let's do them now and not waste
people's time testing and tuning a dead-end design. If you don't agree,
then let's discuss that.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 08/10/2014 12:22 PM, Simon Riggs wrote:
> On 8 August 2014 16:03, Heikki Linnakangas wrote:
>> 1. MMTuple contains the block number of the heap page (range) that
>> the tuple represents. Vacuum is no longer needed to clean up old
>> tuples; when an index tuple is updated, the old tuple is deleted
>> atomically with the insertion of a new tuple and updating the revmap,
>> so no garbage is left behind.
>
> What happens if the transaction that does this aborts? Surely that
> means the new value is itself garbage? What cleans up that?

It's no different from Alvaro's patch. The updated MMTuple covers the
aborted value, but that's OK from a correctness point of view.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 8 August 2014 16:03, Heikki Linnakangas wrote:
> I couldn't resist starting to hack on this, and implemented the scheme
> I've been having in mind:
>
> 1. MMTuple contains the block number of the heap page (range) that the
> tuple represents. Vacuum is no longer needed to clean up old tuples;
> when an index tuple is updated, the old tuple is deleted atomically
> with the insertion of a new tuple and updating the revmap, so no
> garbage is left behind.
>
> 2. LockTuple is gone. When following the pointer from revmap to
> MMTuple, the block number is used to check that you land on the right
> tuple. If not, the search is started over, looking at the revmap again.

Part 2 sounds interesting, especially because of the reduction in CPU
that it might allow. Part 1 doesn't sound good yet. Are they connected?

More importantly, can't we tweak this after commit? Delaying commit
just means less time for other people to see, test, understand, tune
and fix.

I see you (Heikki) doing lots of incremental development, lots of small
commits. Can't we do this one the same way?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 8 August 2014 10:01, Heikki Linnakangas wrote:
> It's possible that two backends arrive at phase 3 at the same time,
> with different values. For example, backend A wants to update the
> minimum to contain 10, and backend B wants to update it to 5. Now, if
> backend B gets to update the tuple first, to 5, backend A will update
> the tuple to 10 when it gets the lock, which is wrong.
>
> The simplest solution would be to get the buffer lock in exclusive
> mode to begin with, so that you don't need to release it between steps
> 2 and 5. That might be a significant hit on concurrency, though, when
> most of the insertions don't in fact have to update the value. Another
> idea is to re-check the updated values after acquiring the lock in
> exclusive mode, to see if they match the previous values.

The simplest solution is to re-apply the test just before the update,
so in the above example, if we think we want to lower the minimum to 10
and when we get there it is already 5, we just don't update.

We don't need to do the re-check always, though. We can read the page
LSN while holding the share lock, then re-read it once we acquire the
exclusive lock. If the LSN is the same, there is no need for
datatype-specific re-checks at all.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
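A sketch of the LSN shortcut follows (illustrative only;
range_consistent() and add_value() are hypothetical helpers). The idea:
if the page LSN did not change between dropping the share lock and
taking the exclusive lock, no WAL-logged modification touched the page,
so the decision made under the share lock still holds.

	#include "postgres.h"
	#include "storage/bufmgr.h"
	#include "storage/bufpage.h"

	extern bool range_consistent(Buffer buf, Datum newval);	/* hypothetical */
	extern void add_value(Buffer buf, Datum newval);		/* hypothetical */

	static void
	mm_update_summary_lsn(Buffer buf, Datum newval)
	{
		Page		page = BufferGetPage(buf);
		XLogRecPtr	lsn;

		LockBuffer(buf, BUFFER_LOCK_SHARE);
		if (range_consistent(buf, newval))
		{
			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
			return;				/* nothing to widen */
		}
		lsn = PageGetLSN(page);
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);

		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
		if (PageGetLSN(page) == lsn ||		/* unchanged: decision holds */
			!range_consistent(buf, newval))	/* changed: redo the full check */
			add_value(buf, newval);
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	}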
Re: [HACKERS] Minmax indexes
On 8 August 2014 16:03, Heikki Linnakangas wrote:
> 1. MMTuple contains the block number of the heap page (range) that the
> tuple represents. Vacuum is no longer needed to clean up old tuples;
> when an index tuple is updated, the old tuple is deleted atomically
> with the insertion of a new tuple and updating the revmap, so no
> garbage is left behind.

What happens if the transaction that does this aborts? Surely that
means the new value is itself garbage? What cleans up that?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Another race condition:

If a new tuple is inserted to the range while summarization runs, it's
possible that the new tuple isn't included in the tuple that the
summarization calculated, nor does the insertion itself update it.

1. There is no index tuple for page range 1-10.
2. Summarization begins. It scans pages 1-5.
3. A new insertion inserts a heap tuple to page 1.
4. The insertion sees that there is no index tuple covering range 1-10,
   so it doesn't update it.
5. The summarization finishes scanning pages 5-10, and inserts the new
   index tuple. The summarization didn't see the newly inserted heap
   tuple, and hence it's not included in the calculated index tuple.

One idea is to do the summarization in two stages. First, insert a
placeholder tuple, with no real value in it. A query considers the
placeholder tuple the same as a missing tuple, i.e. it always considers
it a match. An insertion updates the placeholder tuple with the value
inserted, as if it were a regular mmtuple. After summarization has
finished scanning the page range, it turns the placeholder tuple into a
regular tuple, by unioning the placeholder value with the value formed
by scanning the heap.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
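Spelled out as code, the two-stage protocol might look as follows. Every
name here (mm_insert_placeholder, scan_heap_range, placeholder_value,
union_summaries, replace_placeholder, MMSummary) is a hypothetical
stand-in; only the sequence of steps comes from the message above.

	#include "postgres.h"
	#include "utils/rel.h"

	/* hypothetical stand-ins for the real summarization machinery */
	typedef struct MMSummary MMSummary;
	extern void mm_insert_placeholder(Relation idxRel, BlockNumber rangeStart);
	extern MMSummary *scan_heap_range(Relation heapRel,
									  BlockNumber start, BlockNumber end);
	extern MMSummary *placeholder_value(Relation idxRel, BlockNumber rangeStart);
	extern MMSummary *union_summaries(MMSummary *a, MMSummary *b);
	extern void replace_placeholder(Relation idxRel, BlockNumber rangeStart,
									MMSummary *final);

	static void
	summarize_range_sketch(Relation heapRel, Relation idxRel,
						   BlockNumber rangeStart, BlockNumber rangeEnd)
	{
		MMSummary  *scanned;

		/*
		 * Stage 1: publish a placeholder tuple. Queries treat it as
		 * matching everything; concurrent inserters union their values
		 * into it instead of skipping the range (step 4 in the race).
		 */
		mm_insert_placeholder(idxRel, rangeStart);

		/* scan the heap; inserts racing with us update the placeholder */
		scanned = scan_heap_range(heapRel, rangeStart, rangeEnd);

		/*
		 * Stage 2: turn the placeholder into a regular tuple by unioning
		 * what concurrent inserters accumulated with what the scan saw.
		 * Every tuple inserted during the scan is covered by one side or
		 * the other, which closes the race.
		 */
		replace_placeholder(idxRel, rangeStart,
							union_summaries(placeholder_value(idxRel, rangeStart),
											scanned));
	}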
Re: [HACKERS] Minmax indexes
I think there's a race condition in mminsert, if two backends insert a
tuple to the same heap page range concurrently. mminsert does this:

1. Fetch the MMtuple for the page range.
2. Check if any of the stored datums need updating.
3. Unlock the page.
4. Lock the page again in exclusive mode.
5. Update the tuple.

It's possible that two backends arrive at phase 3 at the same time,
with different values. For example, backend A wants to update the
minimum to contain 10, and backend B wants to update it to 5. Now, if
backend B gets to update the tuple first, to 5, backend A will update
the tuple to 10 when it gets the lock, which is wrong.

The simplest solution would be to get the buffer lock in exclusive mode
to begin with, so that you don't need to release it between steps 2 and
5. That might be a significant hit on concurrency, though, when most of
the insertions don't in fact have to update the value. Another idea is
to re-check the updated values after acquiring the lock in exclusive
mode, to see if they match the previous values.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 08/07/2014 05:52 PM, Michael Paquier wrote: > On Fri, Aug 8, 2014 at 9:47 AM, Josh Berkus wrote: >> On 08/07/2014 08:38 AM, Oleg Bartunov wrote: >>> +1 for BRIN ! >>> >>> On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs wrote: On 7 August 2014 14:53, Robert Haas wrote: A better description would be "block range index" since we are indexing a range of blocks (not just one block). Perhaps a better one would be simply "range index", which we could abbreviate to RIN or BRIN. >> >> How about Block Range Dynamic indexes? >> >> Or Range Usage Metadata indexes? >> >> You see what I'm getting at: >> >> BRanDy >> >> RUM >> >> ... to keep with our "new indexes" naming scheme ... > Not the best fit for kids, fine for grad students. But, it goes perfectly with our GIN and VODKA indexes. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Aug 7, 2014 at 7:58 AM, Robert Haas wrote: > range index might get confused with range types; block range index > seems better. I like summary, but I'm fine with block range index or > block filter index, too. +1 -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Aug 8, 2014 at 9:47 AM, Josh Berkus wrote: > On 08/07/2014 08:38 AM, Oleg Bartunov wrote: >> +1 for BRIN ! >> >> On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs wrote: >>> On 7 August 2014 14:53, Robert Haas wrote: >>> A better description would be "block range index" since we are >>> indexing a range of blocks (not just one block). Perhaps a better one >>> would be simply "range index", which we could abbreviate to RIN or >>> BRIN. > > How about Block Range Dynamic indexes? > > Or Range Usage Metadata indexes? > > You see what I'm getting at: > > BRanDy > > RUM > > ... to keep with our "new indexes" naming scheme ... Not the best fit for kids, fine for grad students. BRIN seems to be a perfect consensus, so +1 for it. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 08/07/2014 08:38 AM, Oleg Bartunov wrote: > +1 for BRIN ! > > On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs wrote: >> On 7 August 2014 14:53, Robert Haas wrote: >> A better description would be "block range index" since we are >> indexing a range of blocks (not just one block). Perhaps a better one >> would be simply "range index", which we could abbreviate to RIN or >> BRIN. How about Block Range Dynamic indexes? Or Range Usage Metadata indexes? You see what I'm getting at: BRanDy RUM ... to keep with our "new indexes" naming scheme ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 07/08/14 16:16, Simon Riggs wrote: A better description would be "block range index" since we are indexing a range of blocks (not just one block). Perhaps a better one would be simply "range index", which we could abbreviate to RIN or BRIN. +1 for block range index (BRIN) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
2014-08-07 Oleg Bartunov : > +1 for BRIN ! +1, rolls off the tongue smoothly and captures the essence :-). Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
+1 for BRIN ! On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs wrote: > On 7 August 2014 14:53, Robert Haas wrote: >> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier >> wrote: >>> 2014-08-06 Claudio Freire : >>> So, I like blockfilter a lot. I change my vote to blockfilter ;) >>> >>> +1 for blockfilter, because it stresses the fact that the "physical" >>> arrangement of rows in blocks matters for this index. >> >> I don't like that quite as well as summary, but I'd prefer either to >> the current naming. > > Yes, "summary index" isn't good. I'm not sure where the block or the > filter part comes in though, so -1 to "block filter", not least > because it doesn't have a good abbreviation (bfin??). > > A better description would be "block range index" since we are > indexing a range of blocks (not just one block). Perhaps a better one > would be simply "range index", which we could abbreviate to RIN or > BRIN. > > -- > Simon Riggs http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Aug 7, 2014 at 10:16 AM, Simon Riggs wrote: > On 7 August 2014 14:53, Robert Haas wrote: >> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier >> wrote: >>> 2014-08-06 Claudio Freire : >>> So, I like blockfilter a lot. I change my vote to blockfilter ;) >>> >>> +1 for blockfilter, because it stresses the fact that the "physical" >>> arrangement of rows in blocks matters for this index. >> >> I don't like that quite as well as summary, but I'd prefer either to >> the current naming. > > Yes, "summary index" isn't good. I'm not sure where the block or the > filter part comes in though, so -1 to "block filter", not least > because it doesn't have a good abbreviation (bfin??). > > A better description would be "block range index" since we are > indexing a range of blocks (not just one block). Perhaps a better one > would be simply "range index", which we could abbreviate to RIN or > BRIN. range index might get confused with range types; block range index seems better. I like summary, but I'm fine with block range index or block filter index, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Simon Riggs wrote:
> On 7 August 2014 14:53, Robert Haas wrote:
>> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier wrote:
>>> 2014-08-06 Claudio Freire :
>>>
>>>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>>>
>>> +1 for blockfilter, because it stresses the fact that the "physical"
>>> arrangement of rows in blocks matters for this index.
>>
>> I don't like that quite as well as summary, but I'd prefer either to
>> the current naming.
>
> Yes, "summary index" isn't good. I'm not sure where the block or the
> filter part comes in though, so -1 to "block filter", not least
> because it doesn't have a good abbreviation (bfin??).

I was thinking just "blockfilter" (I did show a sample command).
Claudio explained the name downthread; personally, of all the options
suggested thus far, it's the one I like the most (including minmax).

At this point, the naming issue is what is keeping me from committing
this patch, so the quicker we can solve it, the merrier.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Aug 7, 2014 at 11:16 AM, Simon Riggs wrote: > On 7 August 2014 14:53, Robert Haas wrote: >> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier >> wrote: >>> 2014-08-06 Claudio Freire : >>> So, I like blockfilter a lot. I change my vote to blockfilter ;) >>> >>> +1 for blockfilter, because it stresses the fact that the "physical" >>> arrangement of rows in blocks matters for this index. >> >> I don't like that quite as well as summary, but I'd prefer either to >> the current naming. > > Yes, "summary index" isn't good. I'm not sure where the block or the > filter part comes in though, so -1 to "block filter", not least > because it doesn't have a good abbreviation (bfin??). Block filter would refer to the index property that selects blocks, not tuples, and it does so through a "filter function" (for min-max, it's a range check, but for other opclasses it could be anything). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 7 August 2014 14:53, Robert Haas wrote: > On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier > wrote: >> 2014-08-06 Claudio Freire : >> >>> So, I like blockfilter a lot. I change my vote to blockfilter ;) >> >> +1 for blockfilter, because it stresses the fact that the "physical" >> arrangement of rows in blocks matters for this index. > > I don't like that quite as well as summary, but I'd prefer either to > the current naming. Yes, "summary index" isn't good. I'm not sure where the block or the filter part comes in though, so -1 to "block filter", not least because it doesn't have a good abbreviation (bfin??). A better description would be "block range index" since we are indexing a range of blocks (not just one block). Perhaps a better one would be simply "range index", which we could abbreviate to RIN or BRIN. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier wrote: > 2014-08-06 Claudio Freire : > >> So, I like blockfilter a lot. I change my vote to blockfilter ;) > > +1 for blockfilter, because it stresses the fact that the "physical" > arrangement of rows in blocks matters for this index. I don't like that quite as well as summary, but I'd prefer either to the current naming. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
2014-08-06 Claudio Freire : > So, I like blockfilter a lot. I change my vote to blockfilter ;) +1 for blockfilter, because it stresses the fact that the "physical" arrangement of rows in blocks matters for this index. Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 1:55 PM, Alvaro Herrera wrote: > Claudio Freire wrote: >> On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera >> wrote: >> > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"? >> > CREATE INDEX foo ON t USING comprange (cols) >> > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh >> > -- or use an identifier with whitespace: >> > CREATE INDEX foo ON t USING "compressed range" (cols) >> >> The word you'd use there is not necessarily the one you use on the >> framework, since the framework applies to many such techniques, but >> the index type there is one specific one. >> >> The create command can still use minmax, or rangemap if you prefer >> that, while the framework's code uses summary or summarizing. > > I think you're confusing the AM name with the opclass name. The name > you specify in that part of the command is the access method name. You > can specify the opclass together with each column, like so: > > CREATE INDEX foo ON t USING blockfilter > (order_date date_minmax_ops, geometry gis_bbox_ops); Oh, uh... no, I'm not confusing them, but now I just realized how one would implement other classes of block filtering indexes, and yeah... you do it through the opclasses. I'm sticking to bloom filters: CREATE INDEX foo ON t USING blockfilter (order_date date_minmax_ops, path character_bloom_ops); Cool. Very cool. So, I like blockfilter a lot. I change my vote to blockfilter ;) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 1:35 PM, Bruce Momjian wrote: > On Wed, Aug 6, 2014 at 01:31:14PM -0300, Claudio Freire wrote: >> On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera >> wrote: >> > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"? >> > CREATE INDEX foo ON t USING comprange (cols) >> > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh >> > -- or use an identifier with whitespace: >> > CREATE INDEX foo ON t USING "compressed range" (cols) >> >> >> The word you'd use there is not necessarily the one you use on the >> framework, since the framework applies to many such techniques, but >> the index type there is one specific one. > > "Block filter" indexes? Nice one -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera wrote: > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"? > CREATE INDEX foo ON t USING comprange (cols) > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh > -- or use an identifier with whitespace: > CREATE INDEX foo ON t USING "compressed range" (cols) The word you'd use there is not necessarily the one you use on the framework, since the framework applies to many such techniques, but the index type there is one specific one. The create command can still use minmax, or rangemap if you prefer that, while the framework's code uses summary or summarizing. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 01:31:14PM -0300, Claudio Freire wrote: > On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera > wrote: > > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"? > > CREATE INDEX foo ON t USING comprange (cols) > > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh > > -- or use an identifier with whitespace: > > CREATE INDEX foo ON t USING "compressed range" (cols) > > > The word you'd use there is not necessarily the one you use on the > framework, since the framework applies to many such techniques, but > the index type there is one specific one. "Block filter" indexes? > The create command can still use minmax, or rangemap if you prefer > that, while the framework's code uses summary or summarizing. "Summary" sounds like materialized views to me. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera wrote:
> "Summary" seems good. If I get enough votes I can change it to that.
>
> CREATE INDEX foo ON t USING summary (cols)
>
> "Summarizing" seems weird on that command. Not sure about "compressed
> range", as you would have to use an abbreviation or run the words
> together.

Summarizing index sounds better to my ears, but both ideas based on
"summary" are quite succinct and to-the-point descriptions of what's
happening, so I vote for those.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Aug 5, 2014 at 7:55 PM, Josh Berkus wrote: > On 08/05/2014 04:41 PM, Alvaro Herrera wrote: >> I have chosen to keep the name "minmax", even if the opclasses now let >> one implement completely different things on top of it such as geometry >> bounding boxes and bloom filters (aka bitmap indexes). I don't see a >> need for a rename: essentially, in PR we can just say "we have these >> neat minmax indexes that other databases also have, but instead of just >> being used for integer data, they can also be used for geometry, GIS and >> bitmap indexes, so as always we're more powerful than everyone else when >> implementing new database features". > > Plus we haven't come up with a better name ... Several good suggestions have been made, like "summarizing" or "summary" indexes and "compressed range" indexes. I still really dislike the present name - you might think this is a type of index that has something to do with optimizing "min" and "max", but what it really is is a kind of small index for a big table. The current name couldn't make that less clear. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
FWIW I think I haven't responded appropriately to the points raised by
Heikki. Basically, as I see it there are three main items:

1. The revmap physical-to-logical mapping is too complex; let's use
something else.

We had the revmap originally in a separate fork. The current approach
grew out of the necessity of putting it in the main fork while ensuring
that fast access to individual pages is possible. There are of course
many ways to skin this cat; Heikki's proposal is to have it always
occupy the first few physical pages, rather than require a
logical-to-physical mapping table. To implement this he proposes to
move other pages out of the way as the index grows.

I don't really have much love for this idea. We can change how this is
implemented later in the cycle, if we find that a different approach is
better than my proposal. I don't want to spend endless time meddling
with this (and I definitely don't want to have this delay the eventual
commit of the patch.)

2. Vacuuming is not optimal.

Right now, to eliminate garbage index tuples we need to first scan the
revmap to figure out which tuples are unreferenced. There is a concern
that if there's an excess of dead tuples, the index becomes
unvacuumable because palloc() fails due to request size. This is
largely theoretical, because for this to happen there need to be
several million dead index tuples. As a minimal fix to alleviate this
problem without requiring a complete rework of vacuuming, we can cap
that palloc request to maintenance_work_mem and remove dead tuples in a
loop instead of trying to remove all of them in a single pass.

Another thing proposed was to store range numbers (or just heap page
numbers) within each index tuple. I felt that this would add more bloat
unnecessarily. However, there is some padding space in the index tuple
that maybe we can use to store range numbers. I will think some more
about how we can use this to simplify vacuuming.

3. Avoid MMTuple, as it is just unnecessary extra complexity.

The main thing that MMTuple adds is not the fact that we save 2 bytes
by storing a BlockNumber as-is instead of within a TID field. Instead,
it's that we can construct and deconstruct using our own design, which
means we can use however many Datum entries we want and however many
"null" flags. In normal heap and index tuples, there are always the
same number of datums/nulls. In minmax, the number of nulls is twice
the number of indexed columns; the number of datum values is determined
by how many datum values are stored per opclass ("sortable" opclasses
store 2 columns, but geometry would store only one). If we were to use
regular IndexTuples, we would lose that .. and I have no idea how it
would work.

In other words, MMTuples look fine to me.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
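The capped-allocation loop from item 2 might look roughly like this.
The helpers collect_dead_tuples() and remove_index_tuples() are
hypothetical, and the real patch may batch differently; the point is
only that the dead-TID buffer never exceeds maintenance_work_mem, so an
index with millions of dead tuples is vacuumed in several passes rather
than with one huge palloc().

	#include "postgres.h"
	#include "miscadmin.h"
	#include "storage/itemptr.h"
	#include "utils/rel.h"

	/* hypothetical helpers standing in for the real vacuum internals */
	extern long collect_dead_tuples(Relation idxRel, ItemPointer deadtids,
									long max_tids);
	extern void remove_index_tuples(Relation idxRel, ItemPointer deadtids,
									long ndead);

	static void
	mm_vacuum_sketch(Relation idxRel)
	{
		/* maintenance_work_mem is in kilobytes */
		long		max_tids = ((Size) maintenance_work_mem * 1024L) /
								sizeof(ItemPointerData);
		ItemPointer	deadtids = palloc(max_tids * sizeof(ItemPointerData));
		long		ndead;

		do
		{
			/* gather up to max_tids unreferenced tuples from the revmap scan */
			ndead = collect_dead_tuples(idxRel, deadtids, max_tids);
			if (ndead > 0)
				remove_index_tuples(idxRel, deadtids, ndead);
		} while (ndead == max_tids);	/* buffer was full: there may be more */

		pfree(deadtids);
	}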
Re: [HACKERS] Minmax indexes
On 08/05/2014 04:41 PM, Alvaro Herrera wrote: > I have chosen to keep the name "minmax", even if the opclasses now let > one implement completely different things on top of it such as geometry > bounding boxes and bloom filters (aka bitmap indexes). I don't see a > need for a rename: essentially, in PR we can just say "we have these > neat minmax indexes that other databases also have, but instead of just > being used for integer data, they can also be used for geometry, GIS and > bitmap indexes, so as always we're more powerful than everyone else when > implementing new database features". Plus we haven't come up with a better name ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 07/10/2014 12:41 AM, Alvaro Herrera wrote: Heikki Linnakangas wrote: On 06/23/2014 08:07 PM, Alvaro Herrera wrote: I feel that the below would nevertheless be simpler: I wonder if it would be simpler to just always store the revmap pages in the beginning of the index, before any other pages. Finding the revmap page would then be just as easy as with a separate fork. When the table/index is extended so that a new revmap page is needed, move the existing page at that block out of the way. Locking needs some consideration, but I think it would be feasible and simpler than you have now. Moving index items around is not easy, because you'd have to adjust the revmap to rewrite the item pointers. Hmm. Two alternative schemes come to mind: 1. Move each index tuple off the page individually, updating the revmap while you do it, until the page is empty. Updating the revmap for a single index tuple isn't difficult; you have to do it anyway when an index tuple is replaced. (MMTuples don't contain a heap block number ATM, but IMHO they should, see below) 2. Store the new block number of the page that you moved out of the way in the revmap page, and leave the revmap pointers unchanged. The revmap pointers can be updated later, lazily. Both of those seem pretty straightforward. The trouble I have with moving blocks around to make space is that it would cause the index to have periodic hiccups to make room for the new revmap pages. One nice property that these indexes are supposed to have is that the effect on insertion times should be pretty minimal. That would cease to be the case if we have to do your proposed block moves. Approach 2 above is fairly quick, quick enough that no-one would notice the "hiccup". Moving the tuples individually (approach 1) would be slower. ISTM that when the old tuple cannot be updated in-place, the new index tuple is inserted with mm_doinsert(), but the old tuple is never deleted. It's deleted by the next vacuum. Ah I see. Vacuum reads the whole index, and builds an in-memory hash table that contains an ItemPointerData for every tuple in the index. Doesn't that require a lot of memory, for a large index? That might be acceptable - you ought to have plenty of RAM if you're pushing around multi-terabyte tables - but it would nevertheless be nice to not have a hard requirement for something as essential as vacuum. I guess if you're expecting that pages_per_range=1 is a common case, yeah it might become an issue eventually. Not sure, but I find it easier to think of the patch that way. In any case, it would be nice to avoid the problem, even if it's not common. One idea I just had is to have a bit for each index tuple, which is set whenever the revmap no longer points to it. That way, vacuuming is much easier: just scan the index and delete all tuples having that bit set. The bit needs to be set atomically with the insertion of the new tuple, so why not just remove the old tuple right away? Wouldn't it be simpler to remove the old tuple atomically with inserting the new tuple and updating the revmap? Or at least mark the old tuple as deletable, so that vacuum can just delete it, without building the large hash table to determine that it's deletable. Yes, it might be simpler, but it'd require dirtying more pages on insertions (and holding more page-level locks, for longer. Not good for concurrent access). I wouldn't worry much about the performance and concurrency of this operation. 
Remember that the majority of updates are expected to not have to update the index, otherwise the minmax index will degenerate quickly and performance will suck anyway. And even when updating the index is needed, in most cases the new tuple fits on the same page, after removing the old one. So the case where you have to insert a new index tuple, remove the old one (or mark it dead), and update the revmap to point to the new tuple, is rare. I'm quite surprised by the use of LockTuple on the index tuples. I think the main reason for needing that is the fact that MMTuple doesn't store the heap (range) block number that the tuple points to: LockTuple is required to ensure that the tuple doesn't go away while a scan is following a pointer from the revmap to it. If the MMTuple contained the BlockNumber, a scan could check that and go back to the revmap if it doesn't match. Alternatively, you could keep the revmap page locked when you follow a pointer to the regular index page. There's the intention that these accesses be kept as concurrent as possible; this is why we don't want to block the whole page. Locking individual TIDs is fine in this case (unlike in SELECT FOR UPDATE) because we can only lock a single tuple in any one index scan, so there's no unbounded growth of the lock table. I prefer not to have BlockNumbers in index tuples, because that would make them larger for not much gain. That data would mostly be redundant, and would be necessary only for vacuuming.
Re: [HACKERS] Minmax indexes
On Wed, Jul 9, 2014 at 5:16 PM, Alvaro Herrera wrote: > The way it works now, each opclass needs to have three support > procedures; I've called them getOpers, maybeUpdateValues, and compare. > (I realize these names are pretty bad, and will be changing them.) > getOpers is used to obtain information about what is stored for that > data type; it says how many datum values are stored for a column of that > type (two for sortable: min and max), and how many operators it needs > set up. Then, the generic code fills in a MinmaxDesc(riptor) and creates > an initial DeformedMMTuple (which is a rather ugly name for a minmax > tuple held in memory). The maybeUpdateValues amproc can then be called > when there's a new heap tuple, which updates the DeformedMMTuple to > account for the new tuple (in essence, it's a union of the original > values and the new tuple). This can be done repeatedly (when a new > index is being created) or only once (when a new heap tuple is inserted > into an existing index). There is no need for an "aggregate". > > This DeformedMMTuple can easily be turned into the on-disk > representation; there is no hardcoded assumption on the number of index > values stored per heap column, so it is possible to build an opclass > that stores a bounding box column for a geometry heap column, for > instance. > > Then we have the "compare" amproc. This is used during index scans; > after extracting an index tuple, it is turned into DeformedMMTuple, and > the "compare" amproc for each column is called with the values of scan > keys. (Now that I think about this, it seems pretty much what > "consistent" is for GiST opclasses). A true return value indicates that > the scan key matches the page range boundaries and thus all pages in the > range are added to the output TID bitmap. This sounds really great. I agree that it needs some renaming. I think renaming what you are calling "compare" to "consistent" would be an excellent idea, to match GiST. "maybeUpdateValues" sounds like it does the equivalent of GiST's "compress" on the new value followed by a "union" with the existing summary item. I don't think it's necessary to separate those out, though. You could perhaps call it something like "add_item". Also, FWIW, I liked Peter's idea of calling these "summarizing indexes" or perhaps "summary" would be a bit shorter and mean the same thing. "minmax" wouldn't be the end of the world, but since you've gone to the trouble of making this more generic I think giving it a more generic name would be a very good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
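To make the sortable case concrete, the core of maybeUpdateValues / "add_item" for a min/max opclass is just two comparisons. The following is only a rough sketch: it calls the opclass's "<" and ">" operators through FunctionCall2() and ignores the real DeformedMMTuple machinery, nulls, and collations.

    /* widen the stored [min,max] pair to cover a newly inserted value */
    static bool
    minmax_add_item(FmgrInfo *lt, FmgrInfo *gt, Datum newval, Datum *stored)
    {
        bool    changed = false;

        if (DatumGetBool(FunctionCall2(lt, newval, stored[0])))
        {
            stored[0] = newval;     /* sorts below the current minimum */
            changed = true;
        }
        if (DatumGetBool(FunctionCall2(gt, newval, stored[1])))
        {
            stored[1] = newval;     /* sorts above the current maximum */
            changed = true;
        }
        return changed;
    }

When it returns false the index tuple needs no update at all, which is the common case these indexes are betting on.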
Re: [HACKERS] Minmax indexes
On Fri, Jul 11, 2014 at 3:47 PM, Greg Stark wrote: > On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire > wrote: >> Marking as read-only is ok, or emitting a NOTICE so that if anyone >> changes those parameters that change the shape of the index, they know >> it needs a rebuild would be OK too. Both mechanisms work for me. > > We don't actually have any of these mechanisms. They wouldn't be bad > things to have but I don't think we should gate adding new types of > indexes on adding them. In particular, the index could just hard code > a value for these parameters and having them be parameterized is > clearly better even if that doesn't produce all the warnings or > rebuild things automatically or whatever. No, I agree, it's just a nice-to-have. But at least the docs should mention it. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire wrote: > Marking as read-only is ok, or emitting a NOTICE so that if anyone > changes those parameters that change the shape of the index, they know > it needs a rebuild would be OK too. Both mechanisms work for me. We don't actually have any of these mechanisms. They wouldn't be bad things to have but I don't think we should gate adding new types of indexes on adding them. In particular, the index could just hard code a value for these parameters and having them be parameterized is clearly better even if that doesn't produce all the warnings or rebuild things automatically or whatever. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 4:20 PM, Alvaro Herrera wrote: > Claudio Freire wrote: >> On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera >> wrote: >> > Another thing I noticed is that version 8 of the patch blindly believed >> > the "pages_per_range" declared in catalogs. This meant that if somebody >> > did "alter index foo set pages_per_range=123" the index would >> > immediately break (i.e. return corrupted results when queried). I have >> > fixed this by storing the pages_per_range value used to construct the >> > index in the metapage. Now if you do the ALTER INDEX thing, the new >> > value is only used when the index is recreated by REINDEX. >> >> This seems a lot like parameterizing. > > I don't understand what that means -- care to elaborate? We've been talking about bloom filters, and how their shape differs according to the parameters of the bloom filter (number of hashes, hash type, etc). But after seeing this case of pages_per_range, I noticed it's an effective-enough mechanism. Like: CREATE INDEX ix_blah ON some_table USING bloom (somecol) WITH (BLOOM_HASHES=15, BLOOM_BUCKETS=1024, PAGES_PER_RANGE=64); Marking as read-only is ok, or emitting a NOTICE so that if anyone changes those parameters that change the shape of the index, they know it needs a rebuild would be OK too. Both mechanisms work for me. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Fujii Masao wrote: > On Thu, Jul 10, 2014 at 6:16 AM, Alvaro Herrera > wrote: > > Here's a new version of this patch, which is more generic than the original > > versions, and similar to what you describe. > > I've not read the discussion so far at all, but I found a problem > when I played with this patch. Sorry if this has already been discussed. > > =# create table test as select num from generate_series(1,10) num; > SELECT 10 > =# create index testidx on test using minmax (num); > CREATE INDEX > =# alter table test alter column num type text; > ERROR: could not determine which collation to use for string comparison > HINT: Use the COLLATE clause to set the collation explicitly. Ah, yes, I need to pass down collation OIDs to comparison functions. That's marked as XXX in various places in the code. Sorry I forgot to mention that. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 10 July 2014 00:13, Alvaro Herrera wrote: > Josh Berkus wrote: >> On 07/09/2014 02:16 PM, Alvaro Herrera wrote: >> > The way it works now, each opclass needs to have three support >> > procedures; I've called them getOpers, maybeUpdateValues, and compare. >> > (I realize these names are pretty bad, and will be changing them.) >> >> I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update >> the values, maybe not." ;-) > > :-) Well, that's exactly what happens. If we insert a new tuple into > the table, and the existing summarizing tuple (to use Peter's term) > already covers it, then we don't need to update the index tuple at all. > What this name doesn't say is what values are to be maybe-updated. There are lots of functions that maybe-do-things; that's just modular programming. Not sure we need to prefix things with maybe to explain that, otherwise we'd have maybeXXX everywhere. A more descriptive name would be MaintainIndexBounds() or similar. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 9 July 2014 23:54, Peter Geoghegan wrote: > On Wed, Jul 9, 2014 at 2:16 PM, Alvaro Herrera > wrote: >> All this being said, I'm sticking to the name "Minmax indexes". There >> was a poll in pgsql-advocacy >> http://www.postgresql.org/message-id/53a0b4f8.8080...@agliodbs.com >> about a new name, but there were no suggestions supported by more than >> one person. If a brilliant new name comes up, I'm open to changing it. > > How about "summarizing indexes"? That seems reasonably descriptive. -1 for another name change. That boat sailed some months back. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 6:16 AM, Alvaro Herrera wrote: > Claudio Freire wrote: > >> An aggregate to generate a "compressed set" from several values >> A function which adds a new value to the "compressed set" and returns >> the new "compressed set" >> A function which tests if a value is in a "compressed set" >> A function which tests if a "compressed set" overlaps another >> "compressed set" of equal type >> >> If you can define different compressed sets, you can use this to >> generate both min/max indexes as well as bloom filter indexes. Whether >> we'd want to have both is perhaps questionable, but having the ability >> to is probably desirable. > > Here's a new version of this patch, which is more generic than the original > versions, and similar to what you describe. I've not read the discussion so far at all, but I found a problem when I played with this patch. Sorry if this has already been discussed. =# create table test as select num from generate_series(1,10) num; SELECT 10 =# create index testidx on test using minmax (num); CREATE INDEX =# alter table test alter column num type text; ERROR: could not determine which collation to use for string comparison HINT: Use the COLLATE clause to set the collation explicitly. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 10:29 PM, Alvaro Herrera wrote: > > What I think should happen is that if the value is changed, the index > should be rebuilt right there. I disagree. It would be a non-orthogonal interface if ALTER TABLE sometimes causes the index to be rebuilt and sometimes just makes a configuration change. I already see a lot of user confusion when some ALTER TABLE commands rewrite the table and some are quick metadata changes. Especially in this case where the type of configuration being changed is just an internal storage parameter and the user-visible shape of the index is unchanged it would be weird to rebuild the index. IMHO the "right" thing to do is just to say this parameter is read-only and have the AM throw an error when the user changes it. But even that would require an AM callback for the AM to even know about the change. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 2:30 PM, Jaime Casanova wrote: > On Thu, Jul 10, 2014 at 3:50 PM, Josh Berkus wrote: > > On 07/10/2014 12:20 PM, Alvaro Herrera wrote: > >>> So I guess the only thing left is to issue a NOTICE when said alter > >>> > takes place (I don't see that on the patch, but maybe it's there?) > >> That's not in the patch. I don't think we have an appropriate place to > >> emit such a notice. > > > > What do you mean by "don't have an appropriate place"? > > > > The suggestion is that when a user does: > > > > ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100 > > > > they should get a NOTICE: > > > > "NOTICE: changes to pages per range will not take effect until the index > > is REINDEXed" > > > > otherwise, we're going to get a lot of "I Altered the pages per range, > > but performance didn't change" emails. > > > > How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or > from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ? > > we don't get a notice for these cases either > I think those are different. They don't rewrite existing data in the table, but they are applied to new (and updated) data. My understanding is that changing PAGES_PER_RANGE will have no effect on future data until a re-index is done, even if the entire table eventually turns over. Cheers, Jeff
Re: [HACKERS] Minmax indexes
On 07/10/2014 02:30 PM, Jaime Casanova wrote: > How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or > from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ? > > we don't get a notice for these cases either Good idea. We should also emit notices for those. Well, maybe not for fillfactor. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Josh Berkus wrote: > On 07/10/2014 12:20 PM, Alvaro Herrera wrote: > >> So I guess the only thing left is to issue a NOTICE when said alter > >> > takes place (I don't see that on the patch, but maybe it's there?) > > That's not in the patch. I don't think we have an appropriate place to > > emit such a notice. > > What do you mean by "don't have an appropriate place"? What I think should happen is that if the value is changed, the index should be rebuilt right there. But there is no way to have this occur from the generic tablecmds.c code. Maybe we should extend the AM interface so that AMs are notified of changes and can take action. Inserting AM-specific code into tablecmds.c seems pretty wrong to me -- existing stuff for WITH CHECK OPTION views notwithstanding. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 3:50 PM, Josh Berkus wrote: > On 07/10/2014 12:20 PM, Alvaro Herrera wrote: >>> So I guess the only thing left is to issue a NOTICE when said alter >>> > takes place (I don't see that on the patch, but maybe it's there?) >> That's not in the patch. I don't think we have an appropriate place to >> emit such a notice. > > What do you mean by "don't have an appropriate place"? > > The suggestion is that when a user does: > > ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100 > > they should get a NOTICE: > > "NOTICE: changes to pages per range will not take effect until the index > is REINDEXed" > > otherwise, we're going to get a lot of "I Altered the pages per range, > but performance didn't change" emails. > How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ? we don't get a notice for these cases either -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 07/10/2014 12:20 PM, Alvaro Herrera wrote: >> So I guess the only thing left is to issue a NOTICE when said alter >> > takes place (I don't see that on the patch, but maybe it's there?) > That's not in the patch. I don't think we have an appropriate place to > emit such a notice. What do you mean by "don't have an appropriate place"? The suggestion is that when a user does: ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100 they should get a NOTICE: "NOTICE: changes to pages per range will not take effect until the index is REINDEXed" otherwise, we're going to get a lot of "I Altered the pages per range, but performance didn't change" emails. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Claudio Freire wrote: > On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera > wrote: > > Another thing I noticed is that version 8 of the patch blindly believed > > the "pages_per_range" declared in catalogs. This meant that if somebody > > did "alter index foo set pages_per_range=123" the index would > > immediately break (i.e. return corrupted results when queried). I have > > fixed this by storing the pages_per_range value used to construct the > > index in the metapage. Now if you do the ALTER INDEX thing, the new > > value is only used when the index is recreated by REINDEX. > > This seems a lot like parameterizing. I don't understand what that means -- care to elaborate? > So I guess the only thing left is to issue a NOTICE when said alter > takes place (I don't see that on the patch, but maybe it's there?) That's not in the patch. I don't think we have an appropriate place to emit such a notice. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera wrote: > Another thing I noticed is that version 8 of the patch blindly believed > the "pages_per_range" declared in catalogs. This meant that if somebody > did "alter index foo set pages_per_range=123" the index would > immediately break (i.e. return corrupted results when queried). I have > fixed this by storing the pages_per_range value used to construct the > index in the metapage. Now if you do the ALTER INDEX thing, the new > value is only used when the index is recreated by REINDEX. This seems a lot like parameterizing. So I guess the only thing left is to issue a NOTICE when said alter takes place (I don't see that on the patch, but maybe it's there?) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Jul 9, 2014 at 10:16 PM, Alvaro Herrera wrote: > there is no hardcoded assumption on the number of index > values stored per heap column, so it is possible to build an opclass > that stores a bounding box column for a geometry heap column, for > instance. I think the more Postgresy thing to do is to store one datum per heap column. It's up to the opclass to find or make a composite data type that stores all the necessary state. So you could make a minmax_accum data type like NumericAggState in numeric.c:numeric_accum() or the array of floats in float8_accum. For a bounding box a 2d geometric min/max index could use the "box" data type for example. The way you've done it seems more convenient but there's something to be said for using the same style for different areas. A single bounding box accumulator function would probably suffice for both an aggregate and index opclass for example. But this sounds pretty great. I think it would let me do the bloom filter index I had in mind fairly straightforwardly. The result would be something very similar to a bitmap index. I'm not sure if there's a generic term that includes bitmap indexes or other summary functions like bounding boxes (which min/max is basically -- a 1D bounding box). Thanks a lot for listening and being so open, I think what you describe is a lot more flexible than what you had before and I can see some pretty great things coming out of it (including min/max itself of course). -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
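To illustrate the composite-state idea: a 2D geometric opclass whose summary is a plain box would "add" a point by stretching the box. The following is a sketch against the BOX/Point layout in PostgreSQL's utils/geo_decls.h, not proposed code:

    #include "utils/geo_decls.h"    /* BOX and Point */

    /* stretch the stored bounding box to cover one more heap point */
    static void
    bbox_add_point(BOX *summary, const Point *pt)
    {
        summary->low.x  = Min(summary->low.x,  pt->x);
        summary->low.y  = Min(summary->low.y,  pt->y);
        summary->high.x = Max(summary->high.x, pt->x);
        summary->high.y = Max(summary->high.y, pt->y);
    }

The same step works as an aggregate transition function and as the index "union" step, which is the appeal of keeping one datum per heap column.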
Re: [HACKERS] Minmax indexes
Josh Berkus wrote: > On 07/09/2014 02:16 PM, Alvaro Herrera wrote: > > The way it works now, each opclass needs to have three support > > procedures; I've called them getOpers, maybeUpdateValues, and compare. > > (I realize these names are pretty bad, and will be changing them.) > > I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update > the values, maybe not." ;-) :-) Well, that's exactly what happens. If we insert a new tuple into the table, and the existing summarizing tuple (to use Peter's term) already covers it, then we don't need to update the index tuple at all. What this name doesn't say is what values are to be maybe-updated. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Jul 9, 2014 at 2:16 PM, Alvaro Herrera wrote: > All this being said, I'm sticking to the name "Minmax indexes". There > was a poll in pgsql-advocacy > http://www.postgresql.org/message-id/53a0b4f8.8080...@agliodbs.com > about a new name, but there were no suggestions supported by more than > one person. If a brilliant new name comes up, I'm open to changing it. How about "summarizing indexes"? That seems reasonably descriptive. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 07/09/2014 02:16 PM, Alvaro Herrera wrote: > The way it works now, each opclass needs to have three support > procedures; I've called them getOpers, maybeUpdateValues, and compare. > (I realize these names are pretty bad, and will be changing them.) I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update the values, maybe not." ;-) -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Heikki Linnakangas wrote: > On 06/23/2014 08:07 PM, Alvaro Herrera wrote: > I feel that the below would nevertheless be simpler: > > >>I wonder if it would be simpler to just always store the revmap > >>pages in the beginning of the index, before any other pages. Finding > >>the revmap page would then be just as easy as with a separate fork. > >>When the table/index is extended so that a new revmap page is > >>needed, move the existing page at that block out of the way. Locking > >>needs some consideration, but I think it would be feasible and > >>simpler than you have now. > > > >Moving index items around is not easy, because you'd have to adjust the > >revmap to rewrite the item pointers. > > Hmm. Two alternative schemes come to mind: > > 1. Move each index tuple off the page individually, updating the > revmap while you do it, until the page is empty. Updating the revmap > for a single index tuple isn't difficult; you have to do it anyway > when an index tuple is replaced. (MMTuples don't contain a heap > block number ATM, but IMHO they should, see below) > > 2. Store the new block number of the page that you moved out of the > way in the revmap page, and leave the revmap pointers unchanged. The > revmap pointers can be updated later, lazily. > > Both of those seem pretty straightforward. The trouble I have with moving blocks around to make space is that it would cause the index to have periodic hiccups to make room for the new revmap pages. One nice property that these indexes are supposed to have is that the effect on insertion times should be pretty minimal. That would cease to be the case if we have to do your proposed block moves. > >>ISTM that when the old tuple cannot be updated in-place, the new > >>index tuple is inserted with mm_doinsert(), but the old tuple is > >>never deleted. > > > >It's deleted by the next vacuum. > > Ah I see. Vacuum reads the whole index, and builds an in-memory hash > table that contains an ItemPointerData for every tuple in the index. > Doesn't that require a lot of memory, for a large index? That might > be acceptable - you ought to have plenty of RAM if you're pushing > around multi-terabyte tables - but it would nevertheless be nice to > not have a hard requirement for something as essential as vacuum. I guess if you're expecting that pages_per_range=1 is a common case, yeah it might become an issue eventually. One idea I just had is to have a bit for each index tuple, which is set whenever the revmap no longer points to it. That way, vacuuming is much easier: just scan the index and delete all tuples having that bit set. No need for this hash table stuff. I am still concerned about adding more overhead whenever a page range is modified, so that insertions in the table continue to be fast. If we're going to dirty the index every time, it might not be so fast anymore. But then maybe I'm worrying about nothing; I will have to measure how much slower it is. > Wouldn't it be simpler to remove the old tuple atomically with > inserting the new tuple and updating the revmap? Or at least mark > the old tuple as deletable, so that vacuum can just delete it, > without building the large hash table to determine that it's > deletable. Yes, it might be simpler, but it'd require dirtying more pages on insertions (and holding more page-level locks, for longer. Not good for concurrent access). > I'm quite surprised by the use of LockTuple on the index tuples. 
I think the main reason for needing that is the fact that MMTuple doesn't store the heap (range) block number that the tuple points to: LockTuple is required to ensure that the tuple doesn't go away while a scan is following a pointer from the revmap to it. If the MMTuple contained the BlockNumber, a scan could check that and go back to the revmap if it doesn't match. Alternatively, you could keep the revmap page locked when you follow a pointer to the regular index page. There's the intention that these accesses be kept as concurrent as possible; this is why we don't want to block the whole page. Locking individual TIDs is fine in this case (unlike in SELECT FOR UPDATE) because we can only lock a single tuple in any one index scan, so there's no unbounded growth of the lock table. I prefer not to have BlockNumbers in index tuples, because that would make them larger for not much gain. That data would mostly be redundant, and would be necessary only for vacuuming. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
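To spell out the stale-bit scheme floated above: vacuum becomes one sequential pass with no hash table. In the sketch below, MMTUPLE_STALE, the flags field, and the single-offset PageIndexDeleteNoCompact() signature are all assumptions for illustration, and WAL-logging is omitted:

    BlockNumber blk;

    for (blk = first_data_block; blk < nblocks; blk++)
    {
        Buffer       buf = ReadBuffer(index, blk);
        Page         page;
        OffsetNumber off, maxoff;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(buf);
        maxoff = PageGetMaxOffsetNumber(page);
        for (off = FirstOffsetNumber; off <= maxoff; off++)
        {
            MMTuple *tup = (MMTuple *)
                PageGetItem(page, PageGetItemId(page, off));

            /* bit was set when the revmap was repointed elsewhere */
            if (tup->flags & MMTUPLE_STALE)
                PageIndexDeleteNoCompact(page, &off, 1);
        }
        UnlockReleaseBuffer(buf);
    }

The cost moves to the insertion path, since setting the bit dirties one more page, which is exactly the trade-off being debated here.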
Re: [HACKERS] Minmax indexes
On Thu, Jun 19, 2014 at 12:32 PM, Greg Stark wrote: > On Wed, Jun 18, 2014 at 4:51 AM, Heikki Linnakangas > wrote: >> Implementing something is a good way to demonstrate how it would look like. >> But no, I don't insist on implementing every possible design whenever a new >> feature is proposed. >> >> I liked Greg's sketch of what the opclass support functions would be. It >> doesn't seem significantly more complicated than what's in the patch now. > > As a counter-point to my own point there will be nothing stopping us > in the future from generalizing things. Dealing with catalogs is > mostly book-keeping headaches and careful work. it's something that > might be well-suited for a GSOC or first patch from someone looking to > familiarize themselves with the system architecture. It's hard to > invent a whole new underlying infrastructure at the same time as > dealing with all that book-keeping and it's hard for someone > familiarizing themselves with the system to also have a great new > idea. Having tasks like this that are easy to explain and that mentor > understands well can be easier to manage than tasks where the newcomer > has some radical new idea. Generalizing this in the future would be highly likely to change the on-disk format for existing indexes, which would be a problem for pg_upgrade. I think we will likely be stuck with whatever the initial on-disk format looks like for a very long time, which is why I think we need to try rather hard to get this right the first time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 06/23/2014 08:07 PM, Alvaro Herrera wrote: Heikki Linnakangas wrote: With 8k block size, that's just enough to cover the full range of 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each regular revmap page can store about 8192/6 = 1365 item pointers, each array revmap page can store about 8192/4 = 2048 block references, and the size of the top array is 8192/4. That's just enough; to store the required number of array pages in the top array, the array needs to be (2^32/1365)/2048 = 1536 elements large. But with 4k or smaller blocks, it's not enough. Yeah. As I said elsewhere, actual useful values are likely to be close to the read-ahead setting of the underlying disk; by default that'd be 16 pages (128kB), but I think it's common advice to increase the kernel setting to improve performance. My gut feeling is that it might well be best to set pages_per_range=1. Even if you do the same amount of I/O, thanks to kernel read-ahead, you might still avoid processing a lot of tuples. But I would need to see some benchmarks to know. I don't think we need to prevent minmax indexes with pages_per_range=1, but I don't think we need to ensure that that setting works with the largest tables, either, because it doesn't make any sense to set it up like that. Also, while there are some recommendations to set up a system with larger page sizes (32kB), I have never seen any recommendation to set it lower. It wouldn't make sense to build a system that has very large tables and use a smaller page size. So in other words, yes, you're correct that the mechanism doesn't work in some cases (small page size and index configured for highest level of detail), but the conditions are such that I don't think it matters. ISTM the thing to do here is to do the math at index creation time, and if we find that we don't have enough space in the metapage for all array revmap pointers we need, bail out and require the index to be created with a larger pages_per_range setting. Yeah, I agree that would be acceptable. I feel that the below would nevertheless be simpler: I wonder if it would be simpler to just always store the revmap pages in the beginning of the index, before any other pages. Finding the revmap page would then be just as easy as with a separate fork. When the table/index is extended so that a new revmap page is needed, move the existing page at that block out of the way. Locking needs some consideration, but I think it would be feasible and simpler than you have now. Moving index items around is not easy, because you'd have to adjust the revmap to rewrite the item pointers. Hmm. Two alternative schemes come to mind: 1. Move each index tuple off the page individually, updating the revmap while you do it, until the page is empty. Updating the revmap for a single index tuple isn't difficult; you have to do it anyway when an index tuple is replaced. (MMTuples don't contain a heap block number ATM, but IMHO they should, see below) 2. Store the new block number of the page that you moved out of the way in the revmap page, and leave the revmap pointers unchanged. The revmap pointers can be updated later, lazily. Both of those seem pretty straightforward. I have followed the suggestion by Amit to overwrite the index tuple when a new heap tuple is inserted, instead of creating a separate index tuple. This saves a lot of index bloat. This required a new entry point in bufpage.c, PageOverwriteItemData(). 
bufpage.c also has a new function PageIndexDeleteNoCompact which is similar in spirit to PageIndexMultiDelete except that item pointers do not change. This is necessary because the revmap stores item pointers, and such references would break if we were to renumber items in index pages. ISTM that when the old tuple cannot be updated in-place, the new index tuple is inserted with mm_doinsert(), but the old tuple is never deleted. It's deleted by the next vacuum. Ah I see. Vacuum reads the whole index, and builds an in-memory hash table that contains an ItemPointerData for every tuple in the index. Doesn't that require a lot of memory, for a large index? That might be acceptable - you ought to have plenty of RAM if you're pushing around multi-terabyte tables - but it would nevertheless be nice to not have a hard requirement for something as essential as vacuum. In addition to the hash table, remove_deletable_tuples() pallocs an array to hold an ItemPointer for every index tuple about to be removed. A single palloc is limited to 1GB, so that will fail outright if there are more than ~179 million dead index tuples. You're unlikely to hit that in practice, but if you do, you'll never be able to vacuum the index. So that's not very nice. Wouldn't it be simpler to remove the old tuple atomically with inserting the new tuple and updating the revmap? Or at least mark the old tuple as deletable, so that vacuum can just delete it, without building the large hash table to determine that it's deletable.
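For reference, the ~179 million figure falls out of the allocator cap: a single palloc() is limited to MaxAllocSize (one byte short of 1GB), and each dead index tuple costs one 6-byte ItemPointerData:

    (1024 * 1024 * 1024 - 1) / 6 = 178,956,970 dead index tuples

At pages_per_range=1 with 8k blocks, that corresponds to roughly 1.4TB of turned-over heap; a corner case, but an index that can never be vacuumed is a bad failure mode.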
Re: [HACKERS] Minmax indexes
Heikki Linnakangas wrote: > Some comments, aside from the design wrt. bounding boxes etc. : Thanks. I haven't commented on that sub-thread because I think it's possible to come up with a reasonable design that solves the issue by adding a couple of amprocs. I need to do some more thinking to ensure it is really workable, and then I'll post my ideas. > On 06/15/2014 05:34 AM, Alvaro Herrera wrote: > >Robert Haas wrote: > If I understand the code correctly, the revmap is a three-level deep > structure. The bottom level consists of "regular revmap pages", and > each regular revmap page is filled with ItemPointerDatas, which > point to the index tuples. The middle level consists of "array > revmap pages", and each array revmap page contains an array of > BlockNumbers of the "regular revmap" pages. The top level is an > array of BlockNumbers of the array revmap pages, and it is stored in > the metapage. Yep, that's correct. Essentially, we still have the revmap as a linear space (containing TIDs); the other two levels on top of that are only there to enable locating the physical page numbers for each revmap logical page. I make one exception: the first logical revmap page is always stored in page 1, to optimize the case of a smallish table (~1360 page ranges; approximately 1.3 gigabytes of data at 128 pages per range, or 170 megabytes at 16 pages per range.) Each page has a page header (24 bytes) and special space (4 bytes), so it has 8192-28=8164 bytes available for data, so 1360 item pointers. > With 8k block size, that's just enough to cover the full range of > 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each > regular revmap page can store about 8192/6 = 1365 item pointers, > each array revmap page can store about 8192/4 = 2048 block > references, and the size of the top array is 8192/4. That's just > enough; to store the required number of array pages in the top > array, the array needs to be (2^32/1365)/2048 = 1536 elements large. > > But with 4k or smaller blocks, it's not enough. Yeah. As I said elsewhere, actual useful values are likely to be close to the read-ahead setting of the underlying disk; by default that'd be 16 pages (128kB), but I think it's common advice to increase the kernel setting to improve performance. I don't think we need to prevent minmax indexes with pages_per_range=1, but I don't think we need to ensure that that setting works with the largest tables, either, because it doesn't make any sense to set it up like that. Also, while there are some recommendations to set up a system with larger page sizes (32kB), I have never seen any recommendation to set it lower. It wouldn't make sense to build a system that has very large tables and use a smaller page size. So in other words, yes, you're correct that the mechanism doesn't work in some cases (small page size and index configured for highest level of detail), but the conditions are such that I don't think it matters. ISTM the thing to do here is to do the math at index creation time, and if we find that we don't have enough space in the metapage for all array revmap pointers we need, bail out and require the index to be created with a larger pages_per_range setting. > I wonder if it would be simpler to just always store the revmap > pages in the beginning of the index, before any other pages. Finding > the revmap page would then be just as easy as with a separate fork. 
> When the table/index is extended so that a new revmap page is > needed, move the existing page at that block out of the way. Locking > needs some consideration, but I think it would be feasible and > simpler than you have now. Moving index items around is not easy, because you'd have to adjust the revmap to rewrite the item pointers. > >I have followed the suggestion by Amit to overwrite the index tuple when > >a new heap tuple is inserted, instead of creating a separate index > >tuple. This saves a lot of index bloat. This required a new entry > >point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new > >function PageIndexDeleteNoCompact which is similar in spirit to > >PageIndexMultiDelete except that item pointers do not change. This is > >necessary because the revmap stores item pointers, and such references > >would break if we were to renumber items in index pages. > > ISTM that when the old tuple cannot be updated in-place, the new > index tuple is inserted with mm_doinsert(), but the old tuple is > never deleted. It's deleted by the next vacuum. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
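The logical-to-physical translation described in this exchange boils down to integer arithmetic plus one extra page read once the index outgrows block 1. A sketch with illustrative names only, using the per-page capacities quoted in the mail:

    #define REGULAR_REVMAP_ITEMS  1365   /* TIDs per regular revmap page */
    #define ARRAY_REVMAP_ITEMS    2048   /* block refs per array revmap page */

    /* find the physical block holding the revmap entry for heapBlk */
    static BlockNumber
    revmap_physical_block(BlockNumber *top_array, BlockNumber heapBlk,
                          int pages_per_range)
    {
        uint32  rangeno  = heapBlk / pages_per_range;
        uint32  logblk   = rangeno / REGULAR_REVMAP_ITEMS; /* logical page */
        uint32  arrayno  = logblk / ARRAY_REVMAP_ITEMS;
        uint32  arrayoff = logblk % ARRAY_REVMAP_ITEMS;

        if (logblk == 0)
            return 1;       /* first logical revmap page lives in block 1 */

        /* top_array comes from the metapage; read_array_page_entry() is
         * an invented helper that reads the array page and returns the
         * BlockNumber stored at the given offset */
        return read_array_page_entry(top_array[arrayno], arrayoff);
    }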
Re: [HACKERS] Minmax indexes
Some comments, aside from the design wrt. bounding boxes etc. : On 06/15/2014 05:34 AM, Alvaro Herrera wrote: Robert Haas wrote: On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera wrote: Here's an updated version of this patch, with fixes to all the bugs reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and Amit Kapila for the reports. I'm not very happy with the use of a separate relation fork for storing this data. Here's a new version of this patch. Now the revmap is not stored in a separate fork, but together with all the regular data, as explained elsewhere in the thread. Thanks! Please update the README accordingly. If I understand the code correctly, the revmap is a three-level deep structure. The bottom level consists of "regular revmap pages", and each regular revmap page is filled with ItemPointerDatas, which point to the index tuples. The middle level consists of "array revmap pages", and each array revmap page contains an array of BlockNumbers of the "regular revmap" pages. The top level is an array of BlockNumbers of the array revmap pages, and it is stored in the metapage. With 8k block size, that's just enough to cover the full range of 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each regular revmap page can store about 8192/6 = 1365 item pointers, each array revmap page can store about 8192/4 = 2048 block references, and the size of the top array is 8192/4. That's just enough; to store the required number of array pages in the top array, the array needs to be (2^32/1365)/2048 = 1536 elements large. But with 4k or smaller blocks, it's not enough. I wonder if it would be simpler to just always store the revmap pages in the beginning of the index, before any other pages. Finding the revmap page would then be just as easy as with a separate fork. When the table/index is extended so that a new revmap page is needed, move the existing page at that block out of the way. Locking needs some consideration, but I think it would be feasible and simpler than you have now. I have followed the suggestion by Amit to overwrite the index tuple when a new heap tuple is inserted, instead of creating a separate index tuple. This saves a lot of index bloat. This required a new entry point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new function PageIndexDeleteNoCompact which is similar in spirit to PageIndexMultiDelete except that item pointers do not change. This is necessary because the revmap stores item pointers, and such references would break if we were to renumber items in index pages. ISTM that when the old tuple cannot be updated in-place, the new index tuple is inserted with mm_doinsert(), but the old tuple is never deleted. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
I'm sorry if I missed something, but ISTM this is beginning to look a lot like GiST. This was pointed out by Robert Haas last year. On Wed, Jun 18, 2014 at 12:09:42PM -0300, Claudio Freire wrote: > So, you have: > > An aggregate to generate a "compressed set" from several values Which GiST does by calling 'compress' on each value, and then 'unions' the results together. > A function which adds a new value to the "compressed set" and returns > the new "compressed set" Again, 'compress' + 'union' > A function which tests if a value is in a "compressed set" Which GiST does using 'compress' + 'consistent' > A function which tests if a "compressed set" overlaps another > "compressed set" of equal type Which GiST calls 'consistent' So I'm wondering why you can't just reuse the btree_gist functions we already have in contrib. It seems to me that these MinMax indexes are in fact a variation on GiST that indexes the pages of a table based upon the 'union' of all the elements in a page. By reusing the GiST operator class you get support for many datatypes for free. > If you can define different compressed sets, you can use this to > generate both min/max indexes as well as bloom filter indexes. Whether > we'd want to have both is perhaps questionable, but having the ability > to is probably desirable. You could implement a bloom filter in GiST too. It's been discussed before but I can't find any implementation. Probably because the filter needs to be parameterised and if you store the bloom filter for each element it gets expensive very quickly. However, hooked into a minmax structure which only indexes whole pages it could be quite efficient. > One problem with such a generalized implementation would be that I'm > not sure in-place modification of the "compressed set" on-disk can be > assumed to be safe in all cases. Surely, for strictly-enlarging sets > it would, but while min/max and bloom filters both fit the bill, it's > not clear that one can assume this for all structures. I think GiST has already solved this problem. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Minmax indexes
On Wed, Jun 18, 2014 at 8:51 AM, Heikki Linnakangas wrote: > Implementing something is a good way to demonstrate what it would look like. > But no, I don't insist on implementing every possible design whenever a new > feature is proposed. > > I liked Greg's sketch of what the opclass support functions would be. It > doesn't seem significantly more complicated than what's in the patch now. Which was: On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark wrote: > An aggregate to generate a min/max "bounding box" from several values > A function which takes a "bounding box" and a new value and returns > the new "bounding box" > A function which tests if a value is in a "bounding box" > A function which tests if a "bounding box" overlaps a "bounding box" Which I'd generalize a bit further by renaming "bounding box" to "compressed set", and allow it to be parameterized. So, you have: An aggregate to generate a "compressed set" from several values A function which adds a new value to the "compressed set" and returns the new "compressed set" A function which tests if a value is in a "compressed set" A function which tests if a "compressed set" overlaps another "compressed set" of equal type If you can define different compressed sets, you can use this to generate both min/max indexes as well as bloom filter indexes. Whether we'd want to have both is perhaps questionable, but having the ability to is probably desirable. One problem with such a generalized implementation would be that I'm not sure in-place modification of the "compressed set" on-disk can be assumed to be safe in all cases. Surely, for strictly-enlarging sets it would, but while min/max and bloom filters both fit the bill, it's not clear that one can assume this for all structures. Also, adding an "in-place updateable" bit to the "type" would perhaps inflate the complexity of the patch due to the need to provide both code paths? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
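Claudio's four operations map naturally onto a support-function table. A sketch with invented names only; the aggregate in the first bullet is just make_empty() followed by repeated add_value():

    typedef struct CompressedSetOps
    {
        /* build an empty summary for a fresh page range */
        Datum   (*make_empty) (void);
        /* fold a new heap value into the summary; true if it changed */
        bool    (*add_value) (Datum *set, Datum newval);
        /* may the summary contain this value? (no false negatives) */
        bool    (*maybe_contains) (Datum set, Datum value);
        /* may two summaries of the same type intersect? */
        bool    (*maybe_overlaps) (Datum a, Datum b);
    } CompressedSetOps;

A min/max opclass would fill these slots with interval comparisons, a bloom opclass with bit operations; the core index code need not know which.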
Re: [HACKERS] Minmax indexes
On Thu, Jun 19, 2014 at 10:06 AM, Heikki Linnakangas wrote: > On 06/18/2014 06:09 PM, Claudio Freire wrote: >> >> On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark wrote: >>> >>> An aggregate to generate a min/max "bounding box" from several values >>> A function which takes a "bounding box" and a new value and returns >>> the new "bounding box" >>> A function which tests if a value is in a "bounding box" >>> A function which tests if a "bounding box" overlaps a "bounding box" >> >> >> Which I'd generalize a bit further by renaming "bounding box" to >> "compressed set", and allow it to be parameterized. > > > What do you mean by parameterized? Bloom filters can be paired with number of hashes, number of bit positions, and hash function, so it's not a simple bloom filter index, but a bloom filter index with N SHA-1-based hashes spread on a K-length bitmap. >> So, you have: >> >> An aggregate to generate a "compressed set" from several values >> A function which adds a new value to the "compressed set" and returns >> the new "compressed set" >> A function which tests if a value is in a "compressed set" >> A function which tests if a "compressed set" overlaps another >> "compressed set" of equal type > > > Yeah, something like that. I'm not sure I like the "compressed set" term any > more than bounding box, though. GiST seems to have avoided naming the thing, > and just talks about "index entries". But if we can come up with a good > name, that would be more clear. I don't want to use the term bloom filter since it's too specific to one particular technique, but it's basically that - an approximate set without false negatives. Ie: compressed set. It's not a bounding box either when using bloom filters. So... >> One problem with such a generalized implementation would be that I'm >> not sure in-place modification of the "compressed set" on-disk can be >> assumed to be safe in all cases. Surely, for strictly-enlarging sets >> it would, but while min/max and bloom filters both fit the bill, it's >> not clear that one can assume this for all structures. > > > I don't understand what you mean. It's a fundamental property of minmax > indexes that you can always replace the "min" or "max" or "compressed set" > or "bounding box" or whatever with another datum that represents all the > keys that the old one did, plus some. Yes, and bloom filters happen to fall in that category too. Never mind what I said. I was thinking of some other hypothetical implementation that supports removal or updates, which might need care with transaction lifetimes, but that's easily fixed by letting vacuum or some lazy process do the deleting just as it happens with other indexes anyway. So, I guess the interface must also include the invariant that compressed sets only grow, never shrink unless from a rebuild or a vacuum operation. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
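For the record, the parameterized shape being described is small. A self-contained sketch in plain C, using double hashing (Kirsch-Mitzenmacher) in place of N independent hash functions; slicing a SHA-1 digest into two 32-bit words would serve as h1/h2:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct BloomSet
    {
        uint32_t nbits;      /* K: bitmap length, from index reloptions */
        uint32_t nhashes;    /* N: probes per value, from index reloptions */
        uint8_t  bits[];     /* nbits/8 bytes follow the header */
    } BloomSet;

    static void
    bloom_add(BloomSet *b, uint32_t h1, uint32_t h2)
    {
        h2 |= 1;                    /* keep the probe step odd and non-zero */
        for (uint32_t i = 0; i < b->nhashes; i++)
        {
            uint32_t pos = (h1 + i * h2) % b->nbits;
            b->bits[pos / 8] |= (uint8_t) (1u << (pos % 8));
        }
    }

    static bool
    bloom_maybe_contains(const BloomSet *b, uint32_t h1, uint32_t h2)
    {
        h2 |= 1;
        for (uint32_t i = 0; i < b->nhashes; i++)
        {
            uint32_t pos = (h1 + i * h2) % b->nbits;
            if (!(b->bits[pos / 8] & (1u << (pos % 8))))
                return false;       /* definitely not in the set */
        }
        return true;                /* possibly in the set */
    }

Note that bits are only ever set, never cleared, which is precisely the strictly-enlarging property discussed here: an in-place on-disk update can never un-summarize a value.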
Re: [HACKERS] Minmax indexes
On Wed, Jun 18, 2014 at 4:51 AM, Heikki Linnakangas wrote: > Implementing something is a good way to demonstrate what it would look like. > But no, I don't insist on implementing every possible design whenever a new > feature is proposed. > > I liked Greg's sketch of what the opclass support functions would be. It > doesn't seem significantly more complicated than what's in the patch now. As a counter-point to my own point, there will be nothing stopping us in the future from generalizing things. Dealing with catalogs is mostly book-keeping headaches and careful work. It's something that might be well-suited for a GSOC or first patch from someone looking to familiarize themselves with the system architecture. It's hard to invent a whole new underlying infrastructure at the same time as dealing with all that book-keeping and it's hard for someone familiarizing themselves with the system to also have a great new idea. Having tasks like this that are easy to explain and that a mentor understands well can be easier to manage than tasks where the newcomer has some radical new idea. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Vik Fearing writes: > On 06/18/2014 12:46 PM, Andres Freund wrote: >> So to implement a feature one now has to implement the most generic >> variant as a prototype first? Really? > Well, there is the inventor's paradox to consider. I have not seen anyone demanding a different implementation in this thread. What *has* been asked for, and not supplied, is a concrete defense of the particular level of generality that's been selected in this implementation. It's not at all clear to the rest of us whether it was the right choice, and that is something that ought to be asked now not later. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 06/18/2014 06:09 PM, Claudio Freire wrote: On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark wrote: An aggregate to generate a min/max "bounding box" from several values A function which takes a "bounding box" and a new value and returns the new "bounding box" A function which tests if a value is in a "bounding box" A function which tests if a "bounding box" overlaps a "bounding box" Which I'd generalize a bit further by renaming "bounding box" to "compressed set", and allow it to be parameterized. What do you mean by parameterized? So, you have: An aggregate to generate a "compressed set" from several values A function which adds a new value to the "compressed set" and returns the new "compressed set" A function which tests if a value is in a "compressed set" A function which tests if a "compressed set" overlaps another "compressed set" of equal type Yeah, something like that. I'm not sure I like the "compressed set" term any more than bounding box, though. GiST seems to have avoided naming the thing, and just talks about "index entries". But if we can come up with a good name, that would be more clear. One problem with such a generalized implementation would be that I'm not sure in-place modification of the "compressed set" on-disk can be assumed to be safe in all cases. Surely, for strictly-enlarging sets it would, but while min/max and bloom filters both fit the bill, it's not clear that one can assume this for all structures. I don't understand what you mean. It's a fundamental property of minmax indexes that you can always replace the "min" or "max" or "compressed set" or "bounding box" or whatever with another datum that represents all the keys that the old one did, plus some. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 06/18/2014 12:46 PM, Andres Freund wrote: >>> Isn't 'simpler implementation' a valid reason that's already been >>> discussed onlist? Obviously simpler implementation doesn't trump >>> everything, but it's one valid reason... >>> Note that I have zap to do with the design of this feature. I work for >>> the same company as Alvaro, but that's pretty much it... >> >> Without some analysis (e.g. implementing it and comparing), I don't buy that >> it makes the implementation simpler to restrict it in this way. Maybe it >> does, but often it's actually simpler to solve the general case. > > So to implement a feature one now has to implement the most generic > variant as a prototype first? Really? Well, there is the inventor's paradox to consider. -- Vik -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 2:16 PM, Andres Freund wrote: >> But I'm not trying to say that we absolutely have to support that kind >> of thing; what I am trying to say is that there should be a README or >> a mailing list post or some such that says: "We thought about how >> generic to make this. We considered A, B, and C. We rejected C as >> too narrow, and A because if we made it that general it would have >> greatly enlarged the disk footprint for the following reasons. >> Therefore we selected B." > > Isn't 'simpler implementation' a valid reason that's already been > discussed onlist? Obviously simpler implementation doesn't trump > everything, but it's one valid reason... > Note that I have zap to do with the design of this feature. I work for > the same company as Alvaro, but that's pretty much it... It really *hasn't* been discussed on-list. See these emails, discussing the same ideas, from 8 months ago: http://www.postgresql.org/message-id/5249b2d3.6030...@vmware.com http://www.postgresql.org/message-id/ca+tgmoyscbw-uc8lqv96szignxsuzayqbfdqmk-fgu22hdk...@mail.gmail.com Now, Alvaro did not respond to those emails, nor did anyone involved in the development of the feature. There may be an argument that implementing that would be too complicated, but Heikki said he didn't think it would be, and nobody's made a concrete argument as to why he's wrong (and Heikki knows a lot about indexing). >> Basically, I think Heikki asked a good >> question - which was "could we abstract this more?" - and I can't >> recall seeing a clear answer explaining why we could or couldn't and >> what the trade-offs would be. > > 'could we abstract more' imo is a pretty bad design guideline. It's 'is > there benefit in abstracting more'. Otherwise you end up with way too > complicated systems. On the flip side, if you don't abstract enough, you end up being able to cover only a small set of the relevant use cases, or else you end up with a bunch of poorly-coordinated tools to cover slightly different use cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 06/18/2014 01:46 PM, Andres Freund wrote: On 2014-06-18 12:18:26 +0300, Heikki Linnakangas wrote: The main problem with using it for geometric types is that you can't easily CLUSTER the table to make the minmax index effective again. But there are ways around that. Which are? Sure you can try stuff like recreating the table, sorting rows with bounding box areas above a threshold first, and then going on to sort by the top left corner of the bounding box. Right, something like that. Or cluster using some other column that correlates with the geometry, like a zip code. But that'll be neither builtin, nor convenient, nor perfect. In contrast to a normal CLUSTER for types with a btree opclass which will yield the perfect order. Sure. BTW, CLUSTERing by a geometric type would be useful anyway, even without minmax indexes. Isn't 'simpler implementation' a valid reason that's already been discussed onlist? Obviously simpler implementation doesn't trump everything, but it's one valid reason... Note that I have zap to do with the design of this feature. I work for the same company as Alvaro, but that's pretty much it... Without some analysis (e.g. implementing it and comparing), I don't buy that it makes the implementation simpler to restrict it in this way. Maybe it does, but often it's actually simpler to solve the general case. So to implement a feature one now has to implement the most generic variant as a prototype first? Really? Implementing something is a good way to demonstrate what it would look like. But no, I don't insist on implementing every possible design whenever a new feature is proposed. I liked Greg's sketch of what the opclass support functions would be. It doesn't seem significantly more complicated than what's in the patch now. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 2014-06-17 16:48:07 -0700, Greg Stark wrote: > On Tue, Jun 17, 2014 at 11:16 AM, Andres Freund > wrote: > > Well, it might be doable to correlate them along one axis, but along > > both? That's more complicated... And even along one axis you > > already get into problems if your geometries are irregularly sized. > > A single large polygon will completely destroy indexability for anything > > stored physically close by because suddenly the minmax range will be > > huge... So you'll need to cleverly sort for that as well. > > I think there's a misunderstanding here, possibly mine. My > understanding is that a min/max index will always be exactly the same > size for a given size table. It stores the minimum and maximum value > of the key for each page. Then you can do a bitmap scan by comparing > the search key with each page's minimum and maximum to see if that > page needs to be included in the scan. The failure mode is not that > the index is large but that a page that has an outlier will be > included in every scan as a false positive incurring an extra iop. I just rechecked, and no, it doesn't, by default, store a range for each page. It's MINMAX_DEFAULT_PAGES_PER_RANGE=128 pages by default... Haven't checked what's the lowest it can be set to. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
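For a sense of scale, a back-of-the-envelope calculation as a standalone C snippet. It assumes the default 8 kB page size and the 128-pages-per-range default mentioned above, plus Alvaro's estimate elsewhere in the thread of roughly 500 summary tuples per index page for a single 4-byte column; the 100 GB table size is just an example:

    #include <stdio.h>

    int
    main(void)
    {
        const long long blcksz = 8192;               /* default page size */
        const long long pages_per_range = 128;       /* MINMAX_DEFAULT_PAGES_PER_RANGE */
        const long long tuples_per_index_page = 500; /* Alvaro's figure, elsewhere in thread */
        long long   table_bytes = 100LL * 1024 * 1024 * 1024;  /* a 100 GB table */
        long long   heap_pages = table_bytes / blcksz;
        long long   ranges = (heap_pages + pages_per_range - 1) / pages_per_range;
        long long   index_pages = (ranges + tuples_per_index_page - 1) / tuples_per_index_page;

        /* ~13 million heap pages summarize into ~102,400 ranges,
         * i.e. roughly 205 index pages (~1.6 MB) for 100 GB of heap */
        printf("heap pages: %lld, ranges: %lld, index pages: ~%lld (~%lld kB)\n",
               heap_pages, ranges, index_pages, index_pages * blcksz / 1024);
        return 0;
    }

This is the "small coarse index" trade-off in numbers: roughly four orders of magnitude smaller than the heap, at the cost of scanning a whole 128-page range whenever one of its pages might match.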
Re: [HACKERS] Minmax indexes
On 2014-06-18 12:18:26 +0300, Heikki Linnakangas wrote: > On 06/17/2014 09:16 PM, Andres Freund wrote: > >Well, it might be doable to correlate them along one axis, but along > >both? That's more complicated... And even along one axis you > >already get into problems if your geometries are irregularly sized. > > Sure, there are cases where it would be useless. But it's easy to imagine > scenarios where it would work well, where points are loaded in clusters and > points that are close to each other also end up physically close to each > other. > >A single large polygon will completely destroy indexability for anything > >stored physically close by because suddenly the minmax range will be > >huge... So you'll need to cleverly sort for that as well. > > That's an inherent risk with minmax indexes: insert a few rows to the > "wrong" locations in the heap, and the selectivity of the index degrades > rapidly. Sure. But it's fairly normal to have natural clusteredness in many columns (surrogate keys, date-series type of data). Even if you insert geometric types in geographic clusters you'll have worse results because some bounding boxes will be big and such. And: > The main problem with using it for geometric types is that you can't easily > CLUSTER the table to make the minmax index effective again. But there are > ways around that. Which are? Sure you can try stuff like recreating the table, sorting rows with bounding box areas above a threshold first, and then going on to sort by the top left corner of the bounding box. But that'll be neither builtin, nor convenient, nor perfect. In contrast to a normal CLUSTER for types with a btree opclass which will yield the perfect order. > >Isn't 'simpler implementation' a valid reason that's already been > >discussed onlist? Obviously simpler implementation doesn't trump > >everything, but it's one valid reason... > >Note that I have zap to do with the design of this feature. I work for > >the same company as Alvaro, but that's pretty much it... > > Without some analysis (e.g. implementing it and comparing), I don't buy that > it makes the implementation simpler to restrict it in this way. Maybe it > does, but often it's actually simpler to solve the general case. So to implement a feature one now has to implement the most generic variant as a prototype first? Really? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 06/17/2014 09:16 PM, Andres Freund wrote: On 2014-06-17 12:14:00 -0400, Robert Haas wrote: On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund wrote: Well, I'm not the guy who does things with geometric data, but I don't want to ignore the significant percentage of our users who are. As you must surely know, the GIST implementations for geometric data types store bounding boxes on internal pages, and that seems to be useful to people. What is your reason for thinking that it would be any less useful in this context? For me minmax indexes are helpful because they allow generating *small* 'coarse' indexes over large volumes of data. From my pov that's possible because they don't contain item pointers for every contained row. That'll imo work well if there are consecutive rows in the table that can be summarized into one min/max range. That's quite likely to happen for common applications of a number of scalar datatypes. But the likelihood of placing sufficiently many rows with very similar bounding boxes close together seems much less relevant in practice. And I think that's generally likely for operations which can't be well represented as btree opclasses - the substructure implied inside a Datum will make correlation between consecutive rows less likely. Well, I don't know: suppose you're loading geospatial data showing the location of every building in some country. It might easily be the case that the data is or can be loaded in an order that provides pretty good spatial locality, leading to tight bounding boxes over physically consecutive data ranges. Well, it might be doable to correlate them along one axis, but along both? That's more complicated... And even along one axis you already get into problems if your geometries are irregularly sized. Sure, there are cases where it would be useless. But it's easy to imagine scenarios where it would work well, where points are loaded in clusters and points that are close to each other also end up physically close to each other. A single large polygon will completely destroy indexability for anything stored physically close by because suddenly the minmax range will be huge... So you'll need to cleverly sort for that as well. That's an inherent risk with minmax indexes: insert a few rows to the "wrong" locations in the heap, and the selectivity of the index degrades rapidly. The main problem with using it for geometric types is that you can't easily CLUSTER the table to make the minmax index effective again. But there are ways around that. But I'm not trying to say that we absolutely have to support that kind of thing; what I am trying to say is that there should be a README or a mailing list post or some such that says: "We thought about how generic to make this. We considered A, B, and C. We rejected C as too narrow, and A because if we made it that general it would have greatly enlarged the disk footprint for the following reasons. Therefore we selected B." Isn't 'simpler implementation' a valid reason that's already been discussed onlist? Obviously simpler implementation doesn't trump everything, but it's one valid reason... Note that I have zap to do with the design of this feature. I work for the same company as Alvaro, but that's pretty much it... Without some analysis (e.g. implementing it and comparing), I don't buy that it makes the implementation simpler to restrict it in this way. Maybe it does, but often it's actually simpler to solve the general case. 
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 11:16 AM, Andres Freund wrote: > Well, it might be doable to correlate them along one axis, but along > both? That's more complicated... And even along one axis you > already get into problems if your geometries are irregularly sized. > A single large polygon will completely destroy indexability for anything > stored physically close by because suddenly the minmax range will be > huge... So you'll need to cleverly sort for that as well. I think there's a misunderstanding here, possibly mine. My understanding is that a min/max index will always be exactly the same size for a given size table. It stores the minimum and maximum value of the key for each page. Then you can do a bitmap scan by comparing the search key with each page's minimum and maximum to see if that page needs to be included in the scan. The failure mode is not that the index is large but that a page that has an outlier will be included in every scan as a false positive incurring an extra iop. I don't think it's implausible at all that geometric data would work well. If you load geometric data it's very common to load data by geographic area so that all objects in San Francisco are in one part of the data load, probably even by zip code or census block. What operations would an opclass for min/max need? I think it would be pretty similar to the operators that GiST needs (thankfully minus the baroque page split function): An aggregate to generate a min/max "bounding box" from several values A function which takes a "bounding box" and a new value and returns the new "bounding box" A function which tests if a value is in a "bounding box" A function which tests if a "bounding box" overlaps a "bounding box" The nice thing is this would let us add things like range @> (contains element) to the plain integer min/max case. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
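For illustration only, Greg's four operations could be written down as a C API sketch along these lines; the SummaryBox type and all names are hypothetical, not from the patch, and the real thing would use PostgreSQL's fmgr conventions rather than bare function pointers:

    #include <stdbool.h>

    typedef struct SummaryBox SummaryBox;   /* opaque, opclass-defined */

    /* An aggregate to generate a "bounding box" from several values. */
    typedef SummaryBox *(*summary_build_fn) (const void **values, int nvalues);

    /* Fold one more value into an existing "bounding box", returning the new one. */
    typedef SummaryBox *(*summary_add_value_fn) (SummaryBox *box, const void *value);

    /* Test whether a value may be contained in the "bounding box". */
    typedef bool (*summary_contains_fn) (const SummaryBox *box, const void *value);

    /* Test whether two "bounding boxes" overlap. */
    typedef bool (*summary_overlaps_fn) (const SummaryBox *a, const SummaryBox *b);

Under this shape, the plain min/max opclass is just one instantiation: SummaryBox is a {min, max} pair, contains is a two-comparison check, and overlaps is interval intersection.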
Re: [HACKERS] Minmax indexes
On 06/17/2014 09:14 AM, Robert Haas wrote: > Well, I don't know: suppose you're loading geospatial data showing the > location of every building in some country. It might easily be the > case that the data is or can be loaded in an order that provides > pretty good spatial locality, leading to tight bounding boxes over > physically consecutive data ranges. I admin a production application which has exactly this. However, that application doesn't have big enough data to benefit from minmax indexes; it uses the basic spatial indexes. So, my $0.02: bounding box minmax falls under the heading of "would be nice to have, but not if it delays the feature". -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 1:04 PM, Andres Freund wrote: > For me minmax indexes are helpful because they allow generating *small* > 'coarse' indexes over large volumes of data. From my pov that's possible > because they don't contain item pointers for every contained > row. But minmax is just a specific form of bloom filter. This could certainly be generalized to a bloom filter index with some set of bloom & hashing operators (minmax being just one). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
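To make the generalization concrete, a minimal self-contained C sketch of a per-range bloom filter summary (the sizes, hash function, and names here are placeholders; a real opclass would use the datatype's hash opfamily). Note it has the same strictly-enlarging shape as min/max: adding a value only ever sets bits, so it can safely replace the old summary in place.

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOOM_BITS   1024   /* filter size per page range (placeholder) */
    #define BLOOM_HASHES 3      /* hash probes per value (placeholder) */

    typedef struct RangeBloom
    {
        uint8_t     bits[BLOOM_BITS / 8];
    } RangeBloom;

    /* Stand-in hash; a real opclass would use the datatype's hash function. */
    static uint32_t
    mix(uint32_t x, uint32_t seed)
    {
        x ^= seed + 0x9e3779b9u;
        x *= 2654435761u;       /* Knuth's multiplicative hash */
        x ^= x >> 16;
        return x;
    }

    /* The "add value" operation: only ever sets bits (strictly enlarging). */
    static void
    bloom_add(RangeBloom *b, uint32_t value)
    {
        for (uint32_t i = 0; i < BLOOM_HASHES; i++)
        {
            uint32_t    bit = mix(value, i) % BLOOM_BITS;

            b->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
        }
    }

    /* The "consistent" check: false means the range cannot contain value. */
    static bool
    bloom_might_contain(const RangeBloom *b, uint32_t value)
    {
        for (uint32_t i = 0; i < BLOOM_HASHES; i++)
        {
            uint32_t    bit = mix(value, i) % BLOOM_BITS;

            if ((b->bits[bit / 8] & (uint8_t) (1 << (bit % 8))) == 0)
                return false;
        }
        return true;
    }

Where min/max answers range queries (<, <=, =, >=, >), a bloom summary would answer only equality probes, but for data with no physical ordering it can still rule out most page ranges.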
Re: [HACKERS] Minmax indexes
On 2014-06-17 12:14:00 -0400, Robert Haas wrote: > On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund > wrote: > >> Well, I'm not the guy who does things with geometric data, but I don't > >> want to ignore the significant percentage of our users who are. As > >> you must surely know, the GIST implementations for geometric data > >> types store bounding boxes on internal pages, and that seems to be > >> useful to people. What is your reason for thinking that it would be > >> any less useful in this context? > > > > For me minmax indexes are helpful because they allow generating *small* > > 'coarse' indexes over large volumes of data. From my pov that's possible > > because they don't contain item pointers for every contained > > row. > > That'll imo work well if there are consecutive rows in the table that > > can be summarized into one min/max range. That's quite likely to happen > > for common applications of a number of scalar datatypes. But the > > likelihood of placing sufficiently many rows with very similar bounding > > boxes close together seems much less relevant in practice. And I think > > that's generally likely for operations which can't be well represented > > as btree opclasses - the substructure implied inside a Datum will > > make correlation between consecutive rows less likely. > > Well, I don't know: suppose you're loading geospatial data showing the > location of every building in some country. It might easily be the > case that the data is or can be loaded in an order that provides > pretty good spatial locality, leading to tight bounding boxes over > physically consecutive data ranges. Well, it might be doable to correlate them along one axis, but along both? That's more complicated... And even along one axis you already get into problems if your geometries are irregularly sized. A single large polygon will completely destroy indexability for anything stored physically close by because suddenly the minmax range will be huge... So you'll need to cleverly sort for that as well. I think hierarchical datastructures are so much better suited for this that there's little point trying to fit them into minmax. I can very well imagine that there's benefit in a gist support for only storing one pointer per block instead of one pointer per item or such. But it seems like a separate feature. > But I'm not trying to say that we absolutely have to support that kind > of thing; what I am trying to say is that there should be a README or > a mailing list post or some such that says: "We thought about how > generic to make this. We considered A, B, and C. We rejected C as > too narrow, and A because if we made it that general it would have > greatly enlarged the disk footprint for the following reasons. > Therefore we selected B." Isn't 'simpler implementation' a valid reason that's already been discussed onlist? Obviously simpler implementation doesn't trump everything, but it's one valid reason... Note that I have zap to do with the design of this feature. I work for the same company as Alvaro, but that's pretty much it... > Basically, I think Heikki asked a good > question - which was "could we abstract this more?" - and I can't > recall seeing a clear answer explaining why we could or couldn't and > what the trade-offs would be. 'could we abstract more' imo is a pretty bad design guideline. It's 'is there benefit in abstracting more'. Otherwise you end up with way too complicated systems. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund wrote: >> Well, I'm not the guy who does things with geometric data, but I don't >> want to ignore the significant percentage of our users who are. As >> you must surely know, the GIST implementations for geometric data >> types store bounding boxes on internal pages, and that seems to be >> useful to people. What is your reason for thinking that it would be >> any less useful in this context? > > For me minmax indexes are helpful because they allow generating *small* > 'coarse' indexes over large volumes of data. From my pov that's possible > because they don't contain item pointers for every contained > row. > That'll imo work well if there are consecutive rows in the table that > can be summarized into one min/max range. That's quite likely to happen > for common applications of a number of scalar datatypes. But the > likelihood of placing sufficiently many rows with very similar bounding > boxes close together seems much less relevant in practice. And I think > that's generally likely for operations which can't be well represented > as btree opclasses - the substructure implied inside a Datum will > make correlation between consecutive rows less likely. Well, I don't know: suppose you're loading geospatial data showing the location of every building in some country. It might easily be the case that the data is or can be loaded in an order that provides pretty good spatial locality, leading to tight bounding boxes over physically consecutive data ranges. But I'm not trying to say that we absolutely have to support that kind of thing; what I am trying to say is that there should be a README or a mailing list post or some such that says: "We thought about how generic to make this. We considered A, B, and C. We rejected C as too narrow, and A because if we made it that general it would have greatly enlarged the disk footprint for the following reasons. Therefore we selected B." Basically, I think Heikki asked a good question - which was "could we abstract this more?" - and I can't recall seeing a clear answer explaining why we could or couldn't and what the trade-offs would be. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 2014-06-17 11:48:10 -0400, Robert Haas wrote: > On Tue, Jun 17, 2014 at 10:31 AM, Andres Freund > wrote: > > On 2014-06-17 10:26:11 -0400, Robert Haas wrote: > >> On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera > >> wrote: > >> > Robert Haas wrote: > >> >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > >> >> wrote: > >> >> > Here's an updated version of this patch, with fixes to all the bugs > >> >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers > >> >> > and > >> >> > Amit Kapila for the reports. > >> >> > >> >> I'm not very happy with the use of a separate relation fork for > >> >> storing this data. > >> > > >> > Here's a new version of this patch. Now the revmap is not stored in a > >> > separate fork, but together with all the regular data, as explained > >> > elsewhere in the thread. > >> > >> Cool. > >> > >> Have you thought more about this comment from Heikki? > >> > >> http://www.postgresql.org/message-id/52495dd3.9010...@vmware.com > > > > Is there actually a significant usecase behind that wish or just a > > general demand for being generic? To me it seems fairly unlikely you'd > > end up with something useful by doing a minmax index over bounding > > boxes. > > Well, I'm not the guy who does things with geometric data, but I don't > want to ignore the significant percentage of our users who are. As > you must surely know, the GIST implementations for geometric data > types store bounding boxes on internal pages, and that seems to be > useful to people. What is your reason for thinking that it would be > any less useful in this context? For me minmax indexes are helpful because they allow generating *small* 'coarse' indexes over large volumes of data. From my pov that's possible because they don't contain item pointers for every contained row. That'll imo work well if there are consecutive rows in the table that can be summarized into one min/max range. That's quite likely to happen for common applications of a number of scalar datatypes. But the likelihood of placing sufficiently many rows with very similar bounding boxes close together seems much less relevant in practice. And I think that's generally likely for operations which can't be well represented as btree opclasses - the substructure implied inside a Datum will make correlation between consecutive rows less likely. Maybe I've a major intuition failure here though... > I do also think that a general demand for being generic ought to carry > some weight. Agreed. It's always a balancing act. But it's not like this doesn't use a datatype abstraction concept... > We have gone to great lengths to make sure that our > indexing can handle more than just < and >, where a lot of other > products have not bothered. I think we have gotten a lot of mileage > out of that decision and feel that we shouldn't casually back away > from it. I don't see this as a case of backing away from that though? > we shouldn't accept a less-generic > approach blindly, without questioning whether it's possible to do better. But the aim shouldn't be to add genericity that's not going to be used, but to add it where it's somewhat likely to help... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 10:31 AM, Andres Freund wrote: > On 2014-06-17 10:26:11 -0400, Robert Haas wrote: >> On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera >> wrote: >> > Robert Haas wrote: >> >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera >> >> wrote: >> >> > Here's an updated version of this patch, with fixes to all the bugs >> >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and >> >> > Amit Kapila for the reports. >> >> >> >> I'm not very happy with the use of a separate relation fork for >> >> storing this data. >> > >> > Here's a new version of this patch. Now the revmap is not stored in a >> > separate fork, but together with all the regular data, as explained >> > elsewhere in the thread. >> >> Cool. >> >> Have you thought more about this comment from Heikki? >> >> http://www.postgresql.org/message-id/52495dd3.9010...@vmware.com > > Is there actually a significant usecase behind that wish or just a > general demand for being generic? To me it seems fairly unlikely you'd > end up with something useful by doing a minmax index over bounding > boxes. Well, I'm not the guy who does things with geometric data, but I don't want to ignore the significant percentage of our users who are. As you must surely know, the GIST implementations for geometric data types store bounding boxes on internal pages, and that seems to be useful to people. What is your reason for thinking that it would be any less useful in this context? I do also think that a general demand for being generic ought to carry some weight. We have gone to great lengths to make sure that our indexing can handle more than just < and >, where a lot of other products have not bothered. I think we have gotten a lot of mileage out of that decision and feel that we shouldn't casually back away from it. Obviously, we do already have some special-case optimizations and will likely have more in the future, and there can certainly be valid reasons for taking that approach. But it needs to be justified in some way; we shouldn't accept a less-generic approach blindly, without questioning whether it's possible to do better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 3:31 PM, Andres Freund wrote: > Is there actually a significant usecase behind that wish or just a > general demand for being generic? To me it seems fairly unlikely you'd > end up with something useful by doing a minmax index over bounding > boxes. Isn't min/max just a 2d bounding box? If you do a bulk data load of something like the census data then sure, every page will have data points for some geometrically clustered set of data. I had in mind to do a small bloom filter per block. In general any kind of predicate like bounding box should work. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 2014-06-17 10:26:11 -0400, Robert Haas wrote: > On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera > wrote: > > Robert Haas wrote: > >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > >> wrote: > >> > Here's an updated version of this patch, with fixes to all the bugs > >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and > >> > Amit Kapila for the reports. > >> > >> I'm not very happy with the use of a separate relation fork for > >> storing this data. > > > > Here's a new version of this patch. Now the revmap is not stored in a > > separate fork, but together with all the regular data, as explained > > elsewhere in the thread. > > Cool. > > Have you thought more about this comment from Heikki? > > http://www.postgresql.org/message-id/52495dd3.9010...@vmware.com Is there actually a significant usecase behind that wish or just a general demand for being generic? To me it seems fairly unlikely you'd end up with something useful by doing a minmax index over bounding boxes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera wrote: > Robert Haas wrote: >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera >> wrote: >> > Here's an updated version of this patch, with fixes to all the bugs >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and >> > Amit Kapila for the reports. >> >> I'm not very happy with the use of a separate relation fork for >> storing this data. > > Here's a new version of this patch. Now the revmap is not stored in a > separate fork, but together with all the regular data, as explained > elsewhere in the thread. Cool. Have you thought more about this comment from Heikki? http://www.postgresql.org/message-id/52495dd3.9010...@vmware.com I'm concerned that we could end up with one index type of this general nature for min/max type operations, and then another very similar index type for geometric operators or text-search operators or what have you. Considering the overhead in adding and maintaining an index AM, I think we should try to be sure that we've done a reasonably solid job making each one as general as we reasonably can. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Jan 24, 2014 at 12:58 PM, Claudio Freire wrote: > > What's the status? > > I believe I have more than a use for minmax indexes, and wouldn't mind > lending a hand if it's within my grasp. I'm also interested in looking at this. Mostly because I have ideas for other "summary" functions that would be interesting and could use the same infrastructure otherwise. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Jan 24, 2014 at 2:54 PM, Thom Brown wrote: > On 24 January 2014 17:53, Alvaro Herrera wrote: >> Thom Brown wrote: >>> On 8 November 2013 20:11, Alvaro Herrera wrote: >>> > Erik Rijkers wrote: >>> >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: >>> >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: >>> >> > >>> >> >> [minmax-5.patch] >>> >> > >>> >> > I have the impression it's not quite working correctly. >>> > >>> > Here's a version 7 of the patch, which fixes these bugs and adds >>> > opclasses for a bunch more types (timestamp, timestamptz, date, time, >>> > timetz), courtesy of Martín Marqués. It's also been rebased to apply >>> > cleanly on top of today's master branch. >>> > >>> > I have also added a selectivity function, but I'm not positive that it's >>> > very useful yet. >>> >>> This patch doesn't appear to have been submitted to any Commitfest. >>> Is this still a feature undergoing research then? >> >> It's still a planned feature, but I didn't have time to continue work >> for 2014-01. What's the status? I believe I have more than a use for minmax indexes, and wouldn't mind lending a hand if it's within my grasp. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 24 January 2014 17:53, Alvaro Herrera wrote: > Thom Brown wrote: >> On 8 November 2013 20:11, Alvaro Herrera wrote: >> > Erik Rijkers wrote: >> >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: >> >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: >> >> > >> >> >> [minmax-5.patch] >> >> > >> >> > I have the impression it's not quite working correctly. >> > >> > Here's a version 7 of the patch, which fixes these bugs and adds >> > opclasses for a bunch more types (timestamp, timestamptz, date, time, >> > timetz), courtesy of Martín Marqués. It's also been rebased to apply >> > cleanly on top of today's master branch. >> > >> > I have also added a selectivity function, but I'm not positive that it's >> > very useful yet. >> >> This patch doesn't appear to have been submitted to any Commitfest. >> Is this still a feature undergoing research then? > > It's still a planned feature, but I didn't have time to continue work > for 2014-01. Understood. Thanks -- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Thom Brown wrote: > On 8 November 2013 20:11, Alvaro Herrera wrote: > > Erik Rijkers wrote: > >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: > >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: > >> > > >> >> [minmax-5.patch] > >> > > >> > I have the impression it's not quite working correctly. > > > > Here's a version 7 of the patch, which fixes these bugs and adds > > opclasses for a bunch more types (timestamp, timestamptz, date, time, > > timetz), courtesy of Martín Marqués. It's also been rebased to apply > > cleanly on top of today's master branch. > > > > I have also added a selectivity function, but I'm not positive that it's > > very useful yet. > > This patch doesn't appear to have been submitted to any Commitfest. > Is this still a feature undergoing research then? It's still a planned feature, but I didn't have time to continue work for 2014-01. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 8 November 2013 20:11, Alvaro Herrera wrote: > Erik Rijkers wrote: >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: >> > >> >> [minmax-5.patch] >> > >> > I have the impression it's not quite working correctly. > > Here's a version 7 of the patch, which fixes these bugs and adds > opclasses for a bunch more types (timestamp, timestamptz, date, time, > timetz), courtesy of Martín Marqués. It's also been rebased to apply > cleanly on top of today's master branch. > > I have also added a selectivity function, but I'm not positive that it's > very useful yet. This patch doesn't appear to have been submitted to any Commitfest. Is this still a feature undergoing research then? -- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Nov 8, 2013 at 12:11 PM, Alvaro Herrera wrote: > Erik Rijkers wrote: > > On Thu, September 26, 2013 00:34, Erik Rijkers wrote: > > > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: > > > > > >> [minmax-5.patch] > > > > > > I have the impression it's not quite working correctly. > > Here's a version 7 of the patch, which fixes these bugs and adds > opclasses for a bunch more types (timestamp, timestamptz, date, time, > timetz), courtesy of Martín Marqués. It's also been rebased to apply > cleanly on top of today's master branch. > > I have also added a selectivity function, but I'm not positive that it's > very useful yet. > I tested it with attached script, but broke out of the "for" loop after 5 iterations (when it had 300,000,005 rows inserted) Then I did an analyze, and got an error message below: jjanes=# analyze; ERROR: could not truncate file "base/16384/16388_vm" to 488 blocks: it's only 82 blocks now 16388 is the index's relfilenode. Here is the backtrace upon entry to the truncate that is going to fail: #0 mdtruncate (reln=0x23c91b0, forknum=VISIBILITYMAP_FORKNUM, nblocks=488) at md.c:858 #1 0x0048eb4a in mmRevmapTruncate (rmAccess=0x26ad878, heapNumBlocks=1327434) at mmrevmap.c:360 #2 0x0048d37a in mmvacuumcleanup (fcinfo=) at minmax.c:1264 #3 0x0072dcef in FunctionCall2Coll (flinfo=, collation=, arg1=, arg2=) at fmgr.c:1323 #4 0x0048c1e5 in index_vacuum_cleanup (info=, stats=0x0) at indexam.c:715 #5 0x0052a7ce in do_analyze_rel (onerel=0x7f59798589e8, vacstmt=0x23b0bd8, acquirefunc=0x5298d0 , relpages=1327434, inh=0 '\000', elevel=13) at analyze.c:634 #6 0x0052b320 in analyze_rel (relid=, vacstmt=0x23b0bd8, bstrategy=) at analyze.c:267 #7 0x0057cba7 in vacuum (vacstmt=0x23b0bd8, relid=, do_toast=1 '\001', bstrategy=, for_wraparound=0 '\000', isTopLevel=) at vacuum.c:249 #8 0x00663177 in standard_ProcessUtility (parsetree=0x23b0bd8, queryString=, context=, params=0x0, dest=, completionTag=) at utility.c:682 #9 0x7f598290b791 in pgss_ProcessUtility (parsetree=0x23b0bd8, queryString=0x23b0220 "analyze \n;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x23b0f18, completionTag=0x7fffd3442f30 "") at pg_stat_statements.c:825 #10 0x0065fcf7 in PortalRunUtility (portal=0x24195e0, utilityStmt=0x23b0bd8, isTopLevel=1 '\001', dest=0x23b0f18, completionTag=0x7fffd3442f30 "") at pquery.c:1187 #11 0x00660c6d in PortalRunMulti (portal=0x24195e0, isTopLevel=1 '\001', dest=0x23b0f18, altdest=0x23b0f18, completionTag=0x7fffd3442f30 "") at pquery.c:1318 #12 0x00661323 in PortalRun (portal=0x24195e0, count=9223372036854775807, isTopLevel=1 '\001', dest=0x23b0f18, altdest=0x23b0f18, completionTag=0x7fffd3442f30 "") at pquery.c:816 #13 0x0065dbb4 in exec_simple_query (query_string=0x23b0220 "analyze \n;") at postgres.c:1048 #14 0x0065f259 in PostgresMain (argc=, argv=, dbname=0x2347be8 "jjanes", username=) at postgres.c:3992 #15 0x0061b7d0 in BackendRun (argc=, argv=) at postmaster.c:4085 #16 BackendStartup (argc=, argv=) at postmaster.c:3774 #17 ServerLoop (argc=, argv=) at postmaster.c:1585 #18 PostmasterMain (argc=, argv=) at postmaster.c:1240 #19 0x005b5e90 in main (argc=3, argv=0x2346cd0) at main.c:196 Cheers, Jeff minmax_test3.sh Description: Bourne shell script -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes (timings)
Erik Rijkers wrote: > Perhaps someone finds these timings useful. > '--enable-cassert' Assertions can really distort the timings, and not always equally for all code paths. Any chance of re-running those tests without that? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes (timings)
On 2013-11-15 17:11:46 +0100, Erik Rijkers wrote: > I've been messing with minmax indexes some more so here are some results of > that. > > Perhaps someone finds these timings useful. > > > Centos 5.7, 32 GB memory, 2 quadcores. > > '--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' > '--with-pgport=6444' '--enable-depend' '--enable-cassert' > '--enable-debug' '--with-perl' '--with-openssl' '--with-libxml' > '--enable-dtrace' Just some general advice: doing timings with --enable-cassert isn't that meaningful - it often can distort results significantly. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Mon, Nov 11, 2013 at 12:53 AM, Erik Rijkers wrote: > On Fri, November 8, 2013 21:11, Alvaro Herrera wrote: > > > > Here's a version 7 of the patch, which fixes these bugs and adds > > opclasses for a bunch more types (timestamp, timestamptz, date, time, > > timetz), courtesy of Martín Marqués. It's also been rebased to apply > > cleanly on top of today's master branch. > > > > I have also added a selectivity function, but I'm not positive that it's > > very useful yet. > > The earlier errors are indeed fixed; now, I've been trying with the > attached test case but I'm unable to find a query that > improves with minmax index use. (it gets used sometimes but speedup is > negligible). > Your data set seems to be completely random. I believe that minmax indices would only be expected to be useful when the data is clustered. Perhaps you could try it on a table that is populated with something like i + random()/10 * max_i. Cheers, Jeff
Re: [HACKERS] Minmax indexes
On Mon, November 11, 2013 09:53, Erik Rijkers wrote: > On Fri, November 8, 2013 21:11, Alvaro Herrera wrote: >> >> Here's a version 7 of the patch, which fixes these bugs and adds >> opclasses for a bunch more types (timestamp, timestamptz, date, time, >> timetz), courtesy of Martín Marqués. It's also been rebased to apply >> cleanly on top of today's master branch. >> >> I have also added a selectivity function, but I'm not positive that it's >> very useful yet. >> >> [minmax-7.patch] > > The earlier errors are indeed fixed; now, I've been trying with the attached > test case but I'm unable to find a query that > improves with minmax index use. (it gets used sometimes but speedup is > negligible). > Another issue (I think): Attached is a program (and its output as a .txt file) that gives the following (repeatable) error: $ ./casanova_test.sh \timing on drop table if exists t1; Time: 333.159 ms create table t1 (i int); Time: 155.827 ms create index t1_i_idx on t1 using minmax(i); Time: 204.031 ms insert into t1 select generate_series(1, 25000000); Time: 126312.302 ms analyze t1; ERROR: could not truncate file base/21324/26339_vm to 41 blocks: it's only 1 blocks now Time: 472.504 ms \timing on set enable_bitmapscan=1; explain analyze select i from t1 where i between 10000000 and 10001000; Time: 0.508 ms QUERY PLAN -- Bitmap Heap Scan on t1 (cost=32.25..117786.77 rows=125001 width=4) (actual time=1640.520..4465.768 rows=1001 loops=1) Recheck Cond: ((i >= 10000000) AND (i <= 10001000)) Rows Removed by Index Recheck: 24998999 -> Bitmap Index Scan on t1_i_idx (cost=0.00..1.00 rows=125001 width=0) (actual time=65.322..65.322 rows=291 loops=1) Index Cond: ((i >= 10000000) AND (i <= 10001000)) Total runtime: 4466.003 ms (6 rows) Time: 4468.943 ms set enable_bitmapscan=0; explain analyze select i from t1 where i between 10000000 and 10001000; Time: 0.146 ms QUERY PLAN --- Seq Scan on t1 (cost=0.00..485621.80 rows=125001 width=4) (actual time=1158.065..3332.925 rows=1001 loops=1) Filter: ((i >= 10000000) AND (i <= 10001000)) Rows Removed by Filter: 24998999 Total runtime: .038 ms (4 rows) Thanks, Erik Rijkers casanova_test.sh Description: application/shellscript -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, November 8, 2013 21:11, Alvaro Herrera wrote: > > Here's a version 7 of the patch, which fixes these bugs and adds > opclasses for a bunch more types (timestamp, timestamptz, date, time, > timetz), courtesy of Martín Marqués. It's also been rebased to apply > cleanly on top of today's master branch. > > I have also added a selectivity function, but I'm not positive that it's > very useful yet. > > [minmax-7.patch] The earlier errors are indeed fixed; now, I've been trying with the attached test case but I'm unable to find a query that improves with minmax index use. (it gets used sometimes but speedup is negligible). That probably means I'm doing something wrong; could you (or anyone) give some hints about what use-case would be expected? (Or is it just the unfinished selectivity function?) Thanks, Erikjan Rijkers test.sh Description: application/shellscript -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Alvaro Herrera escribió: > I have been playing with having the revmap in the main fork of the index > rather than a separate one. ... > This is not complete yet; although I have a proof-of-concept working, I > still need to write XLog support code and update the pageinspect code to > match. Just to be clear: the v7 published elsewhere in this thread does not contain this revmap-in-main-fork code. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Robert Haas escribió: > On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > wrote: > > Here's an updated version of this patch, with fixes to all the bugs > > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and > > Amit Kapila for the reports. > > I'm not very happy with the use of a separate relation fork for > storing this data. I have been playing with having the revmap in the main fork of the index rather than a separate one. On the surface many things stay just as they are; I only had to add a layer beneath the revmap that maps its logical block numbers to physical block numbers. The problem with this is that it needs more disk access, because revmap block numbers cannot be hardcoded. After doing some quick math, what I ended up with was to keep an array of BlockNumbers in the metapage. Each element in this array points to array pages; each array page is, in turn, filled with more BlockNumbers, which this time correspond to the logical revmap pages we used to have in the revmap fork. (I initially feared that this design would not allow me to address enough revmap pages for the largest of tables; but fortunately this is sufficient unless you configure very small pages, say BLCKSZ 2kB, use small page ranges, and use small datatypes, say "char". I have no problem with saying that that scenario is not supported if you want to have minmax indexes on 32 TB tables. I mean, who uses BLCKSZ smaller than 8kB anyway?). The advantage of this design is that in order to find any particular logical revmap page, you always have to do a constant number of page accesses. You read the metapage, then read the array page, then read the revmap page; done. Another idea I considered was chaining revmap pages (so each would have a pointer-to-next), or chaining array pages; but this would have meant that to locate an individual page near the end of the revmap, you might need to do many accesses. Not good. As an optimization for relatively small indexes, we hardcode the page number for the first revmap page: it's always the page right after the metapage (so BlockNumber 1). A revmap page can store, with the default page size, about 1350 item pointers; so with an index built for page ranges of 1000 pages per range, you can point to enough index entries for a ~10 GB table without needing to examine the first array page. This seems pretty acceptable; people with larger tables can likely spare one extra page access every now and then. (For comparison, each regular minmax page can store about 500 index tuples, if it's built for a single 4-byte column; this means that the 10 GB table requires a 5-page index.) This is not complete yet; although I have a proof-of-concept working, I still need to write XLog support code and update the pageinspect code to match. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
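As a rough illustration of the constant-cost lookup Alvaro describes, a standalone C sketch follows. The struct layouts, the array-page and metapage capacities, and the page-read callback are hypothetical stand-ins (the real code goes through the buffer manager); only the ~1350-entries-per-revmap-page figure comes from the mail above:

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    #define REVMAP_PAGE_MAXITEMS 1350   /* entries per revmap page (Alvaro's estimate) */
    #define ARRAYPAGE_MAXITEMS   2000   /* hypothetical BlockNumbers per array page */
    #define METAPAGE_ARRAY_SIZE  32     /* hypothetical array-page slots in the metapage */

    /* In-memory stand-ins for the on-disk pages, for illustration only. */
    typedef struct ArrayPage
    {
        BlockNumber revmap_blocks[ARRAYPAGE_MAXITEMS];
    } ArrayPage;

    typedef struct Metapage
    {
        BlockNumber array_blocks[METAPAGE_ARRAY_SIZE];
    } Metapage;

    /*
     * Map a page-range number to the physical block of the revmap page
     * that holds its entry: metapage -> array page -> revmap page, i.e.
     * a constant number of page reads regardless of table size.  Logical
     * revmap page 0 is hardcoded at block 1, right after the metapage,
     * so small indexes never touch an array page at all.
     */
    static BlockNumber
    revmap_physical_block(const Metapage *meta,
                          ArrayPage *(*read_array_page) (BlockNumber),
                          long range_number)
    {
        long        logical = range_number / REVMAP_PAGE_MAXITEMS;
        ArrayPage  *ap;

        if (logical == 0)
            return 1;           /* first revmap page follows the metapage */

        ap = read_array_page(meta->array_blocks[logical / ARRAYPAGE_MAXITEMS]);
        return ap->revmap_blocks[logical % ARRAYPAGE_MAXITEMS];
    }

The contrast with the rejected chained-page design is visible here: a chain would make the lookup cost proportional to the logical page number, while the two-level array keeps it at a fixed three reads.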
Re: [HACKERS] Minmax indexes
On Mon, Sep 30, 2013 at 1:20 PM, Heikki Linnakangas wrote: > You can almost create a bounding box opclass in the current implementation, > by mapping < operator to "contains" and > to "not contains". But there's no > support for creating a new, larger, bounding box on insert. It will just > replace the max with the new value if it's "greater than", when it should > create a whole new value to store in the index that covers both the old and > the new values. (or less than? I'm not sure which way those operators would > work..) This sounds an awful lot like GiST's "union" operation. Actually, following the GiST model of having "union" and "consistent" operations might be a smart way to go. Then the exact index semantics could be decided by the opclass. This might not even be that much extra code; the existing consistent and union functions for GiST are pretty short. That way, it'd be easy to add new opclasses with somewhat different behavior; the common thread would be that every opclass of this new AM works by summarizing a physical page range into a single indexed value. You might call the AM something like "summary" or "sparse" and then have "minmax_ops" for your first opclass. > In fact, even with regular b-tree operators, over integers for example, you > don't necessarily want to store both min and max. If you only ever perform > queries like "WHERE col > ?", there's no need to track the min value. So to > make this really general, you should be able to create an index on only the > minimum or maximum. Or if you want both, you can store them as separate > index columns. Something like: > > CREATE INDEX minindex ON table (col ASC); -- For min > CREATE INDEX minindex ON table (col DESC); -- For max > CREATE INDEX minindex ON table (col ASC, col DESC); -- For both This doesn't seem very general, since you're relying on the fact that ASC and DESC map to < and >. It's not clear what you'd write here if you wanted to optimize #$ and @!. But something based on opclasses will work, since each opclass can support an arbitrary set of operators. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
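A sketch of what those two support functions might look like for the plain integer min/max case, following the GiST model Robert describes. This is illustrative only: the names, the strategy enum, and the plain-C calling convention are invented here; a real opclass would go through the fmgr interface, with strategy numbers taken from the operator class definition:

    #include <stdbool.h>

    /* A minmax summary for one 4-byte integer column. */
    typedef struct MinmaxSummary
    {
        int         min;
        int         max;
    } MinmaxSummary;

    /*
     * GiST-style "union": enlarge a so that it covers every key that
     * either summary covered.  A bounding-box opclass would enlarge the
     * box here instead of comparing scalars.
     */
    static void
    minmax_union(MinmaxSummary *a, const MinmaxSummary *b)
    {
        if (b->min < a->min)
            a->min = b->min;
        if (b->max > a->max)
            a->max = b->max;
    }

    /* Strategies this toy opclass understands. */
    typedef enum MinmaxStrategy
    {
        STRAT_LT, STRAT_LE, STRAT_EQ, STRAT_GE, STRAT_GT
    } MinmaxStrategy;

    /*
     * GiST-style "consistent": may any row in the summarized page range
     * satisfy "col <strategy> query"?  True only means the range must be
     * scanned; false positives are acceptable, false negatives are not.
     */
    static bool
    minmax_consistent(const MinmaxSummary *s, MinmaxStrategy strat, int query)
    {
        switch (strat)
        {
            case STRAT_LT:
                return s->min < query;
            case STRAT_LE:
                return s->min <= query;
            case STRAT_EQ:
                return s->min <= query && query <= s->max;
            case STRAT_GE:
                return s->max >= query;
            case STRAT_GT:
                return s->max > query;
        }
        return true;            /* unknown strategy: scan to be safe */
    }

The appeal of this factoring is that the AM core only ever calls union and consistent; whether the summary is an interval, a bounding box, or a bloom filter becomes purely the opclass's business.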