Re: BRIN indexes (was Re: [HACKERS] Minmax indexes)
On 08/09/14 13:02, Alvaro Herrera wrote:
> Here's version 18. I have renamed it: these are now BRIN indexes.
>
> I have fixed numerous race conditions and deadlocks. In particular I
> fixed this problem you noted:
>
> Heikki Linnakangas wrote:
>> Another race condition:
>>
>> If a new tuple is inserted to the range while summarization runs,
>> it's possible that the new tuple isn't included in the tuple that
>> the summarization calculated, nor does the insertion itself update
>> it.
>
> I did it mostly in the way you outlined, i.e. by way of a placeholder
> tuple that gets updated by concurrent inserters; the tuple resulting
> from the scan is then unioned with the values in the updated
> placeholder tuple. This required the introduction of one extra support
> proc for opclasses (pretty simple stuff anyhow).
>
> There should be only minor items left now, such as silencing the
>
> WARNING: concurrent insert in progress within table "sales"
>
> which is emitted by IndexBuildHeapScan (possibly thousands of times)
> when doing a summarization of a range being inserted into or otherwise
> modified. Basically the issue here is that IBHS assumes it's being run
> with ShareLock on the heap (which blocks inserts), but here we're using
> it with ShareUpdateExclusive only, which lets inserts in. There is no
> harm AFAICS because of the placeholder tuple stuff I describe above.

While debugging a VACUUM VERBOSE ANALYZE on a table being concurrently
updated/inserted into, I hit this breakpoint:

(gdb)
Breakpoint 1, errfinish (dummy=0) at elog.c:411
411		ErrorData  *edata = &errordata[errordata_stack_depth];

The complete backtrace is at http://pastebin.com/gkigSNm7

Also, I found pages with an unknown type (using default parameters for
the index creation):

 brin_page_type | array_agg
----------------+-----------
 unknown (00)   | {3,4}
 revmap         | {1}
 regular        | {2}
 meta           | {0}
(4 rows)

--
Emanuel Calvo
@3manuek
Re: BRIN indexes (was Re: [HACKERS] Minmax indexes)
On 09/08/2014 07:02 PM, Alvaro Herrera wrote:
> Here's version 18. I have renamed it: these are now BRIN indexes.
>
> I have fixed numerous race conditions and deadlocks. In particular I
> fixed this problem you noted:
>
> Heikki Linnakangas wrote:
>> Another race condition:
>>
>> If a new tuple is inserted to the range while summarization runs,
>> it's possible that the new tuple isn't included in the tuple that
>> the summarization calculated, nor does the insertion itself update
>> it.
>
> I did it mostly in the way you outlined, i.e. by way of a placeholder
> tuple that gets updated by concurrent inserters and then the tuple
> resulting from the scan is unioned with the values in the updated
> placeholder tuple. This required the introduction of one extra support
> proc for opclasses (pretty simple stuff anyhow).

Hmm. So the union support proc is only called if there is a race
condition? That makes it very difficult to test, I'm afraid.

It would make more sense to pass BrinValues to the support functions,
rather than DeformedBrTuple. An opclass's support function should never
need to access the values for other columns.

Does minmaxUnion handle NULLs correctly?

minmaxUnion pfrees the old values. Is that necessary? What memory
context does the function run in? If the code runs in a short-lived
memory context, you might as well just let them leak. If it runs in a
long-lived context, well, perhaps it shouldn't. It's nicer to write
functions that can leak freely. IIRC, GiST and GIN run the support
functions in a temporary context. In any case, it might be worth noting
explicitly in the docs which functions may leak and which may not.

If you add a new datatype, and define b-tree operators for it, what is
required to create a minmax opclass for it? Would it be possible to
generalize the functions in brin_minmax.c so that they can be reused
for any datatype (with b-tree operators) without writing any new C
code? I think we're almost there; the only thing that differs between
each data type is the opcinfo function. Let's pass the type OID as an
argument to the opcinfo function. You could then have just a single
minmax_opcinfo function, instead of the macro to generate a separate
function for each built-in datatype.

In general, this patch is in pretty good shape now, thanks!

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
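To make the union semantics under discussion concrete -- in particular
the NULL handling Heikki asks about -- here is a minimal sketch. All
names (MinmaxSummary, the cmp callback) are hypothetical stand-ins, not
the patch's actual types; only the merge rules are implied by the thread.

	#include "postgres.h"

	/* hypothetical per-column summary for one block range */
	typedef struct MinmaxSummary
	{
		bool	hasnulls;	/* range contains NULL values */
		bool	allnulls;	/* no non-NULL values summarized yet */
		Datum	min;
		Datum	max;
	} MinmaxSummary;

	/* merge summary B into summary A (e.g. placeholder into scan result) */
	static void
	minmax_union_sketch(MinmaxSummary *a, MinmaxSummary *b,
						int (*cmp) (Datum, Datum))
	{
		/* NULLs seen by either side must survive into the result */
		a->hasnulls |= b->hasnulls;

		if (b->allnulls)
			return;				/* B carries no min/max to merge */

		if (a->allnulls)
		{
			/* A had no values yet: adopt B's interval wholesale */
			a->allnulls = false;
			a->min = b->min;
			a->max = b->max;
			return;
		}

		/* both sides have values: widen A's interval to cover B's */
		if (cmp(b->min, a->min) < 0)
			a->min = b->min;
		if (cmp(b->max, a->max) > 0)
			a->max = b->max;
	}

The allnulls/hasnulls distinction is the part that is easy to get wrong:
a placeholder that saw only NULLs must not contribute a bogus min/max,
but its hasnulls flag still has to propagate.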
Re: [HACKERS] Minmax indexes
Alvaro Herrera wrote:
> So here's v16, rebased on top of 9bac66020. As far as I am concerned,
> this is the last version before I start renaming everything to BRIN and
> then commit.

FWIW in case you or others have interest, here's the diff between your
patch and v16. Also, for illustrative purposes, the diff between your
version and mine of the code that got moved to mmpageops.c, because it's
difficult to see it from the partial patch. (There's nothing to do with
that partial diff other than read it directly.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

*** a/contrib/pageinspect/mmfuncs.c
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 29,35 ****
  PG_FUNCTION_INFO_V1(minmax_page_type);
  PG_FUNCTION_INFO_V1(minmax_page_items);
  PG_FUNCTION_INFO_V1(minmax_metapage_info);
- PG_FUNCTION_INFO_V1(minmax_revmap_array_data);
  PG_FUNCTION_INFO_V1(minmax_revmap_data);

  typedef struct mm_column_state
--- 29,34 ----
*************** minmax_revmap_data(PG_FUNCTION_ARGS)
*** 388,394 ****
  	values[0] = Int64GetDatum((uint64) 0);

  	/* Extract (possibly empty) list of TIDs in this page. */
! 	for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
  	{
  		ItemPointer tid;
--- 387,393 ----
  	values[0] = Int64GetDatum((uint64) 0);

  	/* Extract (possibly empty) list of TIDs in this page. */
! 	for (i = 0; i < REVMAP_PAGE_MAXITEMS; i++)
  	{
  		ItemPointer tid;
*** a/src/backend/access/minmax/Makefile
--- b/src/backend/access/minmax/Makefile
***************
*** 12,17 ****
  subdir = src/backend/access/minmax
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o

  include $(top_srcdir)/src/backend/common.mk
--- 12,17 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global

! OBJS = minmax.o mmpageops.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o

  include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/minmax/minmax.c
--- b/src/backend/access/minmax/minmax.c
***************
*** 15,45 ****
   */
  #include "postgres.h"

- #include "access/htup_details.h"
  #include "access/minmax.h"
  #include "access/minmax_internal.h"
  #include "access/minmax_page.h"
! #include "access/minmax_revmap.h"
! #include "access/minmax_tuple.h"
  #include "access/minmax_xlog.h"
  #include "access/reloptions.h"
  #include "access/relscan.h"
- #include "access/xlogutils.h"
  #include "catalog/index.h"
- #include "catalog/pg_operator.h"
- #include "commands/vacuum.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
- #include "storage/indexfsm.h"
- #include "storage/lmgr.h"
- #include "storage/smgr.h"
- #include "utils/datum.h"
- #include "utils/lsyscache.h"
- #include "utils/memutils.h"
  #include "utils/rel.h"
- #include "utils/syscache.h"

  /*
--- 15,33 ----
   */
  #include "postgres.h"

  #include "access/minmax.h"
  #include "access/minmax_internal.h"
  #include "access/minmax_page.h"
! #include "access/minmax_pageops.h"
  #include "access/minmax_xlog.h"
  #include "access/reloptions.h"
  #include "access/relscan.h"
  #include "catalog/index.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
  #include "utils/rel.h"

  /*
***************
*** 75,93 ****
  static MMBuildState *initialize_mm_buildstate(Relation idxRel,
  static bool terminate_mm_buildstate(MMBuildState *state);
  static void summarize_range(MMBuildState *mmstate, Relation heapRel,
  				BlockNumber heapBlk);
- static bool mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
- 			mmRevmapAccess *rmAccess, BlockNumber heapBlk,
- 			Buffer oldbuf, OffsetNumber oldoff,
- 			const MMTuple *origtup, Size origsz,
- 			const MMTuple *newtup, Size newsz,
- 			bool samepage, bool *extended);
- static void mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
- 			mmRevmapAccess *rmAccess, Buffer *buffer, BlockNumber heapblkno,
- 			MMTuple *tup, Size itemsz, bool *extended);
- static Buffer mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
- 			bool *extended);
  static void form_and_insert_tuple(MMBuildState *mmstate);
- static Size mm_page_get_freespace(Page page);

  /*
--- 63,69 ----
*************** mminsert(PG_FUNCTION_ARGS)
*** 123,128 ****
--- 99,105 ----
  	rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);

  restart:
+ 	CHECK_FOR_INTERRUPTS();
  	heapBlk = ItemPointerGetBlockNumber(heaptid);
  	/* normalize the block number to be the first block in the range */
  	heapBlk = (heapBlk / pagesPerRange) * pagesPerRange;
***************
*** 155,161 ****
  	addValue = index_getprocinfo(idxRel, keyno + 1,
  								 MINMAX_PROCNUM_ADDVALUE);
- 
  	result = FunctionCall5Coll(addValue,
  							   idxRel->rd_indcollation[keyno],
  							   PointerGetDatum(mmdesc),
--- 132,137
Re: [HACKERS] Minmax indexes
Fujii Masao wrote:
> I've not read the patch yet. But while testing the feature, I found that
>
> * Brin index cannot be created on CHAR(n) column.
>   Maybe other data types have the same problem.

Yeah, it's just a matter of adding an opclass for it -- pretty simple
stuff really, because you don't need to write any code, just add a bunch
of catalog entries and an OPCINFO line in mmsortable.c. Right now there
are opclasses for the following types:

int4
numeric
text
date
timestamp with time zone
timestamp
time with time zone
time
"char"

We can eventually extend to cover all types that have btree opclasses,
but we can do that in a separate commit. I'm also considering removing
the opclass for time with time zone, as it's a pretty useless type. I
mostly added the ones that are there as a way to test that the code
behaved reasonably in the various cases (pass-by-value vs. not, variable
width vs. fixed, different alignment requirements).

Of course, the really interesting part is adding a completely different
opclass, such as one that stores bounding boxes.

> * FILLFACTOR cannot be set in brin index.

I hadn't added this one because I didn't think there was much point
previously, but I think it might now be useful to allow same-page
updates.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
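The reason no new C code is needed ties into Heikki's review above:
once the type OID is passed in, one generic opcinfo function can serve
every sortable type. A hedged sketch follows; MinmaxOpcInfo and the
function name are hypothetical stand-ins for the patch's real opcinfo
machinery.

	#include "postgres.h"

	/* hypothetical stand-in for the patch's opcinfo struct */
	typedef struct MinmaxOpcInfo
	{
		Oid		typid;		/* type whose values this opclass summarizes */
		int		nstored;	/* datums stored per column: min and max */
	} MinmaxOpcInfo;

	static MinmaxOpcInfo *
	minmax_opcinfo_sketch(Oid typid)
	{
		MinmaxOpcInfo *info = (MinmaxOpcInfo *) palloc(sizeof(MinmaxOpcInfo));

		/*
		 * Nothing here depends on the particular type: the comparison
		 * functions are looked up later through the type's btree
		 * opclass, so supporting CHAR(n) (or any other btree-sortable
		 * type) reduces to adding catalog entries.
		 */
		info->typid = typid;
		info->nstored = 2;		/* sortable opclasses keep a min and a max */
		return info;
	}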
Re: [HACKERS] Minmax indexes
On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>>> I'm sure this still needs some cleanup, but here's the patch, based
>>> on your v14. Now that I know what this approach looks like, I still
>>> like it much better. The insert and update code is somewhat more
>>> complicated, because you have to be careful to lock the old page,
>>> new page, and revmap page in the right order. But it's not too bad,
>>> and it gets rid of all the complexity in vacuum.
>>
>> It seems there is some issue here, because pageinspect tells me the
>> index is not growing properly for some reason. minmax_revmap_data
>> gives me this array of TIDs after a bunch of insert/vacuum/delete/
>> etc:
>
> I fixed this issue, and did a lot more rework and bugfixing. Here's
> v15, based on v14-heikki2.

So, the other design change I've been advocating is to store the revmap
in the first N blocks, instead of having the two-level structure with
array pages and revmap pages. Attached is a patch for that, to be
applied after v15. When the revmap needs to be expanded, all the tuples
on it are moved elsewhere one-by-one. That adds some latency to the
unfortunate guy who needs to do that, but as the patch stands, the
revmap is only ever extended by VACUUM or CREATE INDEX, so I think
that's fine.

Like with my previous patch, the point is to demonstrate how much
simpler the code becomes this way; I'm sure there are bugs and cleanup
still necessary.

PS. Spotted one oversight in patch v15: callers of mm_doupdate must
check the return value, and retry the operation if it returns false.

- Heikki

commit ce4df0e9dbd43f7e3d4fdf3f7920301f81f17d63
Author: Heikki Linnakangas
Date:   Fri Aug 15 18:32:19 2014 +0300

    Get rid of array pages. Instead, move all tuples out of the way

diff --git a/contrib/pageinspect/mmfuncs.c b/contrib/pageinspect/mmfuncs.c
index 6cd559a..51cc9e2 100644
--- a/contrib/pageinspect/mmfuncs.c
+++ b/contrib/pageinspect/mmfuncs.c
@@ -74,9 +74,6 @@ minmax_page_type(PG_FUNCTION_ARGS)
 		case MINMAX_PAGETYPE_META:
 			type = "meta";
 			break;
-		case MINMAX_PAGETYPE_REVMAP_ARRAY:
-			type = "revmap array";
-			break;
 		case MINMAX_PAGETYPE_REVMAP:
 			type = "revmap";
 			break;
@@ -343,11 +340,9 @@ minmax_metapage_info(PG_FUNCTION_ARGS)
 	Page		page;
 	MinmaxMetaPageData *meta;
 	TupleDesc	tupdesc;
-	Datum		values[3];
-	bool		nulls[3];
-	ArrayBuildState *astate = NULL;
+	Datum		values[4];
+	bool		nulls[4];
 	HeapTuple	htup;
-	int			i;
 
 	page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_META, "metapage");
 
@@ -361,22 +356,8 @@ minmax_metapage_info(PG_FUNCTION_ARGS)
 	MemSet(nulls, 0, sizeof(nulls));
 	values[0] = CStringGetTextDatum(psprintf("0x%08X", meta->minmaxMagic));
 	values[1] = Int32GetDatum(meta->minmaxVersion);
-
-	/* Extract (possibly empty) list of revmap array page numbers. */
-	for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
-	{
-		BlockNumber blkno;
-
-		blkno = meta->revmapArrayPages[i];
-		if (blkno == InvalidBlockNumber)
-			break;		/* XXX or continue? */
-		astate = accumArrayResult(astate, Int64GetDatum((int64) blkno),
-								  false, INT8OID, CurrentMemoryContext);
-	}
-	if (astate == NULL)
-		nulls[2] = true;
-	else
-		values[2] = makeArrayResult(astate, CurrentMemoryContext);
+	values[2] = Int32GetDatum(meta->pagesPerRange);
+	values[3] = Int64GetDatum(meta->lastRevmapPage);
 
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
@@ -384,34 +365,6 @@ minmax_metapage_info(PG_FUNCTION_ARGS)
 }
 
 /*
- * Return the BlockNumber array stored in a revmap array page
- */
-Datum
-minmax_revmap_array_data(PG_FUNCTION_ARGS)
-{
-	bytea	   *raw_page = PG_GETARG_BYTEA_P(0);
-	Page		page;
-	ArrayBuildState *astate = NULL;
-	RevmapArrayContents *contents;
-	Datum		blkarr;
-	int			i;
-
-	page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP_ARRAY,
-							  "revmap array");
-
-	contents = (RevmapArrayContents *) PageGetContents(page);
-
-	for (i = 0; i < contents->rma_nblocks; i++)
-		astate = accumArrayResult(astate,
-								  Int64GetDatum((int64) contents->rma_blocks[i]),
-								  false, INT8OID, CurrentMemoryContext);
-	Assert(astate != NULL);
-
-	blkarr = makeArrayResult(astate, CurrentMemoryContext);
-	PG_RETURN_DATUM(blkarr);
-}
-
-/*
  * Return the TID array stored in a minmax revmap page
  */
 Datum
@@ -437,7 +390,7 @@ minmax_revmap_data(PG_FUNCTION_ARGS)
 	/* Extract values from the revmap page */
 	contents = (RevmapContents *) PageGetContents(page);
 	MemSet(nulls, 0, sizeof(nulls));
-	values[0] = Int64GetDatum((uint64) contents->rmr_logblk);
+	values[0] = Int64GetDatum((uint64) 0);
 
 	/* Extract (possibly empty) list of TIDs in this page. */
 	for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
diff --git a/contrib/pageinspect/pageinspect--1.2.sql b/contrib/pageinspect/pageinspect--1.2.sql
index 56c9ba8..cba90ca 100644
--- a/contrib/pageinspect/pageinspect--1.2.sql
+++ b/contrib/pageinspect/pageinspect--1.2.sql
@@ -110,7 +110,7 @@ LANGUAGE C STRICT;
 -- minmax_metapage_info()
Re: [HACKERS] Minmax indexes
On Fri, Aug 15, 2014 at 8:02 AM, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>
>>> I'm sure this still needs some cleanup, but here's the patch, based
>>> on your v14. Now that I know what this approach looks like, I still
>>> like it much better. The insert and update code is somewhat more
>>> complicated, because you have to be careful to lock the old page,
>>> new page, and revmap page in the right order. But it's not too bad,
>>> and it gets rid of all the complexity in vacuum.
>>
>> It seems there is some issue here, because pageinspect tells me the
>> index is not growing properly for some reason. minmax_revmap_data gives
>> me this array of TIDs after a bunch of insert/vacuum/delete/ etc:
>
> I fixed this issue, and did a lot more rework and bugfixing. Here's
> v15, based on v14-heikki2.

I've not read the patch yet. But while testing the feature, I found that

* Brin index cannot be created on CHAR(n) column.
  Maybe other data types have the same problem.

* FILLFACTOR cannot be set in brin index.

Are these intentional?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 08/15/2014 10:26 AM, Heikki Linnakangas wrote:
> On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
>> Alvaro Herrera wrote:
>>> Heikki Linnakangas wrote:
>>>> I'm sure this still needs some cleanup, but here's the patch, based
>>>> on your v14. Now that I know what this approach looks like, I still
>>>> like it much better. The insert and update code is somewhat more
>>>> complicated, because you have to be careful to lock the old page,
>>>> new page, and revmap page in the right order. But it's not too bad,
>>>> and it gets rid of all the complexity in vacuum.
>>>
>>> It seems there is some issue here, because pageinspect tells me the
>>> index is not growing properly for some reason. minmax_revmap_data
>>> gives me this array of TIDs after a bunch of insert/vacuum/delete/
>>> etc:
>>
>> I fixed this issue, and did a lot more rework and bugfixing. Here's
>> v15, based on v14-heikki2.
>
> Thanks! I think remaining issues are mostly minimal (pageinspect
> should output block number alongside each tuple, now that we have it,
> for example.)
>
> There's this one issue I left in my patch version that I think we
> should do something about:
>
> +	/*
> +	 * No luck. Assume that the revmap was updated concurrently.
> +	 *
> +	 * XXX: it would be nice to add some kind of a sanity check here to
> +	 * avoid looping infinitely, if the revmap points to wrong tuple for
> +	 * some reason.
> +	 */
>
> This happens when we follow the revmap to a tuple, but find that the
> tuple points to a different block than what the revmap claimed.
> Currently, we just assume that it's because the tuple was updated
> concurrently, but while hacking, I frequently had a broken index where
> the revmap pointed to bogus tuples or the tuples had a missing/wrong
> block number on them, and ran into an infinite loop here. It's clearly
> a case of a corrupt index and shouldn't happen, but I would imagine
> that it's a fairly typical way this would fail in production too,
> because of hardware issues or bugs. So I think we need to work a bit
> harder to stop the looping and throw an error instead.
>
> Perhaps something as simple as keeping a loop counter and giving up
> after 1000 attempts would be good enough. The window between releasing
> the lock on the revmap, and acquiring the lock on the page containing
> the MMTuple is very narrow, so the chances of losing that race to a
> concurrent update more than 1-2 times in a row is vanishingly small.

Reading the patch more closely, I see that you added a check that when
we loop, we throw an error if the new item pointer in the revmap is the
same as before. In theory, it's possible that two concurrent updates
happen: one that moves the tuple we're looking for elsewhere, and
another that moves it back again. The probability of that is also
vanishingly small, so maybe that's OK. Or we could check the LSN; if
the revmap has been updated, its LSN must've changed.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
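For illustration, here is what the bounded-retry idea could look like.
This is a sketch only, not the patch's code: MMTuple,
mm_fetch_from_revmap() and MMTupleGetHeapBlock() are hypothetical
stand-ins; the 1000-attempt cap and the corruption error are the parts
taken from the discussion.

	#include "postgres.h"
	#include "utils/rel.h"

	/* illustrative stand-ins for the patch's real types and helpers */
	typedef struct MMTuple MMTuple;
	extern MMTuple *mm_fetch_from_revmap(Relation idxRel, BlockNumber heapBlk);
	extern BlockNumber MMTupleGetHeapBlock(MMTuple *tup);

	#define MAX_REVMAP_RETRIES	1000

	static MMTuple *
	fetch_mmtuple_checked(Relation idxRel, BlockNumber heapBlk)
	{
		int			retries;

		for (retries = 0; retries < MAX_REVMAP_RETRIES; retries++)
		{
			MMTuple    *tup = mm_fetch_from_revmap(idxRel, heapBlk);

			/* matching block number: we won the (narrow) race, if any */
			if (tup != NULL && MMTupleGetHeapBlock(tup) == heapBlk)
				return tup;
		}

		/* losing the race this many times in a row means a corrupt index */
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("revmap does not point to a valid tuple for block %u",
						heapBlk)));
		return NULL;			/* keep compiler quiet */
	}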
Re: [HACKERS] Minmax indexes
On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
>> Heikki Linnakangas wrote:
>>> I'm sure this still needs some cleanup, but here's the patch, based
>>> on your v14. Now that I know what this approach looks like, I still
>>> like it much better. The insert and update code is somewhat more
>>> complicated, because you have to be careful to lock the old page,
>>> new page, and revmap page in the right order. But it's not too bad,
>>> and it gets rid of all the complexity in vacuum.
>>
>> It seems there is some issue here, because pageinspect tells me the
>> index is not growing properly for some reason. minmax_revmap_data
>> gives me this array of TIDs after a bunch of insert/vacuum/delete/
>> etc:
>
> I fixed this issue, and did a lot more rework and bugfixing. Here's
> v15, based on v14-heikki2.

Thanks! I think remaining issues are mostly minimal (pageinspect should
output block number alongside each tuple, now that we have it, for
example.)

There's this one issue I left in my patch version that I think we
should do something about:

+	/*
+	 * No luck. Assume that the revmap was updated concurrently.
+	 *
+	 * XXX: it would be nice to add some kind of a sanity check here to
+	 * avoid looping infinitely, if the revmap points to wrong tuple for
+	 * some reason.
+	 */

This happens when we follow the revmap to a tuple, but find that the
tuple points to a different block than what the revmap claimed.
Currently, we just assume that it's because the tuple was updated
concurrently, but while hacking, I frequently had a broken index where
the revmap pointed to bogus tuples or the tuples had a missing/wrong
block number on them, and ran into an infinite loop here. It's clearly
a case of a corrupt index and shouldn't happen, but I would imagine
that it's a fairly typical way this would fail in production too,
because of hardware issues or bugs. So I think we need to work a bit
harder to stop the looping and throw an error instead.

Perhaps something as simple as keeping a loop counter and giving up
after 1000 attempts would be good enough. The window between releasing
the lock on the revmap, and acquiring the lock on the page containing
the MMTuple is very narrow, so the chances of losing that race to a
concurrent update more than 1-2 times in a row is vanishingly small.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Heikki Linnakangas wrote:
> I couldn't resist starting to hack on this, and implemented the
> scheme I've been having in mind:
>
> 1. MMTuple contains the block number of the heap page (range) that
> the tuple represents. Vacuum is no longer needed to clean up old
> tuples; when an index tuple is updated, the old tuple is deleted
> atomically with the insertion of a new tuple and updating the
> revmap, so no garbage is left behind.
>
> 2. LockTuple is gone. When following the pointer from revmap to
> MMTuple, the block number is used to check that you land on the
> right tuple. If not, the search is started over, looking at the
> revmap again.

Thanks, looks good, yeah. Did you just forget to attach the
access/rmgrdesc/minmaxdesc.c file, or did you ignore it altogether?
Anyway I hacked one up, and cleaned up some other things.

> I'm sure this still needs some cleanup, but here's the patch, based
> on your v14. Now that I know what this approach looks like, I still
> like it much better. The insert and update code is somewhat more
> complicated, because you have to be careful to lock the old page,
> new page, and revmap page in the right order. But it's not too bad,
> and it gets rid of all the complexity in vacuum.

It seems there is some issue here, because pageinspect tells me the
index is not growing properly for some reason. minmax_revmap_data gives
me this array of TIDs after a bunch of insert/vacuum/delete/ etc:

"(2,1)","(2,2)","(2,3)","(2,4)","(2,5)","(4,1)","(5,1)","(6,1)","(7,1)",
"(8,1)","(9,1)","(10,1)","(11,1)","(12,1)","(13,1)","(14,1)","(15,1)",
"(16,1)","(17,1)","(18,1)","(19,1)","(20,1)","(21,1)","(22,1)","(23,1)",
"(24,1)","(25,1)","(26,1)","(27,1)","(28,1)","(29,1)","(30,1)","(31,1)",
"(32,1)","(33,1)","(34,1)","(35,1)","(36,1)","(37,1)","(38,1)","(39,1)",
"(40,1)","(41,1)","(42,1)","(43,1)","(44,1)","(45,1)","(46,1)","(47,1)",
"(48,1)","(49,1)","(50,1)","(51,1)","(52,1)","(53,1)","(54,1)","(55,1)",
"(56,1)","(57,1)","(58,1)","(59,1)","(60,1)","(61,1)","(62,1)","(63,1)",
"(64,1)","(65,1)","(66,1)","(67,1)","(68,1)","(69,1)","(70,1)","(71,1)",
"(72,1)","(73,1)","(74,1)","(75,1)","(76,1)","(77,1)","(78,1)","(79,1)",
"(80,1)","(81,1)","(82,1)","(83,1)","(84,1)","(85,1)","(86,1)","(87,1)",
"(88,1)","(89,1)","(90,1)","(91,1)","(92,1)","(93,1)","(94,1)","(95,1)",
"(96,1)","(97,1)","(98,1)","(99,1)","(100,1)","(101,1)","(102,1)",
"(103,1)","(104,1)","(105,1)","(106,1)","(107,1)","(108,1)","(109,1)",
"(110,1)","(111,1)","(112,1)","(113,1)","(114,1)","(115,1)","(116,1)",
"(117,1)","(118,1)","(119,1)","(120,1)","(121,1)","(122,1)","(123,1)",
"(124,1)","(125,1)","(126,1)","(127,1)","(128,1)","(129,1)","(130,1)",
"(131,1)","(132,1)","(133,1)","(134,1)"

There are some who would think that getting one item per page is
suboptimal. (Maybe it's just a missing FSM update somewhere.)

I've been hacking away a bit more at this; will post an updated patch
probably tomorrow (was about to post but just found a memory stomp in
pageinspect.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
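If a missing FSM update is indeed the culprit, the fix would be along
these lines. This is a hedged guess at the missing step, not the actual
patch code; the function is illustrative, though RecordPageWithFreeSpace
and PageGetFreeSpace are the real backend APIs.

	#include "postgres.h"
	#include "storage/bufmgr.h"
	#include "storage/bufpage.h"
	#include "storage/freespace.h"
	#include "utils/rel.h"

	/*
	 * After deleting an index tuple, report the page's new free space
	 * to the FSM; otherwise later insertions never find the hole and
	 * keep extending the index, one item per page.
	 */
	static void
	record_freespace_after_delete(Relation idxRel, Buffer buf)
	{
		BlockNumber blkno = BufferGetBlockNumber(buf);
		Size		freespace = PageGetFreeSpace(BufferGetPage(buf));

		RecordPageWithFreeSpace(idxRel, blkno, freespace);
	}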
Re: [HACKERS] Minmax indexes
On Fri, Aug 8, 2014 at 6:01 AM, Heikki Linnakangas wrote:
> It's possible that two backends arrive at phase 3 at the same time,
> with different values. For example, backend A wants to update the
> minimum to contain 10, and backend B wants to update it to 5. Now, if
> backend B gets to update the tuple first, to 5, backend A will update
> the tuple to 10 when it gets the lock, which is wrong.
>
> The simplest solution would be to get the buffer lock in exclusive
> mode to begin with, so that you don't need to release it between steps
> 2 and 5. That might be a significant hit on concurrency, though, when
> most of the insertions don't in fact have to update the value. Another
> idea is to re-check the updated values after acquiring the lock in
> exclusive mode, to see if they match the previous values.

No, the simplest solution is to re-check the bounds after acquiring the
exclusive lock. So instead of doing the addValue with a share lock, do
a consistency check first, and if it's not consistent, do the addValue
with an exclusive lock.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
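A sketch of that check-then-recheck pattern follows; it is illustrative
only, with range_consistent() and add_value() as hypothetical stand-ins
for the opclass calls. The LockBuffer() interface is the real one.

	#include "postgres.h"
	#include "storage/bufmgr.h"

	/* hypothetical helpers standing in for the opclass support procs */
	extern bool range_consistent(Buffer buf, Datum newval);
	extern void add_value(Buffer buf, Datum newval);

	static void
	mm_update_summary(Buffer buf, Datum newval)
	{
		/* cheap path: most insertions don't widen the stored interval */
		LockBuffer(buf, BUFFER_LOCK_SHARE);
		if (range_consistent(buf, newval))
		{
			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
			return;
		}
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);

		/*
		 * The summary must be widened. Take the exclusive lock and test
		 * again: another backend may have widened it differently while
		 * we held no lock, and applying our stale values blindly would
		 * overwrite theirs (the 10-vs-5 scenario above).
		 */
		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
		if (!range_consistent(buf, newval))
			add_value(buf, newval);
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	}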
Re: [HACKERS] Minmax indexes
On 08/10/2014 12:42 PM, Simon Riggs wrote:
> On 8 August 2014 16:03, Heikki Linnakangas wrote:
>> I couldn't resist starting to hack on this, and implemented the
>> scheme I've been having in mind:
>>
>> 1. MMTuple contains the block number of the heap page (range) that
>> the tuple represents. Vacuum is no longer needed to clean up old
>> tuples; when an index tuple is updated, the old tuple is deleted
>> atomically with the insertion of a new tuple and updating the revmap,
>> so no garbage is left behind.
>>
>> 2. LockTuple is gone. When following the pointer from revmap to
>> MMTuple, the block number is used to check that you land on the right
>> tuple. If not, the search is started over, looking at the revmap
>> again.
>
> Part 2 sounds interesting, especially because of the reduction in CPU
> that it might allow. Part 1 doesn't sound good yet. Are they
> connected?

Yes. The optimistic locking in part 2 is based on checking that the
block number on the MMTuple matches what you're searching for, and that
there is never more than one MMTuple in the index with the same block
number.

> More importantly, can't we tweak this after commit? Delaying commit
> just means less time for other people to see, test, understand, tune
> and fix. I see you (Heikki) doing lots of incremental development,
> lots of small commits. Can't we do this one the same way?

Well, I wouldn't consider "let's redesign how locking and vacuuming
work and change the on-disk format" as incremental development ;-).
It's more like, well, redesigning the whole thing. Any testing and
tuning would certainly need to be redone after such big changes. If you
agree that these changes make sense, let's do them now and not waste
people's time testing and tuning a dead-end design. If you don't agree,
then let's discuss that.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 08/10/2014 12:22 PM, Simon Riggs wrote:
> On 8 August 2014 16:03, Heikki Linnakangas wrote:
>> 1. MMTuple contains the block number of the heap page (range) that
>> the tuple represents. Vacuum is no longer needed to clean up old
>> tuples; when an index tuple is updated, the old tuple is deleted
>> atomically with the insertion of a new tuple and updating the revmap,
>> so no garbage is left behind.
>
> What happens if the transaction that does this aborts? Surely that
> means the new value is itself garbage? What cleans up that?

It's no different from Alvaro's patch. The updated MMTuple covers the
aborted value, but that's OK from a correctness point of view.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 8 August 2014 16:03, Heikki Linnakangas wrote:
> I couldn't resist starting to hack on this, and implemented the scheme
> I've been having in mind:
>
> 1. MMTuple contains the block number of the heap page (range) that the
> tuple represents. Vacuum is no longer needed to clean up old tuples;
> when an index tuple is updated, the old tuple is deleted atomically
> with the insertion of a new tuple and updating the revmap, so no
> garbage is left behind.
>
> 2. LockTuple is gone. When following the pointer from revmap to
> MMTuple, the block number is used to check that you land on the right
> tuple. If not, the search is started over, looking at the revmap again.

Part 2 sounds interesting, especially because of the reduction in CPU
that it might allow. Part 1 doesn't sound good yet. Are they connected?

More importantly, can't we tweak this after commit? Delaying commit
just means less time for other people to see, test, understand, tune
and fix.

I see you (Heikki) doing lots of incremental development, lots of small
commits. Can't we do this one the same way?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 8 August 2014 10:01, Heikki Linnakangas wrote:
> It's possible that two backends arrive at phase 3 at the same time,
> with different values. For example, backend A wants to update the
> minimum to contain 10, and backend B wants to update it to 5. Now, if
> backend B gets to update the tuple first, to 5, backend A will update
> the tuple to 10 when it gets the lock, which is wrong.
>
> The simplest solution would be to get the buffer lock in exclusive
> mode to begin with, so that you don't need to release it between steps
> 2 and 5. That might be a significant hit on concurrency, though, when
> most of the insertions don't in fact have to update the value. Another
> idea is to re-check the updated values after acquiring the lock in
> exclusive mode, to see if they match the previous values.

The simplest solution is to re-apply the test just before the update,
so in the above example, if we think we want to lower the minimum to 10
and when we get there it is already 5, we just don't update.

We don't need to do the re-check always, though. We can read the page
LSN while holding the share lock, then re-read it once we acquire the
exclusive lock. If the LSN is the same, there is no need for
datatype-specific re-checks at all.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
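A sketch of the LSN shortcut follows (illustrative only;
range_consistent() and add_value() are hypothetical helpers). The idea:
if the page LSN did not change between dropping the share lock and
taking the exclusive lock, no WAL-logged modification touched the page,
so the decision made under the share lock still holds.

	#include "postgres.h"
	#include "storage/bufmgr.h"
	#include "storage/bufpage.h"

	extern bool range_consistent(Buffer buf, Datum newval);	/* hypothetical */
	extern void add_value(Buffer buf, Datum newval);		/* hypothetical */

	static void
	mm_update_summary_lsn(Buffer buf, Datum newval)
	{
		Page		page = BufferGetPage(buf);
		XLogRecPtr	lsn;

		LockBuffer(buf, BUFFER_LOCK_SHARE);
		if (range_consistent(buf, newval))
		{
			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
			return;				/* nothing to widen */
		}
		lsn = PageGetLSN(page);
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);

		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
		if (PageGetLSN(page) == lsn ||		/* unchanged: decision holds */
			!range_consistent(buf, newval))	/* changed: redo the full check */
			add_value(buf, newval);
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	}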
Re: [HACKERS] Minmax indexes
On 8 August 2014 16:03, Heikki Linnakangas wrote:
> 1. MMTuple contains the block number of the heap page (range) that the
> tuple represents. Vacuum is no longer needed to clean up old tuples;
> when an index tuple is updated, the old tuple is deleted atomically
> with the insertion of a new tuple and updating the revmap, so no
> garbage is left behind.

What happens if the transaction that does this aborts? Surely that
means the new value is itself garbage? What cleans up that?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Another race condition:

If a new tuple is inserted to the range while summarization runs, it's
possible that the new tuple isn't included in the tuple that the
summarization calculated, nor does the insertion itself update it.

1. There is no index tuple for page range 1-10.
2. Summarization begins. It scans pages 1-5.
3. A new insertion inserts a heap tuple to page 1.
4. The insertion sees that there is no index tuple covering range 1-10,
   so it doesn't update it.
5. The summarization finishes scanning pages 5-10, and inserts the new
   index tuple. The summarization didn't see the newly inserted heap
   tuple, and hence it's not included in the calculated index tuple.

One idea is to do the summarization in two stages. First, insert a
placeholder tuple, with no real value in it. A query considers the
placeholder tuple the same as a missing tuple, i.e. it always considers
it a match. An insertion updates the placeholder tuple with the value
inserted, as if it were a regular mmtuple. After summarization has
finished scanning the page range, it turns the placeholder tuple into a
regular tuple, by unioning the placeholder value with the value formed
by scanning the heap.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
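Spelled out as code, the two-stage protocol might look as follows. Every
name here (mm_insert_placeholder, scan_heap_range, placeholder_value,
union_summaries, replace_placeholder, MMSummary) is a hypothetical
stand-in; only the sequence of steps comes from the message above.

	#include "postgres.h"
	#include "utils/rel.h"

	/* hypothetical stand-ins for the real summarization machinery */
	typedef struct MMSummary MMSummary;
	extern void mm_insert_placeholder(Relation idxRel, BlockNumber rangeStart);
	extern MMSummary *scan_heap_range(Relation heapRel,
									  BlockNumber start, BlockNumber end);
	extern MMSummary *placeholder_value(Relation idxRel, BlockNumber rangeStart);
	extern MMSummary *union_summaries(MMSummary *a, MMSummary *b);
	extern void replace_placeholder(Relation idxRel, BlockNumber rangeStart,
									MMSummary *final);

	static void
	summarize_range_sketch(Relation heapRel, Relation idxRel,
						   BlockNumber rangeStart, BlockNumber rangeEnd)
	{
		MMSummary  *scanned;

		/*
		 * Stage 1: publish a placeholder tuple. Queries treat it as
		 * matching everything; concurrent inserters union their values
		 * into it instead of skipping the range (step 4 in the race).
		 */
		mm_insert_placeholder(idxRel, rangeStart);

		/* scan the heap; inserts racing with us update the placeholder */
		scanned = scan_heap_range(heapRel, rangeStart, rangeEnd);

		/*
		 * Stage 2: turn the placeholder into a regular tuple by unioning
		 * what concurrent inserters accumulated with what the scan saw.
		 * Every tuple inserted during the scan is covered by one side or
		 * the other, which closes the race.
		 */
		replace_placeholder(idxRel, rangeStart,
							union_summaries(placeholder_value(idxRel, rangeStart),
											scanned));
	}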
Re: [HACKERS] Minmax indexes
I think there's a race condition in mminsert, if two backends insert a
tuple to the same heap page range concurrently. mminsert does this:

1. Fetch the MMtuple for the page range.
2. Check if any of the stored datums need updating.
3. Unlock the page.
4. Lock the page again in exclusive mode.
5. Update the tuple.

It's possible that two backends arrive at phase 3 at the same time,
with different values. For example, backend A wants to update the
minimum to contain 10, and backend B wants to update it to 5. Now, if
backend B gets to update the tuple first, to 5, backend A will update
the tuple to 10 when it gets the lock, which is wrong.

The simplest solution would be to get the buffer lock in exclusive mode
to begin with, so that you don't need to release it between steps 2 and
5. That might be a significant hit on concurrency, though, when most of
the insertions don't in fact have to update the value. Another idea is
to re-check the updated values after acquiring the lock in exclusive
mode, to see if they match the previous values.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 08/07/2014 05:52 PM, Michael Paquier wrote: > On Fri, Aug 8, 2014 at 9:47 AM, Josh Berkus wrote: >> On 08/07/2014 08:38 AM, Oleg Bartunov wrote: >>> +1 for BRIN ! >>> >>> On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs wrote: On 7 August 2014 14:53, Robert Haas wrote: A better description would be "block range index" since we are indexing a range of blocks (not just one block). Perhaps a better one would be simply "range index", which we could abbreviate to RIN or BRIN. >> >> How about Block Range Dynamic indexes? >> >> Or Range Usage Metadata indexes? >> >> You see what I'm getting at: >> >> BRanDy >> >> RUM >> >> ... to keep with our "new indexes" naming scheme ... > Not the best fit for kids, fine for grad students. But, it goes perfectly with our GIN and VODKA indexes. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Aug 7, 2014 at 7:58 AM, Robert Haas wrote: > range index might get confused with range types; block range index > seems better. I like summary, but I'm fine with block range index or > block filter index, too. +1 -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Aug 8, 2014 at 9:47 AM, Josh Berkus wrote: > On 08/07/2014 08:38 AM, Oleg Bartunov wrote: >> +1 for BRIN ! >> >> On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs wrote: >>> On 7 August 2014 14:53, Robert Haas wrote: >>> A better description would be "block range index" since we are >>> indexing a range of blocks (not just one block). Perhaps a better one >>> would be simply "range index", which we could abbreviate to RIN or >>> BRIN. > > How about Block Range Dynamic indexes? > > Or Range Usage Metadata indexes? > > You see what I'm getting at: > > BRanDy > > RUM > > ... to keep with our "new indexes" naming scheme ... Not the best fit for kids, fine for grad students. BRIN seems to be a perfect consensus, so +1 for it. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 08/07/2014 08:38 AM, Oleg Bartunov wrote: > +1 for BRIN ! > > On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs wrote: >> On 7 August 2014 14:53, Robert Haas wrote: >> A better description would be "block range index" since we are >> indexing a range of blocks (not just one block). Perhaps a better one >> would be simply "range index", which we could abbreviate to RIN or >> BRIN. How about Block Range Dynamic indexes? Or Range Usage Metadata indexes? You see what I'm getting at: BRanDy RUM ... to keep with our "new indexes" naming scheme ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 07/08/14 16:16, Simon Riggs wrote: A better description would be "block range index" since we are indexing a range of blocks (not just one block). Perhaps a better one would be simply "range index", which we could abbreviate to RIN or BRIN. +1 for block range index (BRIN) -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
2014-08-07 Oleg Bartunov : > +1 for BRIN ! +1, rolls off the tongue smoothly and captures the essence :-). Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
+1 for BRIN ! On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs wrote: > On 7 August 2014 14:53, Robert Haas wrote: >> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier >> wrote: >>> 2014-08-06 Claudio Freire : >>> So, I like blockfilter a lot. I change my vote to blockfilter ;) >>> >>> +1 for blockfilter, because it stresses the fact that the "physical" >>> arrangement of rows in blocks matters for this index. >> >> I don't like that quite as well as summary, but I'd prefer either to >> the current naming. > > Yes, "summary index" isn't good. I'm not sure where the block or the > filter part comes in though, so -1 to "block filter", not least > because it doesn't have a good abbreviation (bfin??). > > A better description would be "block range index" since we are > indexing a range of blocks (not just one block). Perhaps a better one > would be simply "range index", which we could abbreviate to RIN or > BRIN. > > -- > Simon Riggs http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Aug 7, 2014 at 10:16 AM, Simon Riggs wrote: > On 7 August 2014 14:53, Robert Haas wrote: >> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier >> wrote: >>> 2014-08-06 Claudio Freire : >>> So, I like blockfilter a lot. I change my vote to blockfilter ;) >>> >>> +1 for blockfilter, because it stresses the fact that the "physical" >>> arrangement of rows in blocks matters for this index. >> >> I don't like that quite as well as summary, but I'd prefer either to >> the current naming. > > Yes, "summary index" isn't good. I'm not sure where the block or the > filter part comes in though, so -1 to "block filter", not least > because it doesn't have a good abbreviation (bfin??). > > A better description would be "block range index" since we are > indexing a range of blocks (not just one block). Perhaps a better one > would be simply "range index", which we could abbreviate to RIN or > BRIN. range index might get confused with range types; block range index seems better. I like summary, but I'm fine with block range index or block filter index, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Simon Riggs wrote:
> On 7 August 2014 14:53, Robert Haas wrote:
>> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier wrote:
>>> 2014-08-06 Claudio Freire :
>>>
>>>> So, I like blockfilter a lot. I change my vote to blockfilter ;)
>>>
>>> +1 for blockfilter, because it stresses the fact that the "physical"
>>> arrangement of rows in blocks matters for this index.
>>
>> I don't like that quite as well as summary, but I'd prefer either to
>> the current naming.
>
> Yes, "summary index" isn't good. I'm not sure where the block or the
> filter part comes in though, so -1 to "block filter", not least
> because it doesn't have a good abbreviation (bfin??).

I was thinking just "blockfilter" (I did show a sample command).
Claudio explained the name downthread; personally, of all the options
suggested thus far, it's the one I like the most (including minmax).

At this point, the naming issue is what is keeping me from committing
this patch, so the quicker we can solve it, the merrier.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Aug 7, 2014 at 11:16 AM, Simon Riggs wrote: > On 7 August 2014 14:53, Robert Haas wrote: >> On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier >> wrote: >>> 2014-08-06 Claudio Freire : >>> So, I like blockfilter a lot. I change my vote to blockfilter ;) >>> >>> +1 for blockfilter, because it stresses the fact that the "physical" >>> arrangement of rows in blocks matters for this index. >> >> I don't like that quite as well as summary, but I'd prefer either to >> the current naming. > > Yes, "summary index" isn't good. I'm not sure where the block or the > filter part comes in though, so -1 to "block filter", not least > because it doesn't have a good abbreviation (bfin??). Block filter would refer to the index property that selects blocks, not tuples, and it does so through a "filter function" (for min-max, it's a range check, but for other opclasses it could be anything). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 7 August 2014 14:53, Robert Haas wrote: > On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier > wrote: >> 2014-08-06 Claudio Freire : >> >>> So, I like blockfilter a lot. I change my vote to blockfilter ;) >> >> +1 for blockfilter, because it stresses the fact that the "physical" >> arrangement of rows in blocks matters for this index. > > I don't like that quite as well as summary, but I'd prefer either to > the current naming. Yes, "summary index" isn't good. I'm not sure where the block or the filter part comes in though, so -1 to "block filter", not least because it doesn't have a good abbreviation (bfin??). A better description would be "block range index" since we are indexing a range of blocks (not just one block). Perhaps a better one would be simply "range index", which we could abbreviate to RIN or BRIN. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier wrote: > 2014-08-06 Claudio Freire : > >> So, I like blockfilter a lot. I change my vote to blockfilter ;) > > +1 for blockfilter, because it stresses the fact that the "physical" > arrangement of rows in blocks matters for this index. I don't like that quite as well as summary, but I'd prefer either to the current naming. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
2014-08-06 Claudio Freire : > So, I like blockfilter a lot. I change my vote to blockfilter ;) +1 for blockfilter, because it stresses the fact that the "physical" arrangement of rows in blocks matters for this index. Nicolas -- A. Because it breaks the logical sequence of discussion. Q. Why is top posting bad? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 1:55 PM, Alvaro Herrera wrote: > Claudio Freire wrote: >> On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera >> wrote: >> > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"? >> > CREATE INDEX foo ON t USING comprange (cols) >> > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh >> > -- or use an identifier with whitespace: >> > CREATE INDEX foo ON t USING "compressed range" (cols) >> >> The word you'd use there is not necessarily the one you use on the >> framework, since the framework applies to many such techniques, but >> the index type there is one specific one. >> >> The create command can still use minmax, or rangemap if you prefer >> that, while the framework's code uses summary or summarizing. > > I think you're confusing the AM name with the opclass name. The name > you specify in that part of the command is the access method name. You > can specify the opclass together with each column, like so: > > CREATE INDEX foo ON t USING blockfilter > (order_date date_minmax_ops, geometry gis_bbox_ops); Oh, uh... no, I'm not confusing them, but now I just realized how one would implement other classes of block filtering indexes, and yeah... you do it through the opclasses. I'm sticking to bloom filters: CREATE INDEX foo ON t USING blockfilter (order_date date_minmax_ops, path character_bloom_ops); Cool. Very cool. So, I like blockfilter a lot. I change my vote to blockfilter ;) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 1:35 PM, Bruce Momjian wrote: > On Wed, Aug 6, 2014 at 01:31:14PM -0300, Claudio Freire wrote: >> On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera >> wrote: >> > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"? >> > CREATE INDEX foo ON t USING comprange (cols) >> > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh >> > -- or use an identifier with whitespace: >> > CREATE INDEX foo ON t USING "compressed range" (cols) >> >> >> The word you'd use there is not necessarily the one you use on the >> framework, since the framework applies to many such techniques, but >> the index type there is one specific one. > > "Block filter" indexes? Nice one -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera wrote: > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"? > CREATE INDEX foo ON t USING comprange (cols) > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh > -- or use an identifier with whitespace: > CREATE INDEX foo ON t USING "compressed range" (cols) The word you'd use there is not necessarily the one you use on the framework, since the framework applies to many such techniques, but the index type there is one specific one. The create command can still use minmax, or rangemap if you prefer that, while the framework's code uses summary or summarizing. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 01:31:14PM -0300, Claudio Freire wrote: > On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera > wrote: > > CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"? > > CREATE INDEX foo ON t USING comprange (cols) > > CREATE INDEX foo ON t USING compressedrng (cols) -- ugh > > -- or use an identifier with whitespace: > > CREATE INDEX foo ON t USING "compressed range" (cols) > > > The word you'd use there is not necessarily the one you use on the > framework, since the framework applies to many such techniques, but > the index type there is one specific one. "Block filter" indexes? > The create command can still use minmax, or rangemap if you prefer > that, while the framework's code uses summary or summarizing. "Summary" sounds like materialized views to me. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera wrote:
> "Summary" seems good. If I get enough votes I can change it to that.
>
> CREATE INDEX foo ON t USING summary (cols)
>
> "Summarizing" seems weird on that command. Not sure about "compressed
> range", as you would have to use an abbreviation or run the words
> together.

Summarizing index sounds better to my ears, but both ideas based on
"summary" are quite succinct and to-the-point descriptions of what's
happening, so I vote for those.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Aug 5, 2014 at 7:55 PM, Josh Berkus wrote: > On 08/05/2014 04:41 PM, Alvaro Herrera wrote: >> I have chosen to keep the name "minmax", even if the opclasses now let >> one implement completely different things on top of it such as geometry >> bounding boxes and bloom filters (aka bitmap indexes). I don't see a >> need for a rename: essentially, in PR we can just say "we have these >> neat minmax indexes that other databases also have, but instead of just >> being used for integer data, they can also be used for geometry, GIS and >> bitmap indexes, so as always we're more powerful than everyone else when >> implementing new database features". > > Plus we haven't come up with a better name ... Several good suggestions have been made, like "summarizing" or "summary" indexes and "compressed range" indexes. I still really dislike the present name - you might think this is a type of index that has something to do with optimizing "min" and "max", but what it really is is a kind of small index for a big table. The current name couldn't make that less clear. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
FWIW I think I haven't responded appropriately to the points raised by
Heikki. Basically, as I see it there are three main items:

1. The revmap physical-to-logical mapping is too complex; let's use
something else.

We had the revmap originally in a separate fork. The current approach
grew out of the necessity of putting it in the main fork while ensuring
that fast access to individual pages is possible. There are of course
many ways to skin this cat; Heikki's proposal is to have it always
occupy the first few physical pages, rather than require a
logical-to-physical mapping table. To implement this he proposes to
move other pages out of the way as the index grows.

I don't really have much love for this idea. We can change how this is
implemented later in the cycle, if we find that a different approach is
better than my proposal. I don't want to spend endless time meddling
with this (and I definitely don't want to have this delay the eventual
commit of the patch.)

2. Vacuuming is not optimal.

Right now, to eliminate garbage index tuples we need to first scan the
revmap to figure out which tuples are unreferenced. There is a concern
that if there's an excess of dead tuples, the index becomes
unvacuumable because palloc() fails due to request size. This is
largely theoretical, because for this to happen there need to be
several million dead index tuples. As a minimal fix to alleviate this
problem without requiring a complete rework of vacuuming, we can cap
that palloc request to maintenance_work_mem and remove dead tuples in a
loop instead of trying to remove all of them in a single pass.

Another thing proposed was to store range numbers (or just heap page
numbers) within each index tuple. I felt that this would add more bloat
unnecessarily. However, there is some padding space in the index tuple
that maybe we can use to store range numbers. I will think some more
about how we can use this to simplify vacuuming.

3. Avoid MMTuple, as it is just unnecessary extra complexity.

The main thing that MMTuple adds is not the fact that we save 2 bytes
by storing a BlockNumber as-is instead of within a TID field. Instead,
it's that we can construct and deconstruct using our own design, which
means we can use however many Datum entries we want and however many
"null" flags. In normal heap and index tuples, there are always the
same number of datums/nulls. In minmax, the number of nulls is twice
the number of indexed columns; the number of datum values is determined
by how many datum values are stored per opclass ("sortable" opclasses
store 2 columns, but geometry would store only one). If we were to use
regular IndexTuples, we would lose that .. and I have no idea how it
would work.

In other words, MMTuples look fine to me.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
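The capped-allocation loop from item 2 might look roughly like this.
The helpers collect_dead_tuples() and remove_index_tuples() are
hypothetical, and the real patch may batch differently; the point is
only that the dead-TID buffer never exceeds maintenance_work_mem, so an
index with millions of dead tuples is vacuumed in several passes rather
than with one huge palloc().

	#include "postgres.h"
	#include "miscadmin.h"
	#include "storage/itemptr.h"
	#include "utils/rel.h"

	/* hypothetical helpers standing in for the real vacuum internals */
	extern long collect_dead_tuples(Relation idxRel, ItemPointer deadtids,
									long max_tids);
	extern void remove_index_tuples(Relation idxRel, ItemPointer deadtids,
									long ndead);

	static void
	mm_vacuum_sketch(Relation idxRel)
	{
		/* maintenance_work_mem is in kilobytes */
		long		max_tids = ((Size) maintenance_work_mem * 1024L) /
								sizeof(ItemPointerData);
		ItemPointer	deadtids = palloc(max_tids * sizeof(ItemPointerData));
		long		ndead;

		do
		{
			/* gather up to max_tids unreferenced tuples from the revmap scan */
			ndead = collect_dead_tuples(idxRel, deadtids, max_tids);
			if (ndead > 0)
				remove_index_tuples(idxRel, deadtids, ndead);
		} while (ndead == max_tids);	/* buffer was full: there may be more */

		pfree(deadtids);
	}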
Re: [HACKERS] Minmax indexes
On 08/05/2014 04:41 PM, Alvaro Herrera wrote: > I have chosen to keep the name "minmax", even if the opclasses now let > one implement completely different things on top of it such as geometry > bounding boxes and bloom filters (aka bitmap indexes). I don't see a > need for a rename: essentially, in PR we can just say "we have these > neat minmax indexes that other databases also have, but instead of just > being used for integer data, they can also be used for geometry, GIS and > bitmap indexes, so as always we're more powerful than everyone else when > implementing new database features". Plus we haven't come up with a better name ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 07/10/2014 12:41 AM, Alvaro Herrera wrote: Heikki Linnakangas wrote: On 06/23/2014 08:07 PM, Alvaro Herrera wrote: I feel that the below would nevertheless be simpler: I wonder if it would be simpler to just always store the revmap pages in the beginning of the index, before any other pages. Finding the revmap page would then be just as easy as with a separate fork. When the table/index is extended so that a new revmap page is needed, move the existing page at that block out of the way. Locking needs some consideration, but I think it would be feasible and simpler than you have now. Moving index items around is not easy, because you'd have to adjust the revmap to rewrite the item pointers. Hmm. Two alternative schemes come to mind: 1. Move each index tuple off the page individually, updating the revmap while you do it, until the page is empty. Updating the revmap for a single index tuple isn't difficult; you have to do it anyway when an index tuple is replaced. (MMTuples don't contain a heap block number ATM, but IMHO they should, see below) 2. Store the new block number of the page that you moved out of the way in the revmap page, and leave the revmap pointers unchanged. The revmap pointers can be updated later, lazily. Both of those seem pretty straightforward. The trouble I have with moving blocks around to make space is that it would cause the index to have periodic hiccups to make room for the new revmap pages. One nice property that these indexes are supposed to have is that the effect on insertion times should be pretty minimal. That would cease to be the case if we have to do your proposed block moves. Approach 2 above is fairly quick, quick enough that no-one would notice the "hiccup". Moving the tuples individually (approach 1) would be slower. ISTM that when the old tuple cannot be updated in-place, the new index tuple is inserted with mm_doinsert(), but the old tuple is never deleted. It's deleted by the next vacuum. Ah I see. Vacuum reads the whole index, and builds an in-memory hash table that contains an ItemPointerData for every tuple in the index. Doesn't that require a lot of memory, for a large index? That might be acceptable - you ought to have plenty of RAM if you're pushing around multi-terabyte tables - but it would nevertheless be nice to not have a hard requirement for something as essential as vacuum. I guess if you're expecting that pages_per_range=1 is a common case, yeah it might become an issue eventually. Not sure, but I find it easier to think of the patch that way. In any case, it would be nice to avoid the problem, even if it's not common. One idea I just had is to have a bit for each index tuple, which is set whenever the revmap no longer points to it. That way, vacuuming is much easier: just scan the index and delete all tuples having that bit set. The bit needs to be set atomically with the insertion of the new tuple, so why not just remove the old tuple right away? Wouldn't it be simpler to remove the old tuple atomically with inserting the new tuple and updating the revmap? Or at least mark the old tuple as deletable, so that vacuum can just delete it, without building the large hash table to determine that it's deletable. Yes, it might be simpler, but it'd require dirtying more pages on insertions (and holding more page-level locks, for longer. Not good for concurrent access). I wouldn't worry much about the performance and concurrency of this operation. 
Remember that the majority of updates are expected to not have to update the index, otherwise the minmax index will degenerate quickly and performance will suck anyway. And even when updating the index is needed, in most cases the new tuple fits on the same page, after removing the old one. So the case where you have to insert a new index tuple, remove the old one (or mark it dead), and update the revmap to point to the new tuple, is rare. I'm quite surprised by the use of LockTuple on the index tuples. I think the main reason for needing that is the fact that MMTuple doesn't store the heap (range) block number that the tuple points to: LockTuple is required to ensure that the tuple doesn't go away while a scan is following a pointer from the revmap to it. If the MMTuple contained the BlockNumber, a scan could check that and go back to the revmap if it doesn't match. Alternatively, you could keep the revmap page locked when you follow a pointer to the regular index page. There's the intention that these accesses be kept as concurrent as possible; this is why we don't want to block the whole page. Locking individual TIDs is fine in this case (unlike in SELECT FOR UPDATE) because we can only lock a single tuple in any one index scan, so there's no unbounded growth of the lock table. I prefer not to have BlockNumbers in index tuples, because that would make them larger for not much gain. That data would mostly be redundant, and would be necessary only for vacuuming.
Re: [HACKERS] Minmax indexes
On Wed, Jul 9, 2014 at 5:16 PM, Alvaro Herrera wrote: > The way it works now, each opclass needs to have three support > procedures; I've called them getOpers, maybeUpdateValues, and compare. > (I realize these names are pretty bad, and will be changing them.) > getOpers is used to obtain information about what is stored for that > data type; it says how many datum values are stored for a column of that > type (two for sortable: min and max), and how many operators it needs > set up. Then, the generic code fills in a MinmaxDesc(riptor) and creates > an initial DeformedMMTuple (which is a rather ugly name for a minmax > tuple held in memory). The maybeUpdateValues amproc can then be called > when there's a new heap tuple, which updates the DeformedMMTuple to > account for the new tuple (in essence, it's a union of the original > values and the new tuple). This can be done repeatedly (when a new > index is being created) or only once (when a new heap tuple is inserted > into an existing index). There is no need for an "aggregate". > > This DeformedMMTuple can easily be turned into the on-disk > representation; there is no hardcoded assumption on the number of index > values stored per heap column, so it is possible to build an opclass > that stores a bounding box column for a geometry heap column, for > instance. > > Then we have the "compare" amproc. This is used during index scans; > after extracting an index tuple, it is turned into DeformedMMTuple, and > the "compare" amproc for each column is called with the values of scan > keys. (Now that I think about this, it seems pretty much what > "consistent" is for GiST opclasses). A true return value indicates that > the scan key matches the page range boundaries and thus all pages in the > range are added to the output TID bitmap. This sounds really great. I agree that it needs some renaming. I think renaming what you are calling "compare" to "consistent" would be an excellent idea, to match GiST. "maybeUpdateValues" sounds like it does the equivalent of GiST's "compress" on the new value followed by a "union" with the existing summary item. I don't think it's necessary to separate those out, though. You could perhaps call it something like "add_item". Also, FWIW, I liked Peter's idea of calling these "summarizing indexes" or perhaps "summary" would be a bit shorter and mean the same thing. "minmax" wouldn't be the end of the world, but since you've gone to the trouble of making this more generic I think giving it a more generic name would be a very good idea. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
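To make the sortable case concrete, the core of maybeUpdateValues / "add_item" for a min/max opclass is just two comparisons. The following is only a rough sketch: it calls the opclass's "<" and ">" operators through FunctionCall2() and ignores the real DeformedMMTuple machinery, nulls, and collations.

    /* widen the stored [min,max] pair to cover a newly inserted value */
    static bool
    minmax_add_item(FmgrInfo *lt, FmgrInfo *gt, Datum newval, Datum *stored)
    {
        bool    changed = false;

        if (DatumGetBool(FunctionCall2(lt, newval, stored[0])))
        {
            stored[0] = newval;     /* sorts below the current minimum */
            changed = true;
        }
        if (DatumGetBool(FunctionCall2(gt, newval, stored[1])))
        {
            stored[1] = newval;     /* sorts above the current maximum */
            changed = true;
        }
        return changed;
    }

When it returns false the index tuple needs no update at all, which is the common case these indexes are betting on.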
Re: [HACKERS] Minmax indexes
On Fri, Jul 11, 2014 at 3:47 PM, Greg Stark wrote: > On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire > wrote: >> Marking as read-only is ok, or emitting a NOTICE so that if anyone >> changes those parameters that change the shape of the index, they know >> it needs a rebuild would be OK too. Both mechanisms work for me. > > We don't actually have any of these mechanisms. They wouldn't be bad > things to have but I don't think we should gate adding new types of > indexes on adding them. In particular, the index could just hard code > a value for these parameters and having them be parameterized is > clearly better even if that doesn't produce all the warnings or > rebuild things automatically or whatever. No, I agree, it's just a nice-to-have. But at least the docs should mention it. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire wrote: > Marking as read-only is ok, or emitting a NOTICE so that if anyone > changes those parameters that change the shape of the index, they know > it needs a rebuild would be OK too. Both mechanisms work for me. We don't actually have any of these mechanisms. They wouldn't be bad things to have but I don't think we should gate adding new types of indexes on adding them. In particular, the index could just hard code a value for these parameters and having them be parameterized is clearly better even if that doesn't produce all the warnings or rebuild things automatically or whatever. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 4:20 PM, Alvaro Herrera wrote: > Claudio Freire wrote: >> On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera >> wrote: >> > Another thing I noticed is that version 8 of the patch blindly believed >> > the "pages_per_range" declared in catalogs. This meant that if somebody >> > did "alter index foo set pages_per_range=123" the index would >> > immediately break (i.e. return corrupted results when queried). I have >> > fixed this by storing the pages_per_range value used to construct the >> > index in the metapage. Now if you do the ALTER INDEX thing, the new >> > value is only used when the index is recreated by REINDEX. >> >> This seems a lot like parameterizing. > > I don't understand what that means -- care to elaborate? We've been talking about bloom filters, and how their shape differs according to the parameters of the bloom filter (number of hashes, hash type, etc). But after seeing this case of pages_per_range, I noticed it's an effective-enough mechanism. Like: CREATE INDEX ix_blah ON some_table USING bloom (somecol) WITH (BLOOM_HASHES=15, BLOOM_BUCKETS=1024, PAGES_PER_RANGE=64); Marking as read-only is ok, or emitting a NOTICE so that if anyone changes those parameters that change the shape of the index, they know it needs a rebuild would be OK too. Both mechanisms work for me. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Fujii Masao wrote: > On Thu, Jul 10, 2014 at 6:16 AM, Alvaro Herrera > wrote: > > Here's a new version of this patch, which is more generic than the original > > versions, and similar to what you describe. > > I've not read the discussion so far at all, but I found a problem > when I played with this patch. Sorry if this has already been discussed. > > =# create table test as select num from generate_series(1,10) num; > SELECT 10 > =# create index testidx on test using minmax (num); > CREATE INDEX > =# alter table test alter column num type text; > ERROR: could not determine which collation to use for string comparison > HINT: Use the COLLATE clause to set the collation explicitly. Ah, yes, I need to pass down collation OIDs to comparison functions. That's marked as XXX in various places in the code. Sorry I forgot to mention that. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 10 July 2014 00:13, Alvaro Herrera wrote: > Josh Berkus wrote: >> On 07/09/2014 02:16 PM, Alvaro Herrera wrote: >> > The way it works now, each opclass needs to have three support >> > procedures; I've called them getOpers, maybeUpdateValues, and compare. >> > (I realize these names are pretty bad, and will be changing them.) >> >> I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update >> the values, maybe not." ;-) > > :-) Well, that's exactly what happens. If we insert a new tuple into > the table, and the existing summarizing tuple (to use Peter's term) > already covers it, then we don't need to update the index tuple at all. > What this name doesn't say is what values are to be maybe-updated. There are lots of functions that maybe-do-things; that's just modular programming. Not sure we need to prefix things with maybe to explain that, otherwise we'd have maybeXXX everywhere. A more descriptive name would be MaintainIndexBounds() or similar. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 9 July 2014 23:54, Peter Geoghegan wrote: > On Wed, Jul 9, 2014 at 2:16 PM, Alvaro Herrera > wrote: >> All this being said, I'm sticking to the name "Minmax indexes". There >> was a poll in pgsql-advocacy >> http://www.postgresql.org/message-id/53a0b4f8.8080...@agliodbs.com >> about a new name, but there were no suggestions supported by more than >> one person. If a brilliant new name comes up, I'm open to changing it. > > How about "summarizing indexes"? That seems reasonably descriptive. -1 for another name change. That boat sailed some months back. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 6:16 AM, Alvaro Herrera wrote: > Claudio Freire wrote: > >> An aggregate to generate a "compressed set" from several values >> A function which adds a new value to the "compressed set" and returns >> the new "compressed set" >> A function which tests if a value is in a "compressed set" >> A function which tests if a "compressed set" overlaps another >> "compressed set" of equal type >> >> If you can define different compressed sets, you can use this to >> generate both min/max indexes as well as bloom filter indexes. Whether >> we'd want to have both is perhaps questionable, but having the ability >> to is probably desirable. > > Here's a new version of this patch, which is more generic than the original > versions, and similar to what you describe. I've not read the discussion so far at all, but I found a problem when I played with this patch. Sorry if this has already been discussed. =# create table test as select num from generate_series(1,10) num; SELECT 10 =# create index testidx on test using minmax (num); CREATE INDEX =# alter table test alter column num type text; ERROR: could not determine which collation to use for string comparison HINT: Use the COLLATE clause to set the collation explicitly. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 10:29 PM, Alvaro Herrera wrote: > > What I think should happen is that if the value is changed, the index > should be rebuilt right there. I disagree. It would be a non-orthogonal interface if ALTER TABLE sometimes causes the index to be rebuilt and sometimes just makes a configuration change. I already see a lot of user confusion when some ALTER TABLE commands rewrite the table and some are quick metadata changes. Especially in this case where the type of configuration being changed is just an internal storage parameter and the user-visible shape of the index is unchanged it would be weird to rebuild the index. IMHO the "right" thing to do is just to say this parameter is read-only and have the AM throw an error when the user changes it. But even that would require an AM callback for the AM to even know about the change. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 2:30 PM, Jaime Casanova wrote: > On Thu, Jul 10, 2014 at 3:50 PM, Josh Berkus wrote: > > On 07/10/2014 12:20 PM, Alvaro Herrera wrote: > >>> So I guess the only thing left is to issue a NOTICE when said alter > >>> > takes place (I don't see that on the patch, but maybe it's there?) > >> That's not in the patch. I don't think we have an appropriate place to > >> emit such a notice. > > > > What do you mean by "don't have an appropriate place"? > > > > The suggestion is that when a user does: > > > > ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100 > > > > they should get a NOTICE: > > > > "NOTICE: changes to pages per range will not take effect until the index > > is REINDEXed" > > > > otherwise, we're going to get a lot of "I Altered the pages per range, > > but performance didn't change" emails. > > > > How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or > from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ? > > we don't get a notice for these cases either > I think those are different. They don't rewrite existing data in the table, but they are applied to new (and updated) data. My understanding is that changing PAGES_PER_RANGE will have no effect on future data until a re-index is done, even if the entire table eventually turns over. Cheers, Jeff
Re: [HACKERS] Minmax indexes
On 07/10/2014 02:30 PM, Jaime Casanova wrote: > How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or > from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ? > > we don't get a notice for these cases either Good idea. We should also emit notices for those. Well, maybe not for fillfactor. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Josh Berkus wrote: > On 07/10/2014 12:20 PM, Alvaro Herrera wrote: > >> So I guess the only thing left is to issue a NOTICE when said alter > >> > takes place (I don't see that on the patch, but maybe it's there?) > > That's not in the patch. I don't think we have an appropriate place to > > emit such a notice. > > What do you mean by "don't have an appropriate place"? What I think should happen is that if the value is changed, the index should be rebuilt right there. But there is no way to have this occur from the generic tablecmds.c code. Maybe we should extend the AM interface so that AMs are notified of changes and can take action. Inserting AM-specific code into tablecmds.c seems pretty wrong to me -- existing stuff for WITH CHECK OPTION views notwithstanding. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Thu, Jul 10, 2014 at 3:50 PM, Josh Berkus wrote: > On 07/10/2014 12:20 PM, Alvaro Herrera wrote: >>> So I guess the only thing left is to issue a NOTICE when said alter >>> > takes place (I don't see that on the patch, but maybe it's there?) >> That's not in the patch. I don't think we have an appropriate place to >> emit such a notice. > > What do you mean by "don't have an appropriate place"? > > The suggestion is that when a user does: > > ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100 > > they should get a NOTICE: > > "NOTICE: changes to pages per range will not take effect until the index > is REINDEXed" > > otherwise, we're going to get a lot of "I Altered the pages per range, > but performance didn't change" emails. > How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ? we don't get a notice for these cases either -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación Phone: +593 4 5107566 Cell: +593 987171157 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 07/10/2014 12:20 PM, Alvaro Herrera wrote: >> So I guess the only thing left is to issue a NOTICE when said alter >> > takes place (I don't see that on the patch, but maybe it's there?) > That's not in the patch. I don't think we have an appropriate place to > emit such a notice. What do you mean by "don't have an appropriate place"? The suggestion is that when a user does: ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100 they should get a NOTICE: "NOTICE: changes to pages per range will not take effect until the index is REINDEXed" otherwise, we're going to get a lot of "I Altered the pages per range, but performance didn't change" emails. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Claudio Freire wrote: > On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera > wrote: > > Another thing I noticed is that version 8 of the patch blindly believed > > the "pages_per_range" declared in catalogs. This meant that if somebody > > did "alter index foo set pages_per_range=123" the index would > > immediately break (i.e. return corrupted results when queried). I have > > fixed this by storing the pages_per_range value used to construct the > > index in the metapage. Now if you do the ALTER INDEX thing, the new > > value is only used when the index is recreated by REINDEX. > > This seems a lot like parameterizing. I don't understand what that means -- care to elaborate? > So I guess the only thing left is to issue a NOTICE when said alter > takes place (I don't see that on the patch, but maybe it's there?) That's not in the patch. I don't think we have an appropriate place to emit such a notice. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera wrote: > Another thing I noticed is that version 8 of the patch blindly believed > the "pages_per_range" declared in catalogs. This meant that if somebody > did "alter index foo set pages_per_range=123" the index would > immediately break (i.e. return corrupted results when queried). I have > fixed this by storing the pages_per_range value used to construct the > index in the metapage. Now if you do the ALTER INDEX thing, the new > value is only used when the index is recreated by REINDEX. This seems a lot like parameterizing. So I guess the only thing left is to issue a NOTICE when said alter takes place (I don't see that on the patch, but maybe it's there?) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Jul 9, 2014 at 10:16 PM, Alvaro Herrera wrote: > there is no hardcoded assumption on the number of index > values stored per heap column, so it is possible to build an opclass > that stores a bounding box column for a geometry heap column, for > instance. I think the more Postgresy thing to do is to store one datum per heap column. It's up to the opclass to find or make a composite data type that stores all the necessary state. So you could make a minmax_accum data type like NumericAggState in numeric.c:numeric_accum() or the array of floats in float8_accum. For a bounding box a 2d geometric min/max index could use the "box" data type for example. The way you've done it seems more convenient but there's something to be said for using the same style for different areas. A single bounding box accumulator function would probably suffice for both an aggregate and index opclass for example. But this sounds pretty great. I think it would let me do the bloom filter index I had in mind fairly straightforwardly. The result would be something very similar to a bitmap index. I'm not sure if there's a generic term that includes bitmap indexes or other summary functions like bounding boxes (which min/max is basically -- a 1D bounding box). Thanks a lot for listening and being so open, I think what you describe is a lot more flexible than what you had before and I can see some pretty great things coming out of it (including min/max itself of course). -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
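To illustrate the composite-state idea: a 2D geometric opclass whose summary is a plain box would "add" a point by stretching the box. The following is a sketch against the BOX/Point layout in PostgreSQL's utils/geo_decls.h, not proposed code:

    #include "utils/geo_decls.h"    /* BOX and Point */

    /* stretch the stored bounding box to cover one more heap point */
    static void
    bbox_add_point(BOX *summary, const Point *pt)
    {
        summary->low.x  = Min(summary->low.x,  pt->x);
        summary->low.y  = Min(summary->low.y,  pt->y);
        summary->high.x = Max(summary->high.x, pt->x);
        summary->high.y = Max(summary->high.y, pt->y);
    }

The same step works as an aggregate transition function and as the index "union" step, which is the appeal of keeping one datum per heap column.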
Re: [HACKERS] Minmax indexes
Josh Berkus wrote: > On 07/09/2014 02:16 PM, Alvaro Herrera wrote: > > The way it works now, each opclass needs to have three support > > procedures; I've called them getOpers, maybeUpdateValues, and compare. > > (I realize these names are pretty bad, and will be changing them.) > > I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update > the values, maybe not." ;-) :-) Well, that's exactly what happens. If we insert a new tuple into the table, and the existing summarizing tuple (to use Peter's term) already covers it, then we don't need to update the index tuple at all. What this name doesn't say is what values are to be maybe-updated. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Wed, Jul 9, 2014 at 2:16 PM, Alvaro Herrera wrote: > All this being said, I'm sticking to the name "Minmax indexes". There > was a poll in pgsql-advocacy > http://www.postgresql.org/message-id/53a0b4f8.8080...@agliodbs.com > about a new name, but there were no suggestions supported by more than > one person. If a brilliant new name comes up, I'm open to changing it. How about "summarizing indexes"? That seems reasonably descriptive. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 07/09/2014 02:16 PM, Alvaro Herrera wrote: > The way it works now, each opclass needs to have three support > procedures; I've called them getOpers, maybeUpdateValues, and compare. > (I realize these names are pretty bad, and will be changing them.) I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update the values, maybe not." ;-) -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Heikki Linnakangas wrote: > On 06/23/2014 08:07 PM, Alvaro Herrera wrote: > I feel that the below would nevertheless be simpler: > > >>I wonder if it would be simpler to just always store the revmap > >>pages in the beginning of the index, before any other pages. Finding > >>the revmap page would then be just as easy as with a separate fork. > >>When the table/index is extended so that a new revmap page is > >>needed, move the existing page at that block out of the way. Locking > >>needs some consideration, but I think it would be feasible and > >>simpler than you have now. > > > >Moving index items around is not easy, because you'd have to adjust the > >revmap to rewrite the item pointers. > > Hmm. Two alternative schemes come to mind: > > 1. Move each index tuple off the page individually, updating the > revmap while you do it, until the page is empty. Updating the revmap > for a single index tuple isn't difficult; you have to do it anyway > when an index tuple is replaced. (MMTuples don't contain a heap > block number ATM, but IMHO they should, see below) > > 2. Store the new block number of the page that you moved out of the > way in the revmap page, and leave the revmap pointers unchanged. The > revmap pointers can be updated later, lazily. > > Both of those seem pretty straightforward. The trouble I have with moving blocks around to make space is that it would cause the index to have periodic hiccups to make room for the new revmap pages. One nice property that these indexes are supposed to have is that the effect on insertion times should be pretty minimal. That would cease to be the case if we have to do your proposed block moves. > >>ISTM that when the old tuple cannot be updated in-place, the new > >>index tuple is inserted with mm_doinsert(), but the old tuple is > >>never deleted. > > > >It's deleted by the next vacuum. > > Ah I see. Vacuum reads the whole index, and builds an in-memory hash > table that contains an ItemPointerData for every tuple in the index. > Doesn't that require a lot of memory, for a large index? That might > be acceptable - you ought to have plenty of RAM if you're pushing > around multi-terabyte tables - but it would nevertheless be nice to > not have a hard requirement for something as essential as vacuum. I guess if you're expecting that pages_per_range=1 is a common case, yeah it might become an issue eventually. One idea I just had is to have a bit for each index tuple, which is set whenever the revmap no longer points to it. That way, vacuuming is much easier: just scan the index and delete all tuples having that bit set. No need for this hash table stuff. I am still concerned about adding more overhead whenever a page range is modified, so that insertions in the table continue to be fast. If we're going to dirty the index every time, it might not be so fast anymore. But then maybe I'm worrying about nothing; I will have to measure how much slower it is. > Wouldn't it be simpler to remove the old tuple atomically with > inserting the new tuple and updating the revmap? Or at least mark > the old tuple as deletable, so that vacuum can just delete it, > without building the large hash table to determine that it's > deletable. Yes, it might be simpler, but it'd require dirtying more pages on insertions (and holding more page-level locks, for longer. Not good for concurrent access). > I'm quite surprised by the use of LockTuple on the index tuples. 
I think the main reason for needing that is the fact that MMTuple doesn't store the heap (range) block number that the tuple points to: LockTuple is required to ensure that the tuple doesn't go away while a scan is following a pointer from the revmap to it. If the MMTuple contained the BlockNumber, a scan could check that and go back to the revmap if it doesn't match. Alternatively, you could keep the revmap page locked when you follow a pointer to the regular index page. There's the intention that these accesses be kept as concurrent as possible; this is why we don't want to block the whole page. Locking individual TIDs is fine in this case (unlike in SELECT FOR UPDATE) because we can only lock a single tuple in any one index scan, so there's no unbounded growth of the lock table. I prefer not to have BlockNumbers in index tuples, because that would make them larger for not much gain. That data would mostly be redundant, and would be necessary only for vacuuming. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
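To spell out the stale-bit scheme floated above: vacuum becomes one sequential pass with no hash table. In the sketch below, MMTUPLE_STALE, the flags field, and the single-offset PageIndexDeleteNoCompact() signature are all assumptions for illustration, and WAL-logging is omitted:

    BlockNumber blk;

    for (blk = first_data_block; blk < nblocks; blk++)
    {
        Buffer       buf = ReadBuffer(index, blk);
        Page         page;
        OffsetNumber off, maxoff;

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        page = BufferGetPage(buf);
        maxoff = PageGetMaxOffsetNumber(page);
        for (off = FirstOffsetNumber; off <= maxoff; off++)
        {
            MMTuple *tup = (MMTuple *)
                PageGetItem(page, PageGetItemId(page, off));

            /* bit was set when the revmap was repointed elsewhere */
            if (tup->flags & MMTUPLE_STALE)
                PageIndexDeleteNoCompact(page, &off, 1);
        }
        UnlockReleaseBuffer(buf);
    }

The cost moves to the insertion path, since setting the bit dirties one more page, which is exactly the trade-off being debated here.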
Re: [HACKERS] Minmax indexes
On Thu, Jun 19, 2014 at 12:32 PM, Greg Stark wrote: > On Wed, Jun 18, 2014 at 4:51 AM, Heikki Linnakangas > wrote: >> Implementing something is a good way to demonstrate how it would look like. >> But no, I don't insist on implementing every possible design whenever a new >> feature is proposed. >> >> I liked Greg's sketch of what the opclass support functions would be. It >> doesn't seem significantly more complicated than what's in the patch now. > > As a counter-point to my own point there will be nothing stopping us > in the future from generalizing things. Dealing with catalogs is > mostly book-keeping headaches and careful work. it's something that > might be well-suited for a GSOC or first patch from someone looking to > familiarize themselves with the system architecture. It's hard to > invent a whole new underlying infrastructure at the same time as > dealing with all that book-keeping and it's hard for someone > familiarizing themselves with the system to also have a great new > idea. Having tasks like this that are easy to explain and that mentor > understands well can be easier to manage than tasks where the newcomer > has some radical new idea. Generalizing this in the future would be highly likely to change the on-disk format for existing indexes, which would be a problem for pg_upgrade. I think we will likely be stuck with whatever the initial on-disk format looks like for a very long time, which is why I think we need to try rather hard to get this right the first time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 06/23/2014 08:07 PM, Alvaro Herrera wrote: Heikki Linnakangas wrote: With 8k block size, that's just enough to cover the full range of 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each regular revmap page can store about 8192/6 = 1365 item pointers, each array revmap page can store about 8192/4 = 2048 block references, and the size of the top array is 8192/4. That's just enough; to store the required number of array pages in the top array, the array needs to be (2^32/1365)/2048 = 1536 elements large. But with 4k or smaller blocks, it's not enough. Yeah. As I said elsewhere, actual useful values are likely to be close to the read-ahead setting of the underlying disk; by default that'd be 16 pages (128kB), but I think it's common advice to increase the kernel setting to improve performance. My gut feeling is that it might well be best to set pages_per_range=1. Even if you do the same amount of I/O, thanks to kernel read-ahead, you might still avoid processing a lot of tuples. But I would need to see some benchmarks to know. I don't think we need to prevent minmax indexes with pages_per_range=1, but I don't think we need to ensure that that setting works with the largest tables, either, because it doesn't make any sense to set it up like that. Also, while there are some recommendations to set up a system with larger page sizes (32kB), I have never seen any recommendation to set it lower. It wouldn't make sense to build a system that has very large tables and use a smaller page size. So in other words, yes, you're correct that the mechanism doesn't work in some cases (small page size and index configured for highest level of detail), but the conditions are such that I don't think it matters. ISTM the thing to do here is to do the math at index creation time, and if we find that we don't have enough space in the metapage for all array revmap pointers we need, bail out and require the index to be created with a larger pages_per_range setting. Yeah, I agree that would be acceptable. I feel that the below would nevertheless be simpler: I wonder if it would be simpler to just always store the revmap pages in the beginning of the index, before any other pages. Finding the revmap page would then be just as easy as with a separate fork. When the table/index is extended so that a new revmap page is needed, move the existing page at that block out of the way. Locking needs some consideration, but I think it would be feasible and simpler than you have now. Moving index items around is not easy, because you'd have to adjust the revmap to rewrite the item pointers. Hmm. Two alternative schemes come to mind: 1. Move each index tuple off the page individually, updating the revmap while you do it, until the page is empty. Updating the revmap for a single index tuple isn't difficult; you have to do it anyway when an index tuple is replaced. (MMTuples don't contain a heap block number ATM, but IMHO they should, see below) 2. Store the new block number of the page that you moved out of the way in the revmap page, and leave the revmap pointers unchanged. The revmap pointers can be updated later, lazily. Both of those seem pretty straightforward. I have followed the suggestion by Amit to overwrite the index tuple when a new heap tuple is inserted, instead of creating a separate index tuple. This saves a lot of index bloat. This required a new entry point in bufpage.c, PageOverwriteItemData(). 
bufpage.c also has a new function PageIndexDeleteNoCompact which is similar in spirit to PageIndexMultiDelete except that item pointers do not change. This is necessary because the revmap stores item pointers, and such references would break if we were to renumber items in index pages. ISTM that when the old tuple cannot be updated in-place, the new index tuple is inserted with mm_doinsert(), but the old tuple is never deleted. It's deleted by the next vacuum. Ah I see. Vacuum reads the whole index, and builds an in-memory hash table that contains an ItemPointerData for every tuple in the index. Doesn't that require a lot of memory, for a large index? That might be acceptable - you ought to have plenty of RAM if you're pushing around multi-terabyte tables - but it would nevertheless be nice to not have a hard requirement for something as essential as vacuum. In addition to the hash table, remove_deletable_tuples() pallocs an array to hold an ItemPointer for every index tuple about to be removed. A single palloc is limited to 1GB, so that will fail outright if there are more than ~179 million dead index tuples. You're unlikely to hit that in practice, but if you do, you'll never be able to vacuum the index. So that's not very nice. Wouldn't it be simpler to remove the old tuple atomically with inserting the new tuple and updating the revmap? Or at least mark the old tuple as deletable, so that vacuum can just delete it, without building the large hash table to determine that it's deletable.
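For reference, the ~179 million figure falls out of the allocator cap: a single palloc() is limited to MaxAllocSize (one byte short of 1GB), and each dead index tuple costs one 6-byte ItemPointerData:

    (1024 * 1024 * 1024 - 1) / 6 = 178,956,970 dead index tuples

At pages_per_range=1 with 8k blocks, that corresponds to roughly 1.4TB of turned-over heap; a corner case, but an index that can never be vacuumed is a bad failure mode.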
Re: [HACKERS] Minmax indexes
Heikki Linnakangas wrote: > Some comments, aside from the design wrt. bounding boxes etc. : Thanks. I haven't commented on that sub-thread because I think it's possible to come up with a reasonable design that solves the issue by adding a couple of amprocs. I need to do some more thinking to ensure it is really workable, and then I'll post my ideas. > On 06/15/2014 05:34 AM, Alvaro Herrera wrote: > >Robert Haas wrote: > If I understand the code correctly, the revmap is a three-level deep > structure. The bottom level consists of "regular revmap pages", and > each regular revmap page is filled with ItemPointerDatas, which > point to the index tuples. The middle level consists of "array > revmap pages", and each array revmap page contains an array of > BlockNumbers of the "regular revmap" pages. The top level is an > array of BlockNumbers of the array revmap pages, and it is stored in > the metapage. Yep, that's correct. Essentially, we still have the revmap as a linear space (containing TIDs); the other two levels on top of that are only there to enable locating the physical page numbers for each revmap logical page. I make one exception: the first logical revmap page is always stored in page 1, to optimize the case of a smallish table (~1360 page ranges; approximately 1.3 gigabytes of data at 128 pages per range, or 170 megabytes at 16 pages per range.) Each page has a page header (24 bytes) and special space (4 bytes), so it has 8192-28=8164 bytes available for data, so 1360 item pointers. > With 8k block size, that's just enough to cover the full range of > 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each > regular revmap page can store about 8192/6 = 1365 item pointers, > each array revmap page can store about 8192/4 = 2048 block > references, and the size of the top array is 8192/4. That's just > enough; to store the required number of array pages in the top > array, the array needs to be (2^32/1365)/2048 = 1536 elements large. > > But with 4k or smaller blocks, it's not enough. Yeah. As I said elsewhere, actual useful values are likely to be close to the read-ahead setting of the underlying disk; by default that'd be 16 pages (128kB), but I think it's common advice to increase the kernel setting to improve performance. I don't think we need to prevent minmax indexes with pages_per_range=1, but I don't think we need to ensure that that setting works with the largest tables, either, because it doesn't make any sense to set it up like that. Also, while there are some recommendations to set up a system with larger page sizes (32kB), I have never seen any recommendation to set it lower. It wouldn't make sense to build a system that has very large tables and use a smaller page size. So in other words, yes, you're correct that the mechanism doesn't work in some cases (small page size and index configured for highest level of detail), but the conditions are such that I don't think it matters. ISTM the thing to do here is to do the math at index creation time, and if we find that we don't have enough space in the metapage for all array revmap pointers we need, bail out and require the index to be created with a larger pages_per_range setting. > I wonder if it would be simpler to just always store the revmap > pages in the beginning of the index, before any other pages. Finding > the revmap page would then be just as easy as with a separate fork. 
> When the table/index is extended so that a new revmap page is > needed, move the existing page at that block out of the way. Locking > needs some consideration, but I think it would be feasible and > simpler than you have now. Moving index items around is not easy, because you'd have to adjust the revmap to rewrite the item pointers. > >I have followed the suggestion by Amit to overwrite the index tuple when > >a new heap tuple is inserted, instead of creating a separate index > >tuple. This saves a lot of index bloat. This required a new entry > >point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new > >function PageIndexDeleteNoCompact which is similar in spirit to > >PageIndexMultiDelete except that item pointers do not change. This is > >necessary because the revmap stores item pointers, and such references > >would break if we were to renumber items in index pages. > > ISTM that when the old tuple cannot be updated in-place, the new > index tuple is inserted with mm_doinsert(), but the old tuple is > never deleted. It's deleted by the next vacuum. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
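The logical-to-physical translation described in this exchange boils down to integer arithmetic plus one extra page read once the index outgrows block 1. A sketch with illustrative names only, using the per-page capacities quoted in the mail:

    #define REGULAR_REVMAP_ITEMS  1365   /* TIDs per regular revmap page */
    #define ARRAY_REVMAP_ITEMS    2048   /* block refs per array revmap page */

    /* find the physical block holding the revmap entry for heapBlk */
    static BlockNumber
    revmap_physical_block(BlockNumber *top_array, BlockNumber heapBlk,
                          int pages_per_range)
    {
        uint32  rangeno  = heapBlk / pages_per_range;
        uint32  logblk   = rangeno / REGULAR_REVMAP_ITEMS; /* logical page */
        uint32  arrayno  = logblk / ARRAY_REVMAP_ITEMS;
        uint32  arrayoff = logblk % ARRAY_REVMAP_ITEMS;

        if (logblk == 0)
            return 1;       /* first logical revmap page lives in block 1 */

        /* top_array comes from the metapage; read_array_page_entry() is
         * an invented helper that reads the array page and returns the
         * BlockNumber stored at the given offset */
        return read_array_page_entry(top_array[arrayno], arrayoff);
    }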
Re: [HACKERS] Minmax indexes
Some comments, aside from the design wrt. bounding boxes etc. : On 06/15/2014 05:34 AM, Alvaro Herrera wrote: Robert Haas wrote: On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera wrote: Here's an updated version of this patch, with fixes to all the bugs reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and Amit Kapila for the reports. I'm not very happy with the use of a separate relation fork for storing this data. Here's a new version of this patch. Now the revmap is not stored in a separate fork, but together with all the regular data, as explained elsewhere in the thread. Thanks! Please update the README accordingly. If I understand the code correctly, the revmap is a three-level deep structure. The bottom level consists of "regular revmap pages", and each regular revmap page is filled with ItemPointerDatas, which point to the index tuples. The middle level consists of "array revmap pages", and each array revmap page contains an array of BlockNumbers of the "regular revmap" pages. The top level is an array of BlockNumbers of the array revmap pages, and it is stored in the metapage. With 8k block size, that's just enough to cover the full range of 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each regular revmap page can store about 8192/6 = 1365 item pointers, each array revmap page can store about 8192/4 = 2048 block references, and the size of the top array is 8192/4. That's just enough; to store the required number of array pages in the top array, the array needs to be (2^32/1365)/2048 = 1536 elements large. But with 4k or smaller blocks, it's not enough. I wonder if it would be simpler to just always store the revmap pages in the beginning of the index, before any other pages. Finding the revmap page would then be just as easy as with a separate fork. When the table/index is extended so that a new revmap page is needed, move the existing page at that block out of the way. Locking needs some consideration, but I think it would be feasible and simpler than you have now. I have followed the suggestion by Amit to overwrite the index tuple when a new heap tuple is inserted, instead of creating a separate index tuple. This saves a lot of index bloat. This required a new entry point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new function PageIndexDeleteNoCompact which is similar in spirit to PageIndexMultiDelete except that item pointers do not change. This is necessary because the revmap stores item pointers, and such references would break if we were to renumber items in index pages. ISTM that when the old tuple cannot be updated in-place, the new index tuple is inserted with mm_doinsert(), but the old tuple is never deleted. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
I'm sorry if I missed something, but ISTM this is beginning to look a lot like GiST. This was pointed out by Robert Haas last year. On Wed, Jun 18, 2014 at 12:09:42PM -0300, Claudio Freire wrote: > So, you have: > > An aggregate to generate a "compressed set" from several values Which GiST does by calling 'compress' on each value, and then 'unions' the results together. > A function which adds a new value to the "compressed set" and returns > the new "compressed set" Again, 'compress' + 'union' > A function which tests if a value is in a "compressed set" Which GiST does using 'compress' + 'consistent' > A function which tests if a "compressed set" overlaps another > "compressed set" of equal type Which GiST calls 'consistent' So I'm wondering why you can't just reuse the btree_gist functions we already have in contrib. It seems to me that these MinMax indexes are in fact a variation on GiST that indexes the pages of a table based upon the 'union' of all the elements in a page. By reusing the GiST operator class you get support for many datatypes for free. > If you can define different compressed sets, you can use this to > generate both min/max indexes as well as bloom filter indexes. Whether > we'd want to have both is perhaps questionable, but having the ability > to is probably desirable. You could implement a bloom filter in GiST too. It's been discussed before but I can't find any implementation. Probably because the filter needs to be parameterised and if you store the bloom filter for each element it gets expensive very quickly. However, hooked into a minmax structure which only indexes whole pages it could be quite efficient. > One problem with such a generalized implementation would be that I'm > not sure in-place modification of the "compressed set" on-disk can be > assumed to be safe in all cases. Surely, for strictly-enlarging sets > it would, but while min/max and bloom filters both fit the bill, it's > not clear that one can assume this for all structures. I think GiST has already solved this problem. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Minmax indexes
On Wed, Jun 18, 2014 at 8:51 AM, Heikki Linnakangas wrote: > Implementing something is a good way to demonstrate what it would look like. > But no, I don't insist on implementing every possible design whenever a new > feature is proposed. > > I liked Greg's sketch of what the opclass support functions would be. It > doesn't seem significantly more complicated than what's in the patch now. Which was: On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark wrote: > An aggregate to generate a min/max "bounding box" from several values > A function which takes a "bounding box" and a new value and returns > the new "bounding box" > A function which tests if a value is in a "bounding box" > A function which tests if a "bounding box" overlaps a "bounding box" Which I'd generalize a bit further by renaming "bounding box" to "compressed set", and allow it to be parameterized. So, you have: An aggregate to generate a "compressed set" from several values A function which adds a new value to the "compressed set" and returns the new "compressed set" A function which tests if a value is in a "compressed set" A function which tests if a "compressed set" overlaps another "compressed set" of equal type If you can define different compressed sets, you can use this to generate both min/max indexes as well as bloom filter indexes. Whether we'd want to have both is perhaps questionable, but having the ability to is probably desirable. One problem with such a generalized implementation would be that I'm not sure in-place modification of the "compressed set" on-disk can be assumed to be safe in all cases. Surely, for strictly-enlarging sets it would, but while min/max and bloom filters both fit the bill, it's not clear that one can assume this for all structures. Also, adding an "in-place updateable" bit to the "type" would perhaps inflate the complexity of the patch due to the need to provide both code paths? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
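Claudio's four operations map naturally onto a support-function table. A sketch with invented names only; the aggregate in the first bullet is just make_empty() followed by repeated add_value():

    typedef struct CompressedSetOps
    {
        /* build an empty summary for a fresh page range */
        Datum   (*make_empty) (void);
        /* fold a new heap value into the summary; true if it changed */
        bool    (*add_value) (Datum *set, Datum newval);
        /* may the summary contain this value? (no false negatives) */
        bool    (*maybe_contains) (Datum set, Datum value);
        /* may two summaries of the same type intersect? */
        bool    (*maybe_overlaps) (Datum a, Datum b);
    } CompressedSetOps;

A min/max opclass would fill these slots with interval comparisons, a bloom opclass with bit operations; the core index code need not know which.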
Re: [HACKERS] Minmax indexes
On Thu, Jun 19, 2014 at 10:06 AM, Heikki Linnakangas wrote: > On 06/18/2014 06:09 PM, Claudio Freire wrote: >> >> On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark wrote: >>> >>> An aggregate to generate a min/max "bounding box" from several values >>> A function which takes a "bounding box" and a new value and returns >>> the new "bounding box" >>> A function which tests if a value is in a "bounding box" >>> A function which tests if a "bounding box" overlaps a "bounding box" >> >> >> Which I'd generalize a bit further by renaming "bounding box" to >> "compressed set", and allow it to be parameterized. > > > What do you mean by parameterized? Bloom filters can be paired with number of hashes, number of bit positions, and hash function, so it's not a simple bloom filter index, but a bloom filter index with N SHA-1-based hashes spread on a K-length bitmap. >> So, you have: >> >> An aggregate to generate a "compressed set" from several values >> A function which adds a new value to the "compressed set" and returns >> the new "compressed set" >> A function which tests if a value is in a "compressed set" >> A function which tests if a "compressed set" overlaps another >> "compressed set" of equal type > > > Yeah, something like that. I'm not sure I like the "compressed set" term any > more than bounding box, though. GiST seems to have avoided naming the thing, > and just talks about "index entries". But if we can come up with a good > name, that would be more clear. I don't want to use the term bloom filter since it's too specific to one particular technique, but it's basically that - an approximate set without false negatives. Ie: compressed set. It's not a bounding box either when using bloom filters. So... >> One problem with such a generalized implementation would be that I'm >> not sure in-place modification of the "compressed set" on-disk can be >> assumed to be safe in all cases. Surely, for strictly-enlarging sets >> it would, but while min/max and bloom filters both fit the bill, it's >> not clear that one can assume this for all structures. > > > I don't understand what you mean. It's a fundamental property of minmax > indexes that you can always replace the "min" or "max" or "compressed set" > or "bounding box" or whatever with another datum that represents all the > keys that the old one did, plus some. Yes, and bloom filters happen to fall in that category too. Never mind what I said. I was thinking of some other hypothetical implementation that supports removal or updates, which might need care with transaction lifetimes, but that's easily fixed by letting vacuum or some lazy process do the deleting just as it happens with other indexes anyway. So, I guess the interface must also include the invariant that compressed sets only grow, never shrink unless from a rebuild or a vacuum operation. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
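For the record, the parameterized shape being described is small. A self-contained sketch in plain C, using double hashing (Kirsch-Mitzenmacher) in place of N independent hash functions; slicing a SHA-1 digest into two 32-bit words would serve as h1/h2:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct BloomSet
    {
        uint32_t nbits;      /* K: bitmap length, from index reloptions */
        uint32_t nhashes;    /* N: probes per value, from index reloptions */
        uint8_t  bits[];     /* nbits/8 bytes follow the header */
    } BloomSet;

    static void
    bloom_add(BloomSet *b, uint32_t h1, uint32_t h2)
    {
        h2 |= 1;                    /* keep the probe step odd and non-zero */
        for (uint32_t i = 0; i < b->nhashes; i++)
        {
            uint32_t pos = (h1 + i * h2) % b->nbits;
            b->bits[pos / 8] |= (uint8_t) (1u << (pos % 8));
        }
    }

    static bool
    bloom_maybe_contains(const BloomSet *b, uint32_t h1, uint32_t h2)
    {
        h2 |= 1;
        for (uint32_t i = 0; i < b->nhashes; i++)
        {
            uint32_t pos = (h1 + i * h2) % b->nbits;
            if (!(b->bits[pos / 8] & (1u << (pos % 8))))
                return false;       /* definitely not in the set */
        }
        return true;                /* possibly in the set */
    }

Note that bits are only ever set, never cleared, which is precisely the strictly-enlarging property discussed here: an in-place on-disk update can never un-summarize a value.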
Re: [HACKERS] Minmax indexes
On Wed, Jun 18, 2014 at 4:51 AM, Heikki Linnakangas wrote: > Implementing something is a good way to demonstrate what it would look like. > But no, I don't insist on implementing every possible design whenever a new > feature is proposed. > > I liked Greg's sketch of what the opclass support functions would be. It > doesn't seem significantly more complicated than what's in the patch now. As a counter-point to my own point, there will be nothing stopping us in the future from generalizing things. Dealing with catalogs is mostly book-keeping headaches and careful work. It's something that might be well-suited for a GSOC or first patch from someone looking to familiarize themselves with the system architecture. It's hard to invent a whole new underlying infrastructure at the same time as dealing with all that book-keeping and it's hard for someone familiarizing themselves with the system to also have a great new idea. Having tasks like this that are easy to explain and that a mentor understands well can be easier to manage than tasks where the newcomer has some radical new idea. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Vik Fearing writes: > On 06/18/2014 12:46 PM, Andres Freund wrote: >> So to implement a feature one now has to implement the most generic >> variant as a prototype first? Really? > Well, there is the inventor's paradox to consider. I have not seen anyone demanding a different implementation in this thread. What *has* been asked for, and not supplied, is a concrete defense of the particular level of generality that's been selected in this implementation. It's not at all clear to the rest of us whether it was the right choice, and that is something that ought to be asked now not later. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 06/18/2014 06:09 PM, Claudio Freire wrote: On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark wrote: An aggregate to generate a min/max "bounding box" from several values A function which takes a "bounding box" and a new value and returns the new "bounding box" A function which tests if a value is in a "bounding box" A function which tests if a "bounding box" overlaps a "bounding box" Which I'd generalize a bit further by renaming "bounding box" to "compressed set", and allow it to be parameterized. What do you mean by parameterized? So, you have: An aggregate to generate a "compressed set" from several values A function which adds a new value to the "compressed set" and returns the new "compressed set" A function which tests if a value is in a "compressed set" A function which tests if a "compressed set" overlaps another "compressed set" of equal type Yeah, something like that. I'm not sure I like the "compressed set" term any more than bounding box, though. GiST seems to have avoided naming the thing, and just talks about "index entries". But if we can come up with a good name, that would be more clear. One problem with such a generalized implementation would be that I'm not sure in-place modification of the "compressed set" on-disk can be assumed to be safe in all cases. Surely, for strictly-enlarging sets it would, but while min/max and bloom filters both fit the bill, it's not clear that one can assume this for all structures. I don't understand what you mean. It's a fundamental property of minmax indexes that you can always replace the "min" or "max" or "compressed set" or "bounding box" or whatever with another datum that represents all the keys that the old one did, plus some. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 06/18/2014 12:46 PM, Andres Freund wrote: >>> Isn't 'simpler implementation' a valid reason that's already been >>> discussed onlist? Obviously simpler implementation doesn't trump >>> everything, but it's one valid reason... >>> Note that I have zap to do with the design of this feature. I work for >>> the same company as Alvaro, but that's pretty much it... >> >> Without some analysis (e.g. implementing it and comparing), I don't buy that >> it makes the implementation simpler to restrict it in this way. Maybe it >> does, but often it's actually simpler to solve the general case. > > So to implement a feature one now has to implement the most generic > variant as a prototype first? Really? Well, there is the inventor's paradox to consider. -- Vik -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 2:16 PM, Andres Freund wrote: >> But I'm not trying to say that we absolutely have to support that kind >> of thing; what I am trying to say is that there should be a README or >> a mailing list post or some such that says: "We thought about how >> generic to make this. We considered A, B, and C. We rejected C as >> too narrow, and A because if we made it that general it would have >> greatly enlarged the disk footprint for the following reasons. >> Therefore we selected B." > > Isn't 'simpler implementation' a valid reason that's already been > discussed onlist? Obviously simpler implementation doesn't trump > everything, but it's one valid reason... > Note that I have zap to do with the design of this feature. I work for > the same company as Alvaro, but that's pretty much it... It really *hasn't* been discussed on-list. See these emails, discussing the same ideas, from 8 months ago: http://www.postgresql.org/message-id/5249b2d3.6030...@vmware.com http://www.postgresql.org/message-id/ca+tgmoyscbw-uc8lqv96szignxsuzayqbfdqmk-fgu22hdk...@mail.gmail.com Now, Alvaro did not respond to those emails, nor did anyone involved in the development of the feature. There may be an argument that implementing that would be too complicated, but Heikki said he didn't think it would be, and nobody's made a concrete argument as to why he's wrong (and Heikki knows a lot about indexing). >> Basically, I think Heikki asked a good >> question - which was "could we abstract this more?" - and I can't >> recall seeing a clear answer explaining why we could or couldn't and >> what the trade-offs would be. > > 'could we abstract more' imo is a pretty bad design guideline. It's 'is > there benefit in abstracting more'. Otherwise you end up with way too > complicated systems. On the flip side, if you don't abstract enough, you end up being able to cover only a small set of the relevant use cases, or else you end up with a bunch of poorly-coordinated tools to cover slightly different use cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 06/18/2014 01:46 PM, Andres Freund wrote: On 2014-06-18 12:18:26 +0300, Heikki Linnakangas wrote: The main problem with using it for geometric types is that you can't easily CLUSTER the table to make the minmax index effective again. But there are ways around that. Which are? Sure you can try stuff like recreating the table, sorting rows with bounding box areas above a threshold first, and then going on to sort by the top left corner of the bounding box. Right, something like that. Or cluster using some other column that correlates with the geometry, like a zip code. But that'll be neither builtin, nor convenient, nor perfect. In contrast to a normal CLUSTER for types with a btree opclass which will yield the perfect order. Sure. BTW, CLUSTERing by a geometric type would be useful anyway, even without minmax indexes. Isn't 'simpler implementation' a valid reason that's already been discussed onlist? Obviously simpler implementation doesn't trump everything, but it's one valid reason... Note that I have zap to do with the design of this feature. I work for the same company as Alvaro, but that's pretty much it... Without some analysis (e.g. implementing it and comparing), I don't buy that it makes the implementation simpler to restrict it in this way. Maybe it does, but often it's actually simpler to solve the general case. So to implement a feature one now has to implement the most generic variant as a prototype first? Really? Implementing something is a good way to demonstrate what it would look like. But no, I don't insist on implementing every possible design whenever a new feature is proposed. I liked Greg's sketch of what the opclass support functions would be. It doesn't seem significantly more complicated than what's in the patch now. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 2014-06-17 16:48:07 -0700, Greg Stark wrote: > On Tue, Jun 17, 2014 at 11:16 AM, Andres Freund > wrote: > > Well, it might be doable to correlate them along one axis, but along > > both? That's more complicated... And even along one axis you > > already get into problems if your geometries are irregularly sized. > > A single large polygon will completely destroy indexability for anything > > stored physically close by because suddenly the minmax range will be > > huge... So you'll need to cleverly sort for that as well. > > I think there's a misunderstanding here, possibly mine. My > understanding is that a min/max index will always be exactly the same > size for a given size table. It stores the minimum and maximum value > of the key for each page. Then you can do a bitmap scan by comparing > the search key with each page's minimum and maximum to see if that > page needs to be included in the scan. The failure mode is not that > the index is large but that a page that has an outlier will be > included in every scan as a false positive incurring an extra iop. I just rechecked, and no, it doesn't, by default, store a range for each page. It's MINMAX_DEFAULT_PAGES_PER_RANGE=128 pages by default... Haven't checked what's the lowest it can be set to. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
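For a sense of scale, a back-of-the-envelope calculation as a standalone C snippet. It assumes the default 8 kB page size and the 128-pages-per-range default mentioned above, plus Alvaro's estimate elsewhere in the thread of roughly 500 summary tuples per index page for a single 4-byte column; the 100 GB table size is just an example:

    #include <stdio.h>

    int
    main(void)
    {
        const long long blcksz = 8192;               /* default page size */
        const long long pages_per_range = 128;       /* MINMAX_DEFAULT_PAGES_PER_RANGE */
        const long long tuples_per_index_page = 500; /* Alvaro's figure, elsewhere in thread */
        long long   table_bytes = 100LL * 1024 * 1024 * 1024;  /* a 100 GB table */
        long long   heap_pages = table_bytes / blcksz;
        long long   ranges = (heap_pages + pages_per_range - 1) / pages_per_range;
        long long   index_pages = (ranges + tuples_per_index_page - 1) / tuples_per_index_page;

        /* ~13 million heap pages summarize into ~102,400 ranges,
         * i.e. roughly 205 index pages (~1.6 MB) for 100 GB of heap */
        printf("heap pages: %lld, ranges: %lld, index pages: ~%lld (~%lld kB)\n",
               heap_pages, ranges, index_pages, index_pages * blcksz / 1024);
        return 0;
    }

This is the "small coarse index" trade-off in numbers: roughly four orders of magnitude smaller than the heap, at the cost of scanning a whole 128-page range whenever one of its pages might match.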
Re: [HACKERS] Minmax indexes
On 2014-06-18 12:18:26 +0300, Heikki Linnakangas wrote: > On 06/17/2014 09:16 PM, Andres Freund wrote: > >Well, it might be doable to correlate them along one axis, but along > >both? That's more complicated... And even along one axis you > >already get into problems if your geometries are irregularly sized. > > Sure, there are cases where it would be useless. But it's easy to imagine > scenarios where it would work well, where points are loaded in clusters and > points that are close to each other also end up physically close to each > other. > >A single large polygon will completely destroy indexability for anything > >stored physically close by because suddenly the minmax range will be > >huge... So you'll need to cleverly sort for that as well. > > That's an inherent risk with minmax indexes: insert a few rows to the > "wrong" locations in the heap, and the selectivity of the index degrades > rapidly. Sure. But it's fairly normal to have natural clusteredness in many columns (surrogate keys, date-series type of data). Even if you insert geometric types in geographic clusters you'll have worse results because some bounding boxes will be big and such. And: > The main problem with using it for geometric types is that you can't easily > CLUSTER the table to make the minmax index effective again. But there are > ways around that. Which are? Sure you can try stuff like recreating the table, sorting rows with bounding box areas above a threshold first, and then going on to sort by the top left corner of the bounding box. But that'll be neither builtin, nor convenient, nor perfect. In contrast to a normal CLUSTER for types with a btree opclass which will yield the perfect order. > >Isn't 'simpler implementation' a valid reason that's already been > >discussed onlist? Obviously simpler implementation doesn't trump > >everything, but it's one valid reason... > >Note that I have zap to do with the design of this feature. I work for > >the same company as Alvaro, but that's pretty much it... > > Without some analysis (e.g. implementing it and comparing), I don't buy that > it makes the implementation simpler to restrict it in this way. Maybe it > does, but often it's actually simpler to solve the general case. So to implement a feature one now has to implement the most generic variant as a prototype first? Really? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 06/17/2014 09:16 PM, Andres Freund wrote: On 2014-06-17 12:14:00 -0400, Robert Haas wrote: On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund wrote: Well, I'm not the guy who does things with geometric data, but I don't want to ignore the significant percentage of our users who are. As you must surely know, the GIST implementations for geometric data types store bounding boxes on internal pages, and that seems to be useful to people. What is your reason for thinking that it would be any less useful in this context? For me minmax indexes are helpful because they allow generating *small* 'coarse' indexes over large volumes of data. From my pov that's possible because they don't contain item pointers for every contained row. That'll imo work well if there are consecutive rows in the table that can be summarized into one min/max range. That's quite likely to happen for common applications of a number of scalar datatypes. But the likelihood of placing sufficiently many rows with very similar bounding boxes close together seems much less relevant in practice. And I think that's generally likely for operations which can't be well represented as btree opclasses - the substructure implied inside a Datum will make correlation between consecutive rows less likely. Well, I don't know: suppose you're loading geospatial data showing the location of every building in some country. It might easily be the case that the data is or can be loaded in an order that provides pretty good spatial locality, leading to tight bounding boxes over physically consecutive data ranges. Well, it might be doable to correlate them along one axis, but along both? That's more complicated... And even along one axis you already get into problems if your geometries are irregularly sized. Sure, there are cases where it would be useless. But it's easy to imagine scenarios where it would work well, where points are loaded in clusters and points that are close to each other also end up physically close to each other. A single large polygon will completely destroy indexability for anything stored physically close by because suddenly the minmax range will be huge... So you'll need to cleverly sort for that as well. That's an inherent risk with minmax indexes: insert a few rows to the "wrong" locations in the heap, and the selectivity of the index degrades rapidly. The main problem with using it for geometric types is that you can't easily CLUSTER the table to make the minmax index effective again. But there are ways around that. But I'm not trying to say that we absolutely have to support that kind of thing; what I am trying to say is that there should be a README or a mailing list post or some such that says: "We thought about how generic to make this. We considered A, B, and C. We rejected C as too narrow, and A because if we made it that general it would have greatly enlarged the disk footprint for the following reasons. Therefore we selected B." Isn't 'simpler implementation' a valid reason that's already been discussed onlist? Obviously simpler implementation doesn't trump everything, but it's one valid reason... Note that I have zap to do with the design of this feature. I work for the same company as Alvaro, but that's pretty much it... Without some analysis (e.g. implementing it and comparing), I don't buy that it makes the implementation simpler to restrict it in this way. Maybe it does, but often it's actually simpler to solve the general case. 
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 11:16 AM, Andres Freund wrote: > Well, it might be doable to correlate them along one axis, but along > both? That's more complicated... And even along one axis you > already get into problems if your geometries are irregularly sized. > A single large polygon will completely destroy indexability for anything > stored physically close by because suddenly the minmax range will be > huge... So you'll need to cleverly sort for that as well. I think there's a misunderstanding here, possibly mine. My understanding is that a min/max index will always be exactly the same size for a given size table. It stores the minimum and maximum value of the key for each page. Then you can do a bitmap scan by comparing the search key with each page's minimum and maximum to see if that page needs to be included in the scan. The failure mode is not that the index is large but that a page that has an outlier will be included in every scan as a false positive incurring an extra iop. I don't think it's implausible at all that geometric data would work well. If you load geometric data it's very common to load data by geographic area so that all objects in San Francisco are in one part of the data load, probably even by zip code or census block. What operations would an opclass for min/max need? I think it would be pretty similar to the operators that GiST needs (thankfully minus the baroque page split function): An aggregate to generate a min/max "bounding box" from several values A function which takes a "bounding box" and a new value and returns the new "bounding box" A function which tests if a value is in a "bounding box" A function which tests if a "bounding box" overlaps a "bounding box" The nice thing is this would let us add things like range @> (contains element) to the plain integer min/max case. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
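For illustration only, Greg's four operations could be written down as a C API sketch along these lines; the SummaryBox type and all names are hypothetical, not from the patch, and the real thing would use PostgreSQL's fmgr conventions rather than bare function pointers:

    #include <stdbool.h>

    typedef struct SummaryBox SummaryBox;   /* opaque, opclass-defined */

    /* An aggregate to generate a "bounding box" from several values. */
    typedef SummaryBox *(*summary_build_fn) (const void **values, int nvalues);

    /* Fold one more value into an existing "bounding box", returning the new one. */
    typedef SummaryBox *(*summary_add_value_fn) (SummaryBox *box, const void *value);

    /* Test whether a value may be contained in the "bounding box". */
    typedef bool (*summary_contains_fn) (const SummaryBox *box, const void *value);

    /* Test whether two "bounding boxes" overlap. */
    typedef bool (*summary_overlaps_fn) (const SummaryBox *a, const SummaryBox *b);

Under this shape, the plain min/max opclass is just one instantiation: SummaryBox is a {min, max} pair, contains is a two-comparison check, and overlaps is interval intersection.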
Re: [HACKERS] Minmax indexes
On 06/17/2014 09:14 AM, Robert Haas wrote: > Well, I don't know: suppose you're loading geospatial data showing the > location of every building in some country. It might easily be the > case that the data is or can be loaded in an order that provides > pretty good spatial locality, leading to tight bounding boxes over > physically consecutive data ranges. I admin a production application which has exactly this. However, that application doesn't have big enough data to benefit from minmax indexes; it uses the basic spatial indexes. So, my $0.02: bounding box minmax falls under the heading of "would be nice to have, but not if it delays the feature". -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 1:04 PM, Andres Freund wrote: > For me minmax indexes are helpful because they allow generating *small* > 'coarse' indexes over large volumes of data. From my pov that's possible > because they don't contain item pointers for every contained > row. But minmax is just a specific form of bloom filter. This could certainly be generalized to a bloom filter index with some set of bloom & hashing operators (minmax being just one). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
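To make the generalization concrete, a minimal self-contained C sketch of a per-range bloom filter summary (the sizes, hash function, and names here are placeholders; a real opclass would use the datatype's hash opfamily). Note it has the same strictly-enlarging shape as min/max: adding a value only ever sets bits, so it can safely replace the old summary in place.

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOOM_BITS   1024   /* filter size per page range (placeholder) */
    #define BLOOM_HASHES 3      /* hash probes per value (placeholder) */

    typedef struct RangeBloom
    {
        uint8_t     bits[BLOOM_BITS / 8];
    } RangeBloom;

    /* Stand-in hash; a real opclass would use the datatype's hash function. */
    static uint32_t
    mix(uint32_t x, uint32_t seed)
    {
        x ^= seed + 0x9e3779b9u;
        x *= 2654435761u;       /* Knuth's multiplicative hash */
        x ^= x >> 16;
        return x;
    }

    /* The "add value" operation: only ever sets bits (strictly enlarging). */
    static void
    bloom_add(RangeBloom *b, uint32_t value)
    {
        for (uint32_t i = 0; i < BLOOM_HASHES; i++)
        {
            uint32_t    bit = mix(value, i) % BLOOM_BITS;

            b->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
        }
    }

    /* The "consistent" check: false means the range cannot contain value. */
    static bool
    bloom_might_contain(const RangeBloom *b, uint32_t value)
    {
        for (uint32_t i = 0; i < BLOOM_HASHES; i++)
        {
            uint32_t    bit = mix(value, i) % BLOOM_BITS;

            if ((b->bits[bit / 8] & (uint8_t) (1 << (bit % 8))) == 0)
                return false;
        }
        return true;
    }

Where min/max answers range queries (<, <=, =, >=, >), a bloom summary would answer only equality probes, but for data with no physical ordering it can still rule out most page ranges.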
Re: [HACKERS] Minmax indexes
On 2014-06-17 12:14:00 -0400, Robert Haas wrote: > On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund > wrote: > >> Well, I'm not the guy who does things with geometric data, but I don't > >> want to ignore the significant percentage of our users who are. As > >> you must surely know, the GIST implementations for geometric data > >> types store bounding boxes on internal pages, and that seems to be > >> useful to people. What is your reason for thinking that it would be > >> any less useful in this context? > > > > For me minmax indexes are helpful because they allow generating *small* > > 'coarse' indexes over large volumes of data. From my pov that's possible > > because they don't contain item pointers for every contained > > row. > > That'll imo work well if there are consecutive rows in the table that > > can be summarized into one min/max range. That's quite likely to happen > > for common applications of a number of scalar datatypes. But the > > likelihood of placing sufficiently many rows with very similar bounding > > boxes close together seems much less relevant in practice. And I think > > that's generally likely for operations which can't be well represented > > as btree opclasses - the substructure implied inside a Datum will > > make correlation between consecutive rows less likely. > > Well, I don't know: suppose you're loading geospatial data showing the > location of every building in some country. It might easily be the > case that the data is or can be loaded in an order that provides > pretty good spatial locality, leading to tight bounding boxes over > physically consecutive data ranges. Well, it might be doable to correlate them along one axis, but along both? That's more complicated... And even along one axis you already get into problems if your geometries are irregularly sized. A single large polygon will completely destroy indexability for anything stored physically close by because suddenly the minmax range will be huge... So you'll need to cleverly sort for that as well. I think hierarchical datastructures are so much better suited for this that there's little point trying to fit them into minmax. I can very well imagine that there's benefit in a gist support for only storing one pointer per block instead of one pointer per item or such. But it seems like a separate feature. > But I'm not trying to say that we absolutely have to support that kind > of thing; what I am trying to say is that there should be a README or > a mailing list post or some such that says: "We thought about how > generic to make this. We considered A, B, and C. We rejected C as > too narrow, and A because if we made it that general it would have > greatly enlarged the disk footprint for the following reasons. > Therefore we selected B." Isn't 'simpler implementation' a valid reason that's already been discussed onlist? Obviously simpler implementation doesn't trump everything, but it's one valid reason... Note that I have zap to do with the design of this feature. I work for the same company as Alvaro, but that's pretty much it... > Basically, I think Heikki asked a good > question - which was "could we abstract this more?" - and I can't > recall seeing a clear answer explaining why we could or couldn't and > what the trade-offs would be. 'could we abstract more' imo is a pretty bad design guideline. It's 'is there benefit in abstracting more'. Otherwise you end up with way too complicated systems. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund wrote: >> Well, I'm not the guy who does things with geometric data, but I don't >> want to ignore the significant percentage of our users who are. As >> you must surely know, the GIST implementations for geometric data >> types store bounding boxes on internal pages, and that seems to be >> useful to people. What is your reason for thinking that it would be >> any less useful in this context? > > For me minmax indexes are helpful because they allow generating *small* > 'coarse' indexes over large volumes of data. From my pov that's possible > because they don't contain item pointers for every contained > row. > That'll imo work well if there are consecutive rows in the table that > can be summarized into one min/max range. That's quite likely to happen > for common applications of a number of scalar datatypes. But the > likelihood of placing sufficiently many rows with very similar bounding > boxes close together seems much less relevant in practice. And I think > that's generally likely for operations which can't be well represented > as btree opclasses - the substructure implied inside a Datum will > make correlation between consecutive rows less likely. Well, I don't know: suppose you're loading geospatial data showing the location of every building in some country. It might easily be the case that the data is or can be loaded in an order that provides pretty good spatial locality, leading to tight bounding boxes over physically consecutive data ranges. But I'm not trying to say that we absolutely have to support that kind of thing; what I am trying to say is that there should be a README or a mailing list post or some such that says: "We thought about how generic to make this. We considered A, B, and C. We rejected C as too narrow, and A because if we made it that general it would have greatly enlarged the disk footprint for the following reasons. Therefore we selected B." Basically, I think Heikki asked a good question - which was "could we abstract this more?" - and I can't recall seeing a clear answer explaining why we could or couldn't and what the trade-offs would be. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 2014-06-17 11:48:10 -0400, Robert Haas wrote: > On Tue, Jun 17, 2014 at 10:31 AM, Andres Freund > wrote: > > On 2014-06-17 10:26:11 -0400, Robert Haas wrote: > >> On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera > >> wrote: > >> > Robert Haas wrote: > >> >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > >> >> wrote: > >> >> > Here's an updated version of this patch, with fixes to all the bugs > >> >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers > >> >> > and > >> >> > Amit Kapila for the reports. > >> >> > >> >> I'm not very happy with the use of a separate relation fork for > >> >> storing this data. > >> > > >> > Here's a new version of this patch. Now the revmap is not stored in a > >> > separate fork, but together with all the regular data, as explained > >> > elsewhere in the thread. > >> > >> Cool. > >> > >> Have you thought more about this comment from Heikki? > >> > >> http://www.postgresql.org/message-id/52495dd3.9010...@vmware.com > > > > Is there actually a significant usecase behind that wish or just a > > general demand for being generic? To me it seems fairly unlikely you'd > > end up with something useful by doing a minmax index over bounding > > boxes. > > Well, I'm not the guy who does things with geometric data, but I don't > want to ignore the significant percentage of our users who are. As > you must surely know, the GIST implementations for geometric data > types store bounding boxes on internal pages, and that seems to be > useful to people. What is your reason for thinking that it would be > any less useful in this context? For me minmax indexes are helpful because they allow generating *small* 'coarse' indexes over large volumes of data. From my pov that's possible because they don't contain item pointers for every contained row. That'll imo work well if there are consecutive rows in the table that can be summarized into one min/max range. That's quite likely to happen for common applications of a number of scalar datatypes. But the likelihood of placing sufficiently many rows with very similar bounding boxes close together seems much less relevant in practice. And I think that's generally likely for operations which can't be well represented as btree opclasses - the substructure implied inside a Datum will make correlation between consecutive rows less likely. Maybe I've a major intuition failure here though... > I do also think that a general demand for being generic ought to carry > some weight. Agreed. It's always a balancing act. But it's not like this doesn't use a datatype abstraction concept... > We have gone to great lengths to make sure that our > indexing can handle more than just < and >, where a lot of other > products have not bothered. I think we have gotten a lot of mileage > out of that decision and feel that we shouldn't casually back away > from it. I don't see this as a case of backing away from that though? > we shouldn't accept a less-generic > approach blindly, without questioning whether it's possible to do better. But the aim shouldn't be to add genericity that's not going to be used, but to add it where it's somewhat likely to help... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 10:31 AM, Andres Freund wrote: > On 2014-06-17 10:26:11 -0400, Robert Haas wrote: >> On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera >> wrote: >> > Robert Haas wrote: >> >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera >> >> wrote: >> >> > Here's an updated version of this patch, with fixes to all the bugs >> >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and >> >> > Amit Kapila for the reports. >> >> >> >> I'm not very happy with the use of a separate relation fork for >> >> storing this data. >> > >> > Here's a new version of this patch. Now the revmap is not stored in a >> > separate fork, but together with all the regular data, as explained >> > elsewhere in the thread. >> >> Cool. >> >> Have you thought more about this comment from Heikki? >> >> http://www.postgresql.org/message-id/52495dd3.9010...@vmware.com > > Is there actually a significant usecase behind that wish or just a > general demand for being generic? To me it seems fairly unlikely you'd > end up with something useful by doing a minmax index over bounding > boxes. Well, I'm not the guy who does things with geometric data, but I don't want to ignore the significant percentage of our users who are. As you must surely know, the GIST implementations for geometric data types store bounding boxes on internal pages, and that seems to be useful to people. What is your reason for thinking that it would be any less useful in this context? I do also think that a general demand for being generic ought to carry some weight. We have gone to great lengths to make sure that our indexing can handle more than just < and >, where a lot of other products have not bothered. I think we have gotten a lot of mileage out of that decision and feel that we shouldn't casually back away from it. Obviously, we do already have some special-case optimizations and will likely have more in the future, and there can certainly be valid reasons for taking that approach. But it needs to be justified in some way; we shouldn't accept a less-generic approach blindly, without questioning whether it's possible to do better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Tue, Jun 17, 2014 at 3:31 PM, Andres Freund wrote: > Is there actually a significant usecase behind that wish or just a > general demand for being generic? To me it seems fairly unlikely you'd > end up with something useful by doing a minmax index over bounding > boxes. Isn't min/max just a 2d bounding box? If you do a bulk data load of something like the census data then sure, every page will have data points for some geometrically clustered set of data. I had in mind to do a small bloom filter per block. In general any kind of predicate like bounding box should work. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 2014-06-17 10:26:11 -0400, Robert Haas wrote: > On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera > wrote: > > Robert Haas wrote: > >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > >> wrote: > >> > Here's an updated version of this patch, with fixes to all the bugs > >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and > >> > Amit Kapila for the reports. > >> > >> I'm not very happy with the use of a separate relation fork for > >> storing this data. > > > > Here's a new version of this patch. Now the revmap is not stored in a > > separate fork, but together with all the regular data, as explained > > elsewhere in the thread. > > Cool. > > Have you thought more about this comment from Heikki? > > http://www.postgresql.org/message-id/52495dd3.9010...@vmware.com Is there actually a significant usecase behind that wish or just a general demand for being generic? To me it seems fairly unlikely you'd end up with something useful by doing a minmax index over bounding boxes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera wrote: > Robert Haas wrote: >> On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera >> wrote: >> > Here's an updated version of this patch, with fixes to all the bugs >> > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and >> > Amit Kapila for the reports. >> >> I'm not very happy with the use of a separate relation fork for >> storing this data. > > Here's a new version of this patch. Now the revmap is not stored in a > separate fork, but together with all the regular data, as explained > elsewhere in the thread. Cool. Have you thought more about this comment from Heikki? http://www.postgresql.org/message-id/52495dd3.9010...@vmware.com I'm concerned that we could end up with one index type of this general nature for min/max type operations, and then another very similar index type for geometric operators or text-search operators or what have you. Considering the overhead in adding and maintaining an index AM, I think we should try to be sure that we've done a reasonably solid job making each one as general as we reasonably can. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Jan 24, 2014 at 12:58 PM, Claudio Freire wrote: > > What's the status? > > I believe I have more than a use for minmax indexes, and wouldn't mind > lending a hand if it's within my grasp. I'm also interested in looking at this. Mostly because I have ideas for other "summary" functions that would be interesting and could use the same infrastructure otherwise. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Jan 24, 2014 at 2:54 PM, Thom Brown wrote: > On 24 January 2014 17:53, Alvaro Herrera wrote: >> Thom Brown wrote: >>> On 8 November 2013 20:11, Alvaro Herrera wrote: >>> > Erik Rijkers wrote: >>> >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: >>> >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: >>> >> > >>> >> >> [minmax-5.patch] >>> >> > >>> >> > I have the impression it's not quite working correctly. >>> > >>> > Here's a version 7 of the patch, which fixes these bugs and adds >>> > opclasses for a bunch more types (timestamp, timestamptz, date, time, >>> > timetz), courtesy of Martín Marqués. It's also been rebased to apply >>> > cleanly on top of today's master branch. >>> > >>> > I have also added a selectivity function, but I'm not positive that it's >>> > very useful yet. >>> >>> This patch doesn't appear to have been submitted to any Commitfest. >>> Is this still a feature undergoing research then? >> >> It's still a planned feature, but I didn't have time to continue work >> for 2014-01. What's the status? I believe I have more than a use for minmax indexes, and wouldn't mind lending a hand if it's within my grasp. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 24 January 2014 17:53, Alvaro Herrera wrote: > Thom Brown wrote: >> On 8 November 2013 20:11, Alvaro Herrera wrote: >> > Erik Rijkers wrote: >> >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: >> >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: >> >> > >> >> >> [minmax-5.patch] >> >> > >> >> > I have the impression it's not quite working correctly. >> > >> > Here's a version 7 of the patch, which fixes these bugs and adds >> > opclasses for a bunch more types (timestamp, timestamptz, date, time, >> > timetz), courtesy of Martín Marqués. It's also been rebased to apply >> > cleanly on top of today's master branch. >> > >> > I have also added a selectivity function, but I'm not positive that it's >> > very useful yet. >> >> This patch doesn't appear to have been submitted to any Commitfest. >> Is this still a feature undergoing research then? > > It's still a planned feature, but I didn't have time to continue work > for 2014-01. Understood. Thanks -- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Thom Brown wrote: > On 8 November 2013 20:11, Alvaro Herrera wrote: > > Erik Rijkers wrote: > >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: > >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: > >> > > >> >> [minmax-5.patch] > >> > > >> > I have the impression it's not quite working correctly. > > > > Here's a version 7 of the patch, which fixes these bugs and adds > > opclasses for a bunch more types (timestamp, timestamptz, date, time, > > timetz), courtesy of Martín Marqués. It's also been rebased to apply > > cleanly on top of today's master branch. > > > > I have also added a selectivity function, but I'm not positive that it's > > very useful yet. > > This patch doesn't appear to have been submitted to any Commitfest. > Is this still a feature undergoing research then? It's still a planned feature, but I didn't have time to continue work for 2014-01. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On 8 November 2013 20:11, Alvaro Herrera wrote: > Erik Rijkers wrote: >> On Thu, September 26, 2013 00:34, Erik Rijkers wrote: >> > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: >> > >> >> [minmax-5.patch] >> > >> > I have the impression it's not quite working correctly. > > Here's a version 7 of the patch, which fixes these bugs and adds > opclasses for a bunch more types (timestamp, timestamptz, date, time, > timetz), courtesy of Martín Marqués. It's also been rebased to apply > cleanly on top of today's master branch. > > I have also added a selectivity function, but I'm not positive that it's > very useful yet. This patch doesn't appear to have been submitted to any Commitfest. Is this still a feature undergoing research then? -- Thom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, Nov 8, 2013 at 12:11 PM, Alvaro Herrera wrote: > Erik Rijkers wrote: > > On Thu, September 26, 2013 00:34, Erik Rijkers wrote: > > > On Wed, September 25, 2013 22:34, Alvaro Herrera wrote: > > > > > >> [minmax-5.patch] > > > > > > I have the impression it's not quite working correctly. > > Here's a version 7 of the patch, which fixes these bugs and adds > opclasses for a bunch more types (timestamp, timestamptz, date, time, > timetz), courtesy of Martín Marqués. It's also been rebased to apply > cleanly on top of today's master branch. > > I have also added a selectivity function, but I'm not positive that it's > very useful yet. > I tested it with attached script, but broke out of the "for" loop after 5 iterations (when it had 300,000,005 rows inserted) Then I did an analyze, and got an error message below: jjanes=# analyze; ERROR: could not truncate file "base/16384/16388_vm" to 488 blocks: it's only 82 blocks now 16388 is the index's relfilenode. Here is the backtrace upon entry to the truncate that is going to fail: #0 mdtruncate (reln=0x23c91b0, forknum=VISIBILITYMAP_FORKNUM, nblocks=488) at md.c:858 #1 0x0048eb4a in mmRevmapTruncate (rmAccess=0x26ad878, heapNumBlocks=1327434) at mmrevmap.c:360 #2 0x0048d37a in mmvacuumcleanup (fcinfo=) at minmax.c:1264 #3 0x0072dcef in FunctionCall2Coll (flinfo=, collation=, arg1=, arg2=) at fmgr.c:1323 #4 0x0048c1e5 in index_vacuum_cleanup (info=, stats=0x0) at indexam.c:715 #5 0x0052a7ce in do_analyze_rel (onerel=0x7f59798589e8, vacstmt=0x23b0bd8, acquirefunc=0x5298d0 , relpages=1327434, inh=0 '\000', elevel=13) at analyze.c:634 #6 0x0052b320 in analyze_rel (relid=, vacstmt=0x23b0bd8, bstrategy=) at analyze.c:267 #7 0x0057cba7 in vacuum (vacstmt=0x23b0bd8, relid=, do_toast=1 '\001', bstrategy=, for_wraparound=0 '\000', isTopLevel=) at vacuum.c:249 #8 0x00663177 in standard_ProcessUtility (parsetree=0x23b0bd8, queryString=, context=, params=0x0, dest=, completionTag=) at utility.c:682 #9 0x7f598290b791 in pgss_ProcessUtility (parsetree=0x23b0bd8, queryString=0x23b0220 "analyze \n;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x23b0f18, completionTag=0x7fffd3442f30 "") at pg_stat_statements.c:825 #10 0x0065fcf7 in PortalRunUtility (portal=0x24195e0, utilityStmt=0x23b0bd8, isTopLevel=1 '\001', dest=0x23b0f18, completionTag=0x7fffd3442f30 "") at pquery.c:1187 #11 0x00660c6d in PortalRunMulti (portal=0x24195e0, isTopLevel=1 '\001', dest=0x23b0f18, altdest=0x23b0f18, completionTag=0x7fffd3442f30 "") at pquery.c:1318 #12 0x00661323 in PortalRun (portal=0x24195e0, count=9223372036854775807, isTopLevel=1 '\001', dest=0x23b0f18, altdest=0x23b0f18, completionTag=0x7fffd3442f30 "") at pquery.c:816 #13 0x0065dbb4 in exec_simple_query (query_string=0x23b0220 "analyze \n;") at postgres.c:1048 #14 0x0065f259 in PostgresMain (argc=, argv=, dbname=0x2347be8 "jjanes", username=) at postgres.c:3992 #15 0x0061b7d0 in BackendRun (argc=, argv=) at postmaster.c:4085 #16 BackendStartup (argc=, argv=) at postmaster.c:3774 #17 ServerLoop (argc=, argv=) at postmaster.c:1585 #18 PostmasterMain (argc=, argv=) at postmaster.c:1240 #19 0x005b5e90 in main (argc=3, argv=0x2346cd0) at main.c:196 Cheers, Jeff minmax_test3.sh Description: Bourne shell script -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes (timings)
Erik Rijkers wrote: > Perhaps someone finds these timings useful. > '--enable-cassert' Assertions can really distort the timings, and not always equally for all code paths. Any chance of re-running those tests without that? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes (timings)
On 2013-11-15 17:11:46 +0100, Erik Rijkers wrote: > I've been messing with minmax indexes some more so here are some results of > that. > > Perhaps someone finds these timings useful. > > > Centos 5.7, 32 GB memory, 2 quadcores. > > '--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' > '--with-pgport=6444' '--enable-depend' '--enable-cassert' > '--enable-debug' '--with-perl' '--with-openssl' '--with-libxml' > '--enable-dtrace' Just some general advice: doing timings with --enable-cassert isn't that meaningful - it often can distort results significantly. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Mon, Nov 11, 2013 at 12:53 AM, Erik Rijkers wrote: > On Fri, November 8, 2013 21:11, Alvaro Herrera wrote: > > > > Here's a version 7 of the patch, which fixes these bugs and adds > > opclasses for a bunch more types (timestamp, timestamptz, date, time, > > timetz), courtesy of Martín Marqués. It's also been rebased to apply > > cleanly on top of today's master branch. > > > > I have also added a selectivity function, but I'm not positive that it's > > very useful yet. > > The earlier errors are indeed fixed; now, I've been trying with the > attached test case but I'm unable to find a query that > improves with minmax index use. (it gets used sometimes but speedup is > negligible). > Your data set seems to be completely random. I believe that minmax indices would only be expected to be useful when the data is clustered. Perhaps you could try it on a table that is populated with something like i + random()/10 * max_i. Cheers, Jeff
Re: [HACKERS] Minmax indexes
On Mon, November 11, 2013 09:53, Erik Rijkers wrote: > On Fri, November 8, 2013 21:11, Alvaro Herrera wrote: >> >> Here's a version 7 of the patch, which fixes these bugs and adds >> opclasses for a bunch more types (timestamp, timestamptz, date, time, >> timetz), courtesy of Martín Marqués. It's also been rebased to apply >> cleanly on top of today's master branch. >> >> I have also added a selectivity function, but I'm not positive that it's >> very useful yet. >> >> [minmax-7.patch] > > The earlier errors are indeed fixed; now, I've been trying with the attached > test case but I'm unable to find a query that > improves with minmax index use. (it gets used sometimes but speedup is > negligible). > Another issue (I think): Attached is a program (and its output as a .txt file) that gives the following (repeatable) error: $ ./casanova_test.sh \timing on drop table if exists t1; Time: 333.159 ms create table t1 (i int); Time: 155.827 ms create index t1_i_idx on t1 using minmax(i); Time: 204.031 ms insert into t1 select generate_series(1, 25000000); Time: 126312.302 ms analyze t1; ERROR: could not truncate file base/21324/26339_vm to 41 blocks: it's only 1 blocks now Time: 472.504 ms \timing on set enable_bitmapscan=1; explain analyze select i from t1 where i between 10000000 and 10001000; Time: 0.508 ms QUERY PLAN -- Bitmap Heap Scan on t1 (cost=32.25..117786.77 rows=125001 width=4) (actual time=1640.520..4465.768 rows=1001 loops=1) Recheck Cond: ((i >= 10000000) AND (i <= 10001000)) Rows Removed by Index Recheck: 24998999 -> Bitmap Index Scan on t1_i_idx (cost=0.00..1.00 rows=125001 width=0) (actual time=65.322..65.322 rows=291 loops=1) Index Cond: ((i >= 10000000) AND (i <= 10001000)) Total runtime: 4466.003 ms (6 rows) Time: 4468.943 ms set enable_bitmapscan=0; explain analyze select i from t1 where i between 10000000 and 10001000; Time: 0.146 ms QUERY PLAN --- Seq Scan on t1 (cost=0.00..485621.80 rows=125001 width=4) (actual time=1158.065..3332.925 rows=1001 loops=1) Filter: ((i >= 10000000) AND (i <= 10001000)) Rows Removed by Filter: 24998999 Total runtime: .038 ms (4 rows) Thanks, Erik Rijkers casanova_test.sh Description: application/shellscript -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
On Fri, November 8, 2013 21:11, Alvaro Herrera wrote: > > Here's a version 7 of the patch, which fixes these bugs and adds > opclasses for a bunch more types (timestamp, timestamptz, date, time, > timetz), courtesy of Martín Marqués. It's also been rebased to apply > cleanly on top of today's master branch. > > I have also added a selectivity function, but I'm not positive that it's > very useful yet. > > [minmax-7.patch] The earlier errors are indeed fixed; now, I've been trying with the attached test case but I'm unable to find a query that improves with minmax index use. (it gets used sometimes but speedup is negligible). That probably means I'm doing something wrong; could you (or anyone) give some hints about what use-case would be expected? (Or is it just the unfinished selectivity function?) Thanks, Erikjan Rijkers test.sh Description: application/shellscript -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Alvaro Herrera escribió: > I have been playing with having the revmap in the main fork of the index > rather than a separate one. ... > This is not complete yet; although I have a proof-of-concept working, I > still need to write XLog support code and update the pageinspect code to > match. Just to be clear: the v7 published elsewhere in this thread does not contain this revmap-in-main-fork code. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Minmax indexes
Robert Haas escribió: > On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera > wrote: > > Here's an updated version of this patch, with fixes to all the bugs > > reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and > > Amit Kapila for the reports. > > I'm not very happy with the use of a separate relation fork for > storing this data. I have been playing with having the revmap in the main fork of the index rather than a separate one. On the surface many things stay just as they are; I only had to add a layer beneath the revmap that maps its logical block numbers to physical block numbers. The problem with this is that it needs more disk access, because revmap block numbers cannot be hardcoded. After doing some quick math, what I ended up with was to keep an array of BlockNumbers in the metapage. Each element in this array points to array pages; each array page is, in turn, filled with more BlockNumbers, which this time correspond to the logical revmap pages we used to have in the revmap fork. (I initially feared that this design would not allow me to address enough revmap pages for the largest of tables; but fortunately this is sufficient unless you configure very small pages, say BLCKSZ 2kB, use small page ranges, and use small datatypes, say "char". I have no problem with saying that that scenario is not supported if you want to have minmax indexes on 32 TB tables. I mean, who uses BLCKSZ smaller than 8kB anyway?). The advantage of this design is that in order to find any particular logical revmap page, you always have to do a constant number of page accesses. You read the metapage, then read the array page, then read the revmap page; done. Another idea I considered was chaining revmap pages (so each would have a pointer-to-next), or chaining array pages; but this would have meant that to locate an individual page near the end of the revmap, you might need to do many accesses. Not good. As an optimization for relatively small indexes, we hardcode the page number for the first revmap page: it's always the page right after the metapage (so BlockNumber 1). A revmap page can store, with the default page size, about 1350 item pointers; so with an index built for page ranges of 1000 pages per range, you can point to enough index entries for a ~10 GB table without needing to examine the first array page. This seems pretty acceptable; people with larger tables can likely spare one extra page access every now and then. (For comparison, each regular minmax page can store about 500 index tuples, if it's built for a single 4-byte column; this means that the 10 GB table requires a 5-page index.) This is not complete yet; although I have a proof-of-concept working, I still need to write XLog support code and update the pageinspect code to match. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
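As a rough illustration of the constant-cost lookup Alvaro describes, a standalone C sketch follows. The struct layouts, the array-page and metapage capacities, and the page-read callback are hypothetical stand-ins (the real code goes through the buffer manager); only the ~1350-entries-per-revmap-page figure comes from the mail above:

    #include <stdint.h>

    typedef uint32_t BlockNumber;

    #define REVMAP_PAGE_MAXITEMS 1350   /* entries per revmap page (Alvaro's estimate) */
    #define ARRAYPAGE_MAXITEMS   2000   /* hypothetical BlockNumbers per array page */
    #define METAPAGE_ARRAY_SIZE  32     /* hypothetical array-page slots in the metapage */

    /* In-memory stand-ins for the on-disk pages, for illustration only. */
    typedef struct ArrayPage
    {
        BlockNumber revmap_blocks[ARRAYPAGE_MAXITEMS];
    } ArrayPage;

    typedef struct Metapage
    {
        BlockNumber array_blocks[METAPAGE_ARRAY_SIZE];
    } Metapage;

    /*
     * Map a page-range number to the physical block of the revmap page
     * that holds its entry: metapage -> array page -> revmap page, i.e.
     * a constant number of page reads regardless of table size.  Logical
     * revmap page 0 is hardcoded at block 1, right after the metapage,
     * so small indexes never touch an array page at all.
     */
    static BlockNumber
    revmap_physical_block(const Metapage *meta,
                          ArrayPage *(*read_array_page) (BlockNumber),
                          long range_number)
    {
        long        logical = range_number / REVMAP_PAGE_MAXITEMS;
        ArrayPage  *ap;

        if (logical == 0)
            return 1;           /* first revmap page follows the metapage */

        ap = read_array_page(meta->array_blocks[logical / ARRAYPAGE_MAXITEMS]);
        return ap->revmap_blocks[logical % ARRAYPAGE_MAXITEMS];
    }

The contrast with the rejected chained-page design is visible here: a chain would make the lookup cost proportional to the logical page number, while the two-level array keeps it at a fixed three reads.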
Re: [HACKERS] Minmax indexes
On Mon, Sep 30, 2013 at 1:20 PM, Heikki Linnakangas wrote: > You can almost create a bounding box opclass in the current implementation, > by mapping < operator to "contains" and > to "not contains". But there's no > support for creating a new, larger, bounding box on insert. It will just > replace the max with the new value if it's "greater than", when it should > create a whole new value to store in the index that covers both the old and > the new values. (or less than? I'm not sure which way those operators would > work..) This sounds an awful lot like GiST's "union" operation. Actually, following the GiST model of having "union" and "consistent" operations might be a smart way to go. Then the exact index semantics could be decided by the opclass. This might not even be that much extra code; the existing consistent and union functions for GiST are pretty short. That way, it'd be easy to add new opclasses with somewhat different behavior; the common thread would be that every opclass of this new AM works by summarizing a physical page range into a single indexed value. You might call the AM something like "summary" or "sparse" and then have "minmax_ops" for your first opclass. > In fact, even with regular b-tree operators, over integers for example, you > don't necessarily want to store both min and max. If you only ever perform > queries like "WHERE col > ?", there's no need to track the min value. So to > make this really general, you should be able to create an index on only the > minimum or maximum. Or if you want both, you can store them as separate > index columns. Something like: > > CREATE INDEX minindex ON table (col ASC); -- For min > CREATE INDEX minindex ON table (col DESC); -- For max > CREATE INDEX minindex ON table (col ASC, col DESC); -- For both This doesn't seem very general, since you're relying on the fact that ASC and DESC map to < and >. It's not clear what you'd write here if you wanted to optimize #$ and @!. But something based on opclasses will work, since each opclass can support an arbitrary set of operators. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
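A sketch of what those two support functions might look like for the plain integer min/max case, following the GiST model Robert describes. This is illustrative only: the names, the strategy enum, and the plain-C calling convention are invented here; a real opclass would go through the fmgr interface, with strategy numbers taken from the operator class definition:

    #include <stdbool.h>

    /* A minmax summary for one 4-byte integer column. */
    typedef struct MinmaxSummary
    {
        int         min;
        int         max;
    } MinmaxSummary;

    /*
     * GiST-style "union": enlarge a so that it covers every key that
     * either summary covered.  A bounding-box opclass would enlarge the
     * box here instead of comparing scalars.
     */
    static void
    minmax_union(MinmaxSummary *a, const MinmaxSummary *b)
    {
        if (b->min < a->min)
            a->min = b->min;
        if (b->max > a->max)
            a->max = b->max;
    }

    /* Strategies this toy opclass understands. */
    typedef enum MinmaxStrategy
    {
        STRAT_LT, STRAT_LE, STRAT_EQ, STRAT_GE, STRAT_GT
    } MinmaxStrategy;

    /*
     * GiST-style "consistent": may any row in the summarized page range
     * satisfy "col <strategy> query"?  True only means the range must be
     * scanned; false positives are acceptable, false negatives are not.
     */
    static bool
    minmax_consistent(const MinmaxSummary *s, MinmaxStrategy strat, int query)
    {
        switch (strat)
        {
            case STRAT_LT:
                return s->min < query;
            case STRAT_LE:
                return s->min <= query;
            case STRAT_EQ:
                return s->min <= query && query <= s->max;
            case STRAT_GE:
                return s->max >= query;
            case STRAT_GT:
                return s->max > query;
        }
        return true;            /* unknown strategy: scan to be safe */
    }

The appeal of this factoring is that the AM core only ever calls union and consistent; whether the summary is an interval, a bounding box, or a bloom filter becomes purely the opclass's business.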