Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Peter Geoghegan
On Sat, Mar 31, 2018 at 8:32 PM, Andres Freund  wrote:
> LGTM, pushed.  Closing CF entry. Yay!   Only 110 to go.

Thanks Andres!

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Andres Freund
On 2018-03-31 20:25:24 -0700, Peter Geoghegan wrote:
> On Sat, Mar 31, 2018 at 8:09 PM, Peter Geoghegan  wrote:
> > I was thinking of using rint(), which is what you get if you call
> > round(float8) from SQL.
> 
> Attached patch does it that way. Note that there are float/int cast
> regression tests that ensure that rint() behaves consistently on
> supported platforms -- see commit 06bf0dd6.

LGTM, pushed.  Closing CF entry. Yay!   Only 110 to go.

- Andres



Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Peter Geoghegan
On Sat, Mar 31, 2018 at 8:09 PM, Peter Geoghegan  wrote:
> I was thinking of using rint(), which is what you get if you call
> round(float8) from SQL.

Attached patch does it that way. Note that there are float/int cast
regression tests that ensure that rint() behaves consistently on
supported platforms -- see commit 06bf0dd6.

-- 
Peter Geoghegan
From ff0bd32d33ceb6b5650e28d76ee794961862a4fc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sat, 31 Mar 2018 19:54:48 -0700
Subject: [PATCH] Fix non-portable call to round().

round() is from C99.  Apparently, not all supported versions of Visual
Studio make it available.  Use rint() instead.  There are behavioral
differences between round() and rint(), but they should not matter to
the Bloom filter optimal_k() function.  We already assume POSIX behavior
for rint(), so there is no question of rint() not using "rounds towards
nearest" as its rounding mode.

Cleanup from commit 51bc271790eb234a1ba4d14d3e6530f70de92ab5.

Per buildfarm member thrips.

Author: Peter Geoghegan
---
 src/backend/lib/bloomfilter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
index eb08f4a..3565480 100644
--- a/src/backend/lib/bloomfilter.c
+++ b/src/backend/lib/bloomfilter.c
@@ -240,7 +240,7 @@ my_bloom_power(uint64 target_bitset_bits)
 static int
 optimal_k(uint64 bitset_bits, int64 total_elems)
 {
-	int			k = round(log(2.0) * bitset_bits / total_elems);
+	int			k = rint(log(2.0) * bitset_bits / total_elems);
 
 	return Max(1, Min(k, MAX_HASH_FUNCS));
 }
-- 
2.7.4



Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Peter Geoghegan
On Sat, Mar 31, 2018 at 8:08 PM, Andres Freund  wrote:
>> round() is from C99, apparently. I'll investigate a fix.
>
> Just replacing it with a floor(val + 0.5) ought to do the trick, right?

I was thinking of using rint(), which is what you get if you call
round(float8) from SQL.

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Andres Freund
On 2018-03-31 19:43:45 -0700, Peter Geoghegan wrote:
> On Sat, Mar 31, 2018 at 3:15 PM, Peter Geoghegan  wrote:
> > On Sat, Mar 31, 2018 at 2:59 PM, Peter Geoghegan  wrote:
> >> WFM. I have all the information I need to produce the next revision now.
> >
> > I might as well post this one first. I'll have 0002 for you in a short 
> > while.
> 
> Looks like thrips doesn't like this, though other Windows buildfarm
> animals are okay with it.
> 
> round() is from C99, apparently. I'll investigate a fix.

Just replacing it with a floor(val + 0.5) ought to do the trick, right?

Greetings,

Andres Freund



Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Peter Geoghegan
On Sat, Mar 31, 2018 at 3:15 PM, Peter Geoghegan  wrote:
> On Sat, Mar 31, 2018 at 2:59 PM, Peter Geoghegan  wrote:
>> WFM. I have all the information I need to produce the next revision now.
>
> I might as well post this one first. I'll have 0002 for you in a short while.

Looks like thrips doesn't like this, though other Windows buildfarm
animals are okay with it.

round() is from C99, apparently. I'll investigate a fix.

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Peter Geoghegan
On Sat, Mar 31, 2018 at 3:15 PM, Peter Geoghegan  wrote:
>> WFM. I have all the information I need to produce the next revision now.
>
> I might as well post this one first. I'll have 0002 for you in a short while.

Attached is 0002 -- the amcheck enhancement itself. As requested by
Andres, this adds a new overloaded set of functions, rather than
dropping and recreating functions to change their signature.

I'm pretty sure that avoiding issues with dependencies by using this
approach is unprecedented, so I had to use my own judgement on how to
deal with a couple of things. I decided not to create a new C symbol
for the new function versions, preferring to leave it to the existing
PG_NARGS() tests. I guess this was probably what you intended I should
do, based on your "Given the PG_NARGS() checks..." remark. I also
chose to not document the single argument functions in the docs. I
suppose that we should consider these to be implementation details of
a work-around for dependency breakage, something that doesn't need to
be documented. That's a bit like how we don't document functions
within certain extensions that are designed just to get called within
a view definition. I don't feel strongly about it, though.

No other changes to report. I did mention that this would have a few
small changes yesterday; no need to repeat the details now.

Thanks
-- 
Peter Geoghegan
From 9e2c1469013374117209c9c0c00e12ecf10ab7b9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH v8 2/2] Add amcheck verification of heap relations.

Add a new, optional capability to bt_index_check() and
bt_index_parent_check():  check that each heap tuple that should have an
index entry does in fact have one.  The extra checking is performed at
the end of the existing nbtree checks.

This is implemented by using a Bloom filter data structure.  The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX).  The
Bloom filter is populated during the initial index verification scan.

Reusing the CREATE INDEX infrastructure allows the new verification
option to automatically benefit from the heap consistency checks that
CREATE INDEX already performs.  CREATE INDEX does thorough sanity
checking of HOT chains, so the new check actually manages to detect
problems in heap-only tuples.

Author: Peter Geoghegan
Reviewed-By: Pavan Deolasee, Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzm5VmG7cu1N-H=nnS57wZThoSDQU+F5dewx3o84M+jY=g...@mail.gmail.com
---
 contrib/amcheck/Makefile |   2 +-
 contrib/amcheck/amcheck--1.0--1.1.sql|  29 +++
 contrib/amcheck/amcheck.control  |   2 +-
 contrib/amcheck/expected/check_btree.out |  12 +-
 contrib/amcheck/sql/check_btree.sql  |   7 +-
 contrib/amcheck/verify_nbtree.c  | 343 ---
 doc/src/sgml/amcheck.sgml| 126 +---
 7 files changed, 458 insertions(+), 63 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 000..088416e
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,29 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+-- In order to avoid issues with dependencies when updating amcheck to 1.1,
+-- create new, overloaded versions of the 1.0 functions
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass,
+heapallindexed boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass,
+heapallindexed boolean)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying relation integrity'

Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Peter Geoghegan
On Sat, Mar 31, 2018 at 2:59 PM, Peter Geoghegan  wrote:
> WFM. I have all the information I need to produce the next revision now.

I might as well post this one first. I'll have 0002 for you in a short while.

-- 
Peter Geoghegan


v8-0001-Add-Bloom-filter-data-structure-implementation.patch
Description: Binary data


Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Peter Geoghegan
On Sat, Mar 31, 2018 at 2:56 PM, Andres Freund  wrote:
>> So you're asking for something like bt_index_check_heap() +
>> bt_index_parent_check_heap()? Or, are you talking about function
>> overloading?
>
> The latter. That addresses my concerns about dropping the function and
> causing issues due to dependencies.

WFM. I have all the information I need to produce the next revision now.

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Andres Freund
On 2018-03-31 11:27:14 -0700, Peter Geoghegan wrote:
> On Fri, Mar 30, 2018 at 7:04 PM, Andres Freund  wrote:
> > I'm just saying that there should be two functions here, rather than 
> > dropping the old definition, and creating a new one with a default argument.
> 
> So you're asking for something like bt_index_check_heap() +
> bt_index_parent_check_heap()? Or, are you talking about function
> overloading?

The latter. That addresses my concerns about dropping the function and
causing issues due to dependencies.

Greetings,

Andres Freund



Re: [HACKERS] A design for amcheck heapam verification

2018-03-31 Thread Peter Geoghegan
On Fri, Mar 30, 2018 at 7:04 PM, Andres Freund  wrote:
> I'm just saying that there should be two functions here, rather than dropping 
> the old definition, and creating a new one with a default argument.

So you're asking for something like bt_index_check_heap() +
bt_index_parent_check_heap()? Or, are you talking about function
overloading?

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-30 Thread Peter Geoghegan
On Fri, Mar 30, 2018 at 6:55 PM, Peter Geoghegan  wrote:
> On Fri, Mar 30, 2018 at 6:20 PM, Andres Freund  wrote:
>>> What would the upcasting you have in mind look like?
>>
>> Just use UINT64CONST()?  Let's try not to introduce further code that'll
>> need to get painfully fixed.
>
> What I have right now is based on imitating the style that Tom uses.
> I'm pretty sure that I did something like that in the patch I posted
> that became 9563d5b5, which Tom editorialized to be in
> "maintenance_work_mem * 1024L" style. That was only about 2 years ago.
>
> I'll go ahead and use UINT64CONST(), as requested, but I wish that the
> messaging on the right approach to such a standard question of style
> was not contradictory.

BTW, it's not obvious that using UINT64CONST() is going to become the
standard in the future. You may recall that commit 79e0f87a156 fixed a
bug that came from using Size instead of long in tuplesort.c;
tuplesort expects a signed type, since availMem must occasionally go
negative. Noah was not aware of the convention of using long for
work_mem calculations, imagining quite reasonably (but incorrectly)
that it would break on Windows, and in the process missed this subtle
negative-availMem requirement.

The 79e0f87a156 fix changed tuplesort to use int64 (even though it
could have changed tuplesort back to using long without consequence
instead), which I thought might spread further and
eventually become a coding standard to follow. The point of things
like coding standards around expanding work_mem KB arguments to bytes,
or the MaxAllocSize thing, is that they cover a wide variety of cases
quite well, without much danger. Now, as it happens the Bloom filter
doesn't need to think about something like a negative availMem, so I
could use uint64 (or UINT64CONST()) for the size of the allocation.
But let's not pretend that that doesn't have its own problems. Am I
expected to learn everyone's individual preferences and prejudices
here?

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-30 Thread Andres Freund


On March 30, 2018 6:55:50 PM PDT, Peter Geoghegan  wrote:
>On Fri, Mar 30, 2018 at 6:20 PM, Andres Freund 
>wrote:
>>> What would the upcasting you have in mind look like?
>>
>> Just use UINT64CONST()?  Let's try not to introduce further code
>that'll
>> need to get painfully fixed.
>
>What I have right now is based on imitating the style that Tom uses.
>I'm pretty sure that I did something like that in the patch I posted
>that became 9563d5b5, which Tom editorialized to be in
>"maintenance_work_mem * 1024L" style. That was only about 2 years ago.
>
>I'll go ahead and use UINT64CONST(), as requested, but I wish that the
>messaging on the right approach to such a standard question of style
>was not contradictory.
>
>> Well, they're not supposed to be, but they are. Practical realities
>> matter...  But I think it's moot because we don't use any of the bad
>> ones, I only got to that later...
>
>Okay. My point was that that's not really the right level to solve the
>problem (if that was a problem we had, which it turns out it isn't).
>Not that practical realities don't matter.
>
>>> This sounds like a general argument against ever changing a
>function's
>>> signature. It's not like we haven't done that a number of times in
>>> extensions like pageinspect. Does it really matter?
>>
>> Yes, it does. And imo we shouldn't.
>
>My recollection is that you didn't like the original bt_index_check()
>function signature in the final v10 CF because it didn't allow you to
>add arbitrary new arguments in the future; the name was too
>prescriptive to support that. You wanted to only have one function
>signature, envisioning a time when there'd be many arguments, that
>you'd typically invoke using the named function argument (=>) syntax,
>since many arguments may not actually be interesting, depending on the
>exact details. I pushed back specifically because I thought there
>should be simple rules for the heavyweight lock strength --
>bt_index_check() should always acquire an ASL and only an ASL, and so
>on. I also thought that we were unlikely to need many more options
>that specifically deal with B-Tree indexes.
>
>You brought this up again recently, recalling that my original
>preferred signature style (the one that we went with) was bad because
>it now necessitates altering a function signature to add a new
>argument. I must admit that I am rather confused. Weren't *you* the
>one that wanted to add lots of new arguments in the future? As I said,
>I'm sure that I'm done adding new arguments to bt_index_check() +
>bt_index_parent_check(). It's possible that there'll be another way to
>get essentially the same functionality at a coarser granularity (e.g.
>database, table), certainly, but I don't see that there is much more
>that we can do while starting with a B-Tree index as our target.
>
>Please propose an alternative user interface for the new check. If you
>prefer, I can invent new bt_index_check_heap() +
>bt_index_parent_check_heap() variants. That would be okay with me.

I'm just saying that there should be two functions here, rather than dropping 
the old definition, and creating a new one with a default argument.

(Phone, more another time)

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] A design for amcheck heapam verification

2018-03-30 Thread Peter Geoghegan
On Fri, Mar 30, 2018 at 6:20 PM, Andres Freund  wrote:
>> What would the upcasting you have in mind look like?
>
> Just use UINT64CONST()?  Let's try not to introduce further code that'll
> need to get painfully fixed.

What I have right now is based on imitating the style that Tom uses.
I'm pretty sure that I did something like that in the patch I posted
that became 9563d5b5, which Tom editorialized to be in
"maintenance_work_mem * 1024L" style. That was only about 2 years ago.

I'll go ahead and use UINT64CONST(), as requested, but I wish that the
messaging on the right approach to such a standard question of style
was not contradictory.

> Well, they're not supposed to be, but they are. Practical realities
> matter...  But I think it's moot because we don't use any of the bad
> ones, I only got to that later...

Okay. My point was that that's not really the right level to solve the
problem (if that was a problem we had, which it turns out it isn't).
Not that practical realities don't matter.

>> This sounds like a general argument against ever changing a function's
>> signature. It's not like we haven't done that a number of times in
>> extensions like pageinspect. Does it really matter?
>
> Yes, it does. And imo we shouldn't.

My recollection is that you didn't like the original bt_index_check()
function signature in the final v10 CF because it didn't allow you to
add arbitrary new arguments in the future; the name was too
prescriptive to support that. You wanted to only have one function
signature, envisioning a time when there'd be many arguments, that
you'd typically invoke using the named function argument (=>) syntax,
since many arguments may not actually be interesting, depending on the
exact details. I pushed back specifically because I thought there
should be simple rules for the heavyweight lock strength --
bt_index_check() should always acquire an ASL and only an ASL, and so
on. I also thought that we were unlikely to need many more options
that specifically deal with B-Tree indexes.

You brought this up again recently, recalling that my original
preferred signature style (the one that we went with) was bad because
it now necessitates altering a function signature to add a new
argument. I must admit that I am rather confused. Weren't *you* the
one that wanted to add lots of new arguments in the future? As I said,
I'm sure that I'm done adding new arguments to bt_index_check() +
bt_index_parent_check(). It's possible that there'll be another way to
get essentially the same functionality at a coarser granularity (e.g.
database, table), certainly, but I don't see that there is much more
that we can do while starting with a B-Tree index as our target.

Please propose an alternative user interface for the new check. If you
prefer, I can invent new bt_index_check_heap() +
bt_index_parent_check_heap() variants. That would be okay with me.

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-30 Thread Andres Freund
On 2018-03-29 19:42:42 -0700, Peter Geoghegan wrote:
> >> + /*
> >> +  * Aim for two bytes per element; this is sufficient to get a false
> >> +  * positive rate below 1%, independent of the size of the bitset or 
> >> total
> >> +  * number of elements.  Also, if rounding down the size of the 
> >> bitset to
> >> +  * the next lowest power of two turns out to be a significant drop, 
> >> the
> >> +  * false positive rate still won't exceed 2% in almost all cases.
> >> +  */
> >> + bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
> >> + /* Minimum allowable size is 1MB */
> >> + bitset_bytes = Max(1024L * 1024L, bitset_bytes);
> >
> > Some upcasting might be advisable, to avoid dangers of overflows?
> 
> When it comes to sizing work_mem, using long literals to go from KB to
> bytes is How It's Done™. I actually think that's silly myself, because
> it's based on the assumption that long is wider than int, even though
> it isn't on Windows. But that's okay because we have the old work_mem
> size limits on Windows.

> What would the upcasting you have in mind look like?

Just use UINT64CONST()?  Let's try not to introduce further code that'll
need to get painfully fixed.


> 
> >> + /* Use 64-bit hashing to get two independent 32-bit hashes */
> >> + hash = DatumGetUInt64(hash_any_extended(elem, len, filter->seed));
> >
> > Hm. Is that smart given how some hash functions are defined? E.g. for
> > int8 the higher bits aren't really that independent for small values:
> 
> Robert suggested that I do this. I don't think that we need to make it
> about the quality of the hash function that we have available. That
> really seems like a distinct question to me. It seems clear that this
> ought to be fine (or should be fine, if you prefer). I understand why
> you're asking about this, but it's not scalable to ask every user of a
> hash function to care that it might be a bit crap. Hash functions
> aren't supposed to be a bit crap.

Well, they're not supposed to be, but they are. Practical realities
matter...  But I think it's moot because we don't use any of the bad
ones, I only got to that later...


> >> +DROP FUNCTION bt_index_check(regclass);
> >> +CREATE FUNCTION bt_index_check(index regclass,
> >> +heapallindexed boolean DEFAULT false)
> >> +RETURNS VOID
> >> +AS 'MODULE_PATHNAME', 'bt_index_check'
> >> +LANGUAGE C STRICT PARALLEL RESTRICTED;
> >
> > This breaks functions et al referencing the existing function.
> 
> This sounds like a general argument against ever changing a function's
> signature. It's not like we haven't done that a number of times in
> extensions like pageinspect. Does it really matter?

Yes, it does. And imo we shouldn't.


> > Also, I
> > don't quite recall the rules, don't we have to drop the function from
> > the extension first?
> 
> But...I did drop the function?

I mean in the ALTER EXTENSION ... DROP FUNCTION sense.



Greetings,

Andres Freund



Re: [HACKERS] A design for amcheck heapam verification

2018-03-30 Thread Peter Geoghegan
On Thu, Mar 29, 2018 at 7:42 PM, Peter Geoghegan  wrote:
>> We should be able to get this into v11...
>
> That's a relief. Thanks.

I have a new revision lined up. I won't send anything to the list
until you clear up what you meant in those few cases where it seemed
unclear.

I also acted on some of the feedback from Pavan, which I'd previously
put off/deferred. It seemed like there was no reason not to do it his
way when it came to minor stylistic points that were non-issues to me.

Finally, I added something myself:

    /*
     * lp_len should match the IndexTuple reported length exactly, since
     * lp_len is completely redundant in indexes, and both sources of tuple
     * length are MAXALIGN()'d.  nbtree does not use lp_len all that
     * frequently, and is surprisingly tolerant of corrupt lp_len fields.
     */
    if (tupsize != ItemIdGetLength(itemid))
        ereport(ERROR,
                (errcode(ERRCODE_INDEX_CORRUPTED),
                 errmsg("index tuple size does not equal lp_len in index \"%s\"",
                        RelationGetRelationName(state->rel)),
                 errdetail_internal("Index tid=(%u,%u) tuple size=%zu lp_len=%u page lsn=%X/%X.",
                                    state->targetblock, offset,
                                    tupsize, ItemIdGetLength(itemid),
                                    (uint32) (state->targetlsn >> 32),
                                    (uint32) state->targetlsn),
                 errhint("This could be a torn page problem")));

It seems to me that we should take the opportunity to verify each
tuple's IndexTupleSize() value, now that we'll be using it directly.
There happens to be an easy way to do that, so why not just do it?

This is unlikely to find an error that we wouldn't have detected
anyway, even without using the new heapallindexed option. However, it
seems likely that this error message is more accurate in the real
world cases where it will be seen. A torn page can leave us with a
page image that looks surprisingly not-so-corrupt.

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-29 Thread Peter Geoghegan
On Thu, Mar 29, 2018 at 6:18 PM, Andres Freund  wrote:
>> This commit adds a test harness extension module, test_bloomfilter.  It
>> can be used to get a sense of how the Bloom filter implementation
>> performs under varying conditions.
>
> Maybe add a short paragraph explaining what this'll be used for soon.

Sure.

>> @@ -12,7 +12,7 @@ subdir = src/backend/lib
>>  top_builddir = ../../..
>>  include $(top_builddir)/src/Makefile.global
>>
>> -OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
>> -knapsack.o pairingheap.o rbtree.o stringinfo.o
>> +OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
>> +   ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
>
> *NOT* for this patch: I really wonder whether we should move towards a
> style where there's only ever a single object per-line. Would make
> things like this easier to view and conflicts easier to resolve.

That does seem like it would be a minor improvement.

>> --- /dev/null
>> +++ b/src/backend/lib/bloomfilter.c
>> @@ -0,0 +1,304 @@
>> +/*-
>> + *
>> + * bloomfilter.c
>> + *   Space-efficient set membership testing
>> + *
>> + * A Bloom filter is a probabilistic data structure that is used to test an
>> + * element's membership of a set.
>
> s/of/in/?

Wikipedia says "A Bloom filter is a space-efficient probabilistic data
structure, conceived by Burton Howard Bloom in 1970, that is used to
test whether an element is a member of a set". I think that either
works just as well.

>> False positives are possible, but false
>> + * negatives are not; a test of membership of the set returns either 
>> "possibly
>> + * in set" or "definitely not in set".  This can be very space efficient 
>> when
>> + * individual elements are larger than a few bytes, because elements are 
>> hashed
>> + * in order to set bits in the Bloom filter bitset.
>
> The second half of this paragraph isn't easy to understand.

I'll tweak it.

>> + * Elements can be added to the set, but not removed.  The more elements 
>> that
>> + * are added, the larger the probability of false positives.  Caller must 
>> hint
>> + * an estimated total size of the set when its Bloom filter is initialized.
>> + * This is used to balance the use of memory against the final false 
>> positive
>> + * rate.
>
> s/its Bloom/the Bloom/?

Okay. I'll give you that one.

>> + * The implementation is well suited to data synchronization problems 
>> between
>> + * unordered sets, especially where predictable performance is important and
>> + * some false positives are acceptable.
>
> I'm not finding "data synchronization" very descriptive. Makes me think
> of multi-threaded races and such.

Again, this is from Wikipedia:

https://en.wikipedia.org/wiki/Bloom_filter#Data_synchronization

https://en.wikipedia.org/wiki/Data_synchronization

>> +/*
>> + * Create Bloom filter in caller's memory context.  This should get a false
>> + * positive rate of between 1% and 2% when bitset is not constrained by 
>> memory.
>
> s/should/aims at/?

I think that "should" is accurate, and no less informative. I'm not
going to argue with you, though -- I'll change it.

>> + * total_elems is an estimate of the final size of the set.  It ought to be
>> + * approximately correct, but we can cope well with it being off by perhaps 
>> a
>> + * factor of five or more.  See "Bloom Filters in Probabilistic 
>> Verification"
>> + * (Dillinger & Manolios, 2004) for details of why this is the case.
>
> I'd simplify the language here. I'd replace ought with should at the
> very least.  Replace we with "the bloom filter" or similar?

I don't see what's wrong with "ought", but okay. I don't see what's
wrong with "we", but okay.

>> + * bloom_work_mem is sized in KB, in line with the general work_mem 
>> convention.
>> + * This determines the size of the underlying bitset (trivial bookkeeping 
>> space
>> + * isn't counted).  The bitset is always sized as a power-of-two number of
>> + * bits, and the largest possible bitset is 512MB.  The implementation 
>> rounds
>> + * down as needed.
>
> "as needed" should be expanded. Just say ~"Only the required amount of
> memory is allocated"?

Okay.

>> +bloom_filter *
>> +bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
>> +{
>> + bloom_filter *filter;
>> + int bloom_power;
>> + uint64  bitset_bytes;
>> + uint64  bitset_bits;
>> +
>> + /*
>> +  * Aim for two bytes per element; this is sufficient to get a false
>> +  * positive rate below 1%, independent of the size of the bitset or 
>> total
>> +  * number of elements.  Also, if rounding down the size of the bitset 
>> to
>> +  * the next lowest power of two turns out to be a significant drop, the
>> +  * false positive rate still won't exceed 2% in almost all cases.
>> +  */
>> + bitset_bytes 

Re: [HACKERS] A design for amcheck heapam verification

2018-03-29 Thread Andres Freund
Hi,

On 2018-03-26 20:20:57 -0700, Peter Geoghegan wrote:
> From ede1ba731dc818172a94adbb6331323c1f2b1170 Mon Sep 17 00:00:00 2001
> From: Peter Geoghegan 
> Date: Thu, 24 Aug 2017 20:58:21 -0700
> Subject: [PATCH v7 1/2] Add Bloom filter data structure implementation.
> 
> A Bloom filter is a space-efficient, probabilistic data structure that
> can be used to test set membership.  Callers will sometimes incur false
> positives, but never false negatives.  The rate of false positives is a
> function of the total number of elements and the amount of memory
> available for the Bloom filter.
> 
> Two classic applications of Bloom filters are cache filtering, and data
> synchronization testing.  Any user of Bloom filters must accept the
> possibility of false positives as a cost worth paying for the benefit in
> space efficiency.
> 
> This commit adds a test harness extension module, test_bloomfilter.  It
> can be used to get a sense of how the Bloom filter implementation
> performs under varying conditions.

Maybe add a short paragraph explaining what this'll be used for soon.

> @@ -12,7 +12,7 @@ subdir = src/backend/lib
>  top_builddir = ../../..
>  include $(top_builddir)/src/Makefile.global
>  
> -OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
> -knapsack.o pairingheap.o rbtree.o stringinfo.o
> +OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
> +   ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o

*NOT* for this patch: I really wonder whether we should move towards a
style where there's only ever a single object per-line. Would make
things like this easier to view and conflicts easier to resolve.


> --- /dev/null
> +++ b/src/backend/lib/bloomfilter.c
> @@ -0,0 +1,304 @@
> +/*-
> + *
> + * bloomfilter.c
> + *   Space-efficient set membership testing
> + *
> + * A Bloom filter is a probabilistic data structure that is used to test an
> + * element's membership of a set.

s/of/in/?


> False positives are possible, but false
> + * negatives are not; a test of membership of the set returns either 
> "possibly
> + * in set" or "definitely not in set".  This can be very space efficient when
> + * individual elements are larger than a few bytes, because elements are 
> hashed
> + * in order to set bits in the Bloom filter bitset.

The second half of this paragraph isn't easy to understand.


> + * Elements can be added to the set, but not removed.  The more elements that
> + * are added, the larger the probability of false positives.  Caller must 
> hint
> + * an estimated total size of the set when its Bloom filter is initialized.
> + * This is used to balance the use of memory against the final false positive
> + * rate.

s/its Bloom/the Bloom/?


> + * The implementation is well suited to data synchronization problems between
> + * unordered sets, especially where predictable performance is important and
> + * some false positives are acceptable.

I'm not finding "data synchronization" very descriptive. Makes me think
of multi-threaded races and such.



> +/*
> + * Create Bloom filter in caller's memory context.  This should get a false
> + * positive rate of between 1% and 2% when bitset is not constrained by memory.

s/should/aims at/?


> + * total_elems is an estimate of the final size of the set.  It ought to be
> + * approximately correct, but we can cope well with it being off by perhaps a
> + * factor of five or more.  See "Bloom Filters in Probabilistic Verification"
> + * (Dillinger & Manolios, 2004) for details of why this is the case.

I'd simplify the language here. I'd replace ought with should at the
very least.  Replace we with "the bloom filter" or similar?


> + * bloom_work_mem is sized in KB, in line with the general work_mem convention.
> + * This determines the size of the underlying bitset (trivial bookkeeping space
> + * isn't counted).  The bitset is always sized as a power-of-two number of
> + * bits, and the largest possible bitset is 512MB.  The implementation rounds
> + * down as needed.

"as needed" should be expanded. Just say ~"Only the required amount of
memory is allocated"?


> +bloom_filter *
> +bloom_create(int64 total_elems, int bloom_work_mem, uint32 seed)
> +{
> + bloom_filter *filter;
> + int bloom_power;
> + uint64  bitset_bytes;
> + uint64  bitset_bits;
> +
> + /*
> +  * Aim for two bytes per element; this is sufficient to get a false
> +  * positive rate below 1%, independent of the size of the bitset or total
> +  * number of elements.  Also, if rounding down the size of the bitset to
> +  * the next lowest power of two turns out to be a significant drop, the
> +  * false positive rate still won't exceed 2% in almost all cases.
> +  */
> + bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
> +
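
To make the sizing maths concrete, here is a rough model of the rules
described in that comment (a Python sketch, not the actual C code; the 1MB
bitset floor is an assumption about the implementation, not something
stated above):

```python
import math

def bloom_parameters(total_elems, work_mem_kb):
    """Sketch of the Bloom filter sizing rules: aim for two bytes per
    element, cap by the work_mem budget, round the bitset down to a
    power-of-two number of bits, then derive the optimal hash count and
    the textbook false positive estimate."""
    # Aim for two bytes per element, capped by the work_mem budget.
    bitset_bytes = min(work_mem_kb * 1024, total_elems * 2)
    # Assumed floor, so tiny estimates still get a usable bitset.
    bitset_bytes = max(bitset_bytes, 1024 * 1024)
    # Round the bitset down to a power-of-two number of bits.
    bitset_bits = 1 << ((bitset_bytes * 8).bit_length() - 1)
    # Optimal hash function count for this load: k = ln(2) * m / n.
    k = max(1, round(math.log(2) * bitset_bits / total_elems))
    # Textbook false positive estimate: (1 - e^(-k*n/m))^k.
    fpr = (1.0 - math.exp(-k * total_elems / bitset_bits)) ** k
    return bitset_bits, k, fpr
```

For total_elems = 1,000,000 and bloom_work_mem = 64MB, this gives a bitset
of 2^23 bits, k = 6 hash functions, and an estimated false positive rate a
little under 2%, consistent with the 1%-2% target mentioned above.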

Re: [HACKERS] A design for amcheck heapam verification

2018-03-29 Thread Peter Geoghegan
On Thu, Mar 29, 2018 at 2:28 AM, Simon Riggs  wrote:
> I understand we are adding a check to verify heap against index and
> this will take longer than before. When it runs does it report
> progress of the run via pg_stat_activity, so we can monitor how long
> it will take?

My point to Pavan was that the checker function that uses a weaker
lock, bt_index_check(), imitates a CREATE INDEX CONCURRENTLY. However,
unlike CIC, it acquires an AccessShareLock (ASL), not a
ShareUpdateExclusiveLock. I
believe that this is safe, because we only actually perform a process
that's similar to the first heap scan of a CIC, without the other CIC
steps. (The variant that uses a ShareLock, bt_index_parent_check(),
imitates a CREATE INDEX, so no difference in lock strength there.)

amcheck is still just a couple of SQL-callable functions, so the
closest thing that there is to progress monitoring is DEBUG traces,
which I've found to be informative in real-world situations. Of
course, only a well informed user can interpret that. No change there.

> Locking is also an important concern.
>
> If we need a ShareLock to run the extended check and the check runs
> for a long time, when would we decide to run that? This sounds like it
> will cause a production outage, so what are the pre-conditions that
> would lead us to say "we'd better run this". For example, if running
> this is known to be significantly faster than running CREATE INDEX,
> that might be an argument for someone to run this first if index
> corruption is suspected.

I think that most people will choose to use bt_index_check(), since,
as I said, it only requires an ASL, as opposed to
bt_index_parent_check(), which requires a ShareLock. This is
especially likely because there isn't much difference between how
thorough verification is. The fact that bt_index_check() is likely the
best thing to use for routine verification is documented in the
v10/master version [1], which hasn't changed. That said, there are
definitely environments where nobody cares a jot that a ShareLock is
needed, which is why we have bt_index_parent_check(). Often, amcheck
is used when the damage is already done, so that it can be fully
assessed.

This patch does not really change the interface; it just adds a new
heapallindexed argument, which has a default of 'false'.
bt_index_check() is unlikely to cause too many problems in production.
Heroku ran it on all databases at one point, and it wasn't too much
trouble. Of course, that was a version that lacked this heapallindexed
enhancement, which slows down verification by rather a lot when
actually used. My rough estimate is that heapallindexed verification
makes the process take 5x longer.

> If it detects an issue, can it fix the issue for the index by
> injecting correct entries? If not then we will have to run CREATE
> INDEX afterwards anyway, which makes it more likely that people would
> just run CREATE INDEX and not bother with the check.

It does not fix anything. I think that the new check is far more
likely to find problems in the heap than in the index, which is the
main reason for this.

The new check can only begin to run at the point where the index
structure has been found to be self-consistent, which is the main
reason why it seems like more of a heap checker than an index checker.
Also, that's what it actually found in the couple of interesting cases
that we've seen. It detects "freeze the dead" corruption, at least
with the test cases we have available. It also detects corruption
caused by failing to detect broken HOT chains during an initial CREATE
INDEX; there were two such bugs in CREATE INDEX CONCURRENTLY, one in
2012, and another in 2017, as you'll recall. The 2017 one (the CIC bug
that Pavan found through mechanical testing) went undetected for a
very long time; I think that a tool like this greatly increases our
odds of early detection of that kind of thing.

These are both issues that kind of seem like index corruption, that
are actually better understood as heap corruption. It's subtle.

> So my initial questions are about when we would run this and making
> sure that is documented.

Good question. I find it hard to offer general advice about this, to
be honest. In theory, you should never need to run it, because
everything should work, and in practice that's generally almost true.
I've certainly used it when investigating problems after the fact, and
as a general smoke-test, where it works well. I would perhaps
recommend running it once a week, although that's a fairly arbitrary
choice. The docs in v10 don't take a position on this, so while I tend
to agree we could do better, it is a preexisting issue.

[1] https://www.postgresql.org/docs/10/static/amcheck.html#id-1.11.7.11.7
-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-29 Thread Simon Riggs
On 29 March 2018 at 01:49, Peter Geoghegan  wrote:

>>> IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
>>> ShareUpdateExclusiveLock (it just says SharedLock), because that lock
>>> strength doesn't have anything to do with IndexBuildHeapRangeScan()
>>> when it operates with an MVCC snapshot. I think that this means that
>>> this patch doesn't need to update comments within
>>> IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
>>> independent.
>>
>>
>> Ok, I agree. But note that we are now invoking that code with
>> AccessShareLock() whereas the existing callers either use ShareLock or
>> ShareUpdateExclusiveLock. That probably does not matter, but it's a change
>> worth noting.
>
> Fair point, even though the ShareUpdateExclusiveLock case isn't
> actually acknowledged by IndexBuildHeapRangeScan(). Can we leave this
> one up to the committer, too? I find it very hard to argue either for
> or against this, and I want to avoid "analysis paralysis" at this
> important time.

The above discussion doesn't make sense to me, hope someone will explain.

I understand we are adding a check to verify heap against index and
this will take longer than before. When it runs does it report
progress of the run via pg_stat_activity, so we can monitor how long
it will take?

Locking is also an important concern.

If we need a ShareLock to run the extended check and the check runs
for a long time, when would we decide to run that? This sounds like it
will cause a production outage, so what are the pre-conditions that
would lead us to say "we'd better run this". For example, if running
this is known to be significantly faster than running CREATE INDEX,
that might be an argument for someone to run this first if index
corruption is suspected.

If it detects an issue, can it fix the issue for the index by
injecting correct entries? If not then we will have to run CREATE
INDEX afterwards anyway, which makes it more likely that people would
just run CREATE INDEX and not bother with the check.

So my initial questions are about when we would run this and making
sure that is documented.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Peter Geoghegan
On Wed, Mar 28, 2018 at 5:47 AM, Pavan Deolasee
 wrote:
> Mostly a nitpick, but I guess we should leave a comment after
> IndexBuildHeapScan() saying heap_endscan() is not necessary since
> IndexBuildHeapScan() does that internally. I stumbled upon that while
> looking for any potential leaks. I know at least one other caller of
> IndexBuildHeapScan() doesn't bother to say anything either, but it's
> helpful.

Fair point. Again, I'm going to suggest deferring to the committer. I
seem to have decision fatigue this week.

> FWIW I also looked at the 0001 patch and it looks fine to me.

I'm grateful that you didn't feel any need to encourage me to use
whatever the novel/variant filter du jour is! :-)

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Peter Geoghegan
On Wed, Mar 28, 2018 at 5:04 AM, Pavan Deolasee
 wrote:
> Hmm Ok. It still sounds backwards to me, but then English is not my first or
> even second language. "A heap scan later verifies the presence in the heap
> of all index tuples fingerprinted" sounds as if we expect to find all
> fingerprinted index tuples in the heap. But in reality, we check if all heap
> tuples have a fingerprinted index tuple.

Why don't we leave this to the committer that picks the patch up? I
don't actually mind very much. I suspect that I am too close to the
problem to be sure that I've explained it in the clearest possible
way.

>> You're right - there is a narrow window for REPEATABLE READ and
>> SERIALIZABLE transactions. This is a regression in v6, the version that
>> removed the TransactionXmin test.
>>
>> I am tempted to fix this by calling GetLatestSnapshot() instead of
>> GetTransactionSnapshot(). However, that has a problem of its own -- it
>> won't work in parallel mode, and we're dealing with parallel
>> restricted functions, not parallel unsafe functions. I don't want to
>> change that to fix such a narrow issue. IMV, a better fix is to treat
>> this as a serialization failure. Attached patch, which applies on top
>> of v7, shows what I mean.
>
>
> Ok. I am happy with that.

Cool.

> Well pg_index entry will be visible and should be visible. Otherwise how do
> you even maintain a newly created index. I am not sure, but I guess we take
> fresh MVCC snapshots for catalog searches, even in RR transactions.

True, but that isn't what would happen with an SQL query that queries
the system catalogs. That only applies to how the system catalogs are
accessed internally, not how they'd almost certainly be accessed when
using amcheck.

>> > Are there any interesting cases around
>> > INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
>> > tuples, especially if those tuples were inserted/deleted by our own
>> > transaction? It's probably worth thinking about.
>
>
> Anything here that you would like to check? I understand that you may see
> such tuples only for catalog tables.

I think that the WARNING ought to be fine. We shouldn't ever see it,
but if we do it's probably due to a bug in an extension or something.
It doesn't seem particularly likely that someone could ever insert
into the table concurrently despite our having a ShareLock on the
relation. Even with corruption.

>> IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
>> ShareUpdateExclusiveLock (it just says SharedLock), because that lock
>> strength doesn't have anything to do with IndexBuildHeapRangeScan()
>> when it operates with an MVCC snapshot. I think that this means that
>> this patch doesn't need to update comments within
>> IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
>> independent.
>
>
> Ok, I agree. But note that we are now invoking that code with
> AccessShareLock() whereas the existing callers either use ShareLock or
> ShareUpdateExclusiveLock. That probably does not matter, but it's a change
> worth noting.

Fair point, even though the ShareUpdateExclusiveLock case isn't
actually acknowledged by IndexBuildHeapRangeScan(). Can we leave this
one up to the committer, too? I find it very hard to argue either for
or against this, and I want to avoid "analysis paralysis" at this
important time.

> While that's true, I am wondering if there is anything we can do to stop a
> consistency-checking utility from creating other non-trivial side effects
> on the state of the database, even if that means we cannot check all heap
> tuples. For example, could there be a way to allow concurrent vacuum or
> HOT pruning to continue to prune away dead tuples, even if that means
> running bt_check_index() in some specialised way (such as not allowing it
> in a transaction block) and the heap scan might miss some tuples? I don't
> know the answer to that, but given that bt_check_index() may sometimes
> take hours if not days to finish, it seems worth thinking about, or at
> least documenting, the side-effects somewhere.

That seems like a worthwhile goal for a heap checker that only checks
the structure of the heap, rather than something that checks the
consistency of an index against its table. Especially one that can run
without a connection to the database, for things like backup tools,
where performance is really important. There is certainly room for
that. For this particular enhancement, the similarity to CREATE INDEX
seems like an asset.

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Pavan Deolasee
On Wed, Mar 28, 2018 at 2:48 AM, Peter Geoghegan  wrote:

>
> I don't think so. The transaction involved is still an ordinary user
> transaction.
>
>
Mostly a nitpick, but I guess we should leave a comment
after IndexBuildHeapScan() saying heap_endscan() is not necessary
since IndexBuildHeapScan()
does that internally. I stumbled upon that while looking for any potential
leaks. I know at least one other caller of IndexBuildHeapScan() doesn't
bother to say anything either, but it's helpful.

FWIW I also looked at the 0001 patch and it looks fine to me.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] A design for amcheck heapam verification

2018-03-28 Thread Pavan Deolasee
On Wed, Mar 28, 2018 at 2:48 AM, Peter Geoghegan  wrote:

> On Tue, Mar 27, 2018 at 6:48 AM, Pavan Deolasee
>  wrote:
> > + * When index-to-heap verification is requested, a Bloom filter is used to
> > + * fingerprint all tuples in the target index, as the index is traversed to
> > + * verify its structure.  A heap scan later verifies the presence in the heap
> > + * of all index tuples fingerprinted within the Bloom filter.
> > + *
> >
> > Is that correct? Aren't we actually verifying the presence in the index of
> > all heap tuples?
>
> I think that you could describe it either way. We're verifying the
> presence of heap tuples in the heap that ought to have been in the
> index (that is, most of those that were fingerprinted).
>

Hmm Ok. It still sounds backwards to me, but then English is not my first
or even second language. "A heap scan later verifies the presence in the
heap of all index tuples fingerprinted" sounds as if we expect to find all
fingerprinted index tuples in the heap. But in reality, we check if all
heap tuples have a fingerprinted index tuple.


>
> You're right - there is a narrow window for REPEATABLE READ and
> SERIALIZABLE transactions. This is a regression in v6, the version
> removed the TransactionXmin test.
>
> I am tempted to fix this by calling GetLatestSnapshot() instead of
> GetTransactionSnapshot(). However, that has a problem of its own -- it
> won't work in parallel mode, and we're dealing with parallel
> restricted functions, not parallel unsafe functions. I don't want to
> change that to fix such a narrow issue. IMV, a better fix is to treat
> this as a serialization failure. Attached patch, which applies on top
> of v7, shows what I mean.
>

Ok. I am happy with that.


>
> I think that this bug is practically impossible to hit, because we use
> the xmin from the pg_index tuple during "is index safe to use?"
> indcheckxmin/TransactionXmin checks (the patch that I attached adds a
> similar though distinct check), which raises another question for a
> REPEATABLE READ xact. That question is: How is a REPEATABLE READ
> transaction supposed to see the pg_index entry to get the index
> relation's oid to call a verification function in the first place?


Well, the pg_index entry will be visible, and should be visible. Otherwise
how would you even maintain a newly created index? I am not sure, but I
guess we take fresh MVCC snapshots for catalog searches, even in RR
transactions.


> My
> point is that there is no need for a more complicated solution than
> what I propose.
>

I agree on that.


>
> I don't think so. The way we compute OldestXmin for
> IndexBuildHeapRangeScan() is rather like a snapshot acquisition --
> GetOldestXmin() locks the proc array in shared mode. As you pointed
> out, the fact that it comes after everything else (fingerprinting)
> means that it must be equal to or later than what index scans saw,
> that allowed them to do the kill_prior_tuple() stuff (set LP_DEAD
> bits).
>
>
That's probably true.


>
> > Are there any interesting cases around
> > INSERT_IN_PROGRESS/DELETE_IN_PROGRESS tuples, especially if those tuples
> > were inserted/deleted by our own transaction? It's probably worth
> > thinking about.
>

Anything here that you would like to check? I understand that you may see
such tuples only for catalog tables.


>
> IndexBuildHeapRangeScan() doesn't mention anything about CIC's heap
> ShareUpdateExclusiveLock (it just says SharedLock), because that lock
> strength doesn't have anything to do with IndexBuildHeapRangeScan()
> when it operates with an MVCC snapshot. I think that this means that
> this patch doesn't need to update comments within
> IndexBuildHeapRangeScan(). Perhaps that's a good idea, but it seems
> independent.
>

Ok, I agree. But note that we are now invoking that code
with AccessShareLock() whereas the existing callers either use ShareLock or
ShareUpdateExclusiveLock. That probably does not matter, but it's a
change worth noting.


>
> > Is there anything we can do to lessen that burden, like telling other
> > backends to ignore our xmin while computing OldestXmin (like vacuum does)?
>
> I don't think so. The transaction involved is still an ordinary user
> transaction.
>

While that's true, I am wondering if there is anything we can do to stop a
consistency-checking utility from creating other non-trivial side effects
on the state of the database, even if that means we cannot check all heap
tuples. For example, could there be a way to allow concurrent vacuum or HOT
pruning to continue to prune away dead tuples, even if that means running
bt_check_index() in some specialised way (such as not allowing it in a
transaction block) and the heap scan might miss some tuples? I don't know
the answer to that, but given that bt_check_index() may sometimes take
hours if not days to finish, it seems worth thinking about, or at least
documenting, the side-effects somewhere.

Thanks,
Pavan

-- 
 

Re: [HACKERS] A design for amcheck heapam verification

2018-03-27 Thread Peter Geoghegan
On Tue, Mar 27, 2018 at 6:48 AM, Pavan Deolasee
 wrote:
> + * When index-to-heap verification is requested, a Bloom filter is used to
> + * fingerprint all tuples in the target index, as the index is traversed to
> + * verify its structure.  A heap scan later verifies the presence in the heap
> + * of all index tuples fingerprinted within the Bloom filter.
> + *
>
> Is that correct? Aren't we actually verifying the presence in the index of
> all heap tuples?

I think that you could describe it either way. We're verifying the
presence of heap tuples in the heap that ought to have been in the
index (that is, most of those that were fingerprinted).

As I say above the callback routine bt_tuple_present_callback(), we
blame any corruption on the heap, since that's more likely. It could
actually be the index which is corrupt, though, in rare cases where it
manages to pass the index structure tests despite being corrupt. This
general orientation comes through in the comment that you asked about.

> Can we come up with a better name for heapallindexed? Maybe
> "check_heapindexed"?

I don't think that that name is any better, and you're the first to
mention it in almost a year. I'd rather leave it. The fact that it's a
check is pretty clear from context.

> What I am not sure about is whether we can really examine an index which is
> valid but whose indcheckxmin hasn't crossed our xmin horizon. Notice that
> this amcheck might be running in a transaction block, probably in a
> repeatable-read isolation level, and hence GetTransactionSnapshot() might
> actually return a snapshot which can't yet read the index consistently. In
> practice, this is quite unlikely, but I think we should check for that
> case if we agree that it could be a problem.

You're right - there is a narrow window for REPEATABLE READ and
SERIALIZABLE transactions. This is a regression in v6, the version that
removed the TransactionXmin test.

I am tempted to fix this by calling GetLatestSnapshot() instead of
GetTransactionSnapshot(). However, that has a problem of its own -- it
won't work in parallel mode, and we're dealing with parallel
restricted functions, not parallel unsafe functions. I don't want to
change that to fix such a narrow issue. IMV, a better fix is to treat
this as a serialization failure. Attached patch, which applies on top
of v7, shows what I mean.

I think that this bug is practically impossible to hit, because we use
the xmin from the pg_index tuple during "is index safe to use?"
indcheckxmin/TransactionXmin checks (the patch that I attached adds a
similar though distinct check), which raises another question for a
REPEATABLE READ xact. That question is: How is a REPEATABLE READ
transaction supposed to see the pg_index entry to get the index
relation's oid to call a verification function in the first place? My
point is that there is no need for a more complicated solution than
what I propose.

Note that the attached patch doesn't update the existing comments on
TransactionXmin, since I see this as a serialization error, which is a
distinct thing -- I'm not actually doing anything with
TransactionXmin.

> The case with readonly mode is also interesting. Since we're scanning the
> heap with SnapshotAny, the heapscan may return tuples which are
> RECENTLY_DEAD. So the question is: can this happen?
>
> - some concurrent index scan sees a heaptuple as DEAD and marks the index
>   entry as LP_DEAD
> - our index fingerprinting sees index tuple as LP_DEAD
> - our heap scan still sees the heaptuple as RECENTLY_DEAD
>
> Now that looks improbable given that we compute OldestXmin after our index
> fingerprinting was done i.e between step 2 and 3 and hence if a tuple looked
> DEAD to some OldestXmin/RecentGlobalXmin computed before we computed our
> OldestXmin, then surely our OldestXmin should also see the tuple DEAD. Or is
> there a corner case that we're missing?

I don't think so. The way we compute OldestXmin for
IndexBuildHeapRangeScan() is rather like a snapshot acquisition --
GetOldestXmin() locks the proc array in shared mode. As you pointed
out, the fact that it comes after everything else (fingerprinting)
means that it must be equal to or later than what index scans saw,
that allowed them to do the kill_prior_tuple() stuff (set LP_DEAD
bits).

The one exception is Hot Standby mode, where it's possible for the
result of GetOldestXmin() (OldestXmin) to go backwards across
successive calls. However, that's not a problem for us because
readonly heapallindexed verification does not work in Hot Standby
mode.

> Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
> tuples, especially if those tuples were inserted/deleted by our own
> transaction? It's probably worth thinking about.
>
> Apart from that, comments in IndexBuildHeapRangeScan() claim that the
> function
> is called only with ShareLock and above, which is no longer true. We should
> check if that has any side-effects. I can't think of any, but better to
> verify and update the comments to reflect the new reality.

Re: [HACKERS] A design for amcheck heapam verification

2018-03-27 Thread Pavan Deolasee
On Tue, Mar 27, 2018 at 8:50 AM, Peter Geoghegan  wrote:

> On Fri, Mar 23, 2018 at 7:13 AM, Andrey Borodin 
> wrote:
> > I've just flipped patch to WoA. But if above issues will be fixed I
> think that patch is ready for committer.
>
> Attached is v7, which has the small tweaks that you suggested.
>
> Thank you for the review. I hope that this can be committed shortly.
>
>
Sorry for coming a bit too late on this thread, but I started looking at
0002 patch.

  *
+ * When index-to-heap verification is requested, a Bloom filter is used to
+ * fingerprint all tuples in the target index, as the index is traversed to
+ * verify its structure.  A heap scan later verifies the presence in the heap
+ * of all index tuples fingerprinted within the Bloom filter.
+ *

Is that correct? Aren't we actually verifying the presence in the index of
all heap tuples?

@@ -116,37 +140,47 @@ static inline bool
invariant_leq_nontarget_offset(BtreeCheckState *state,
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber
blocknum);

 /*
- * bt_index_check(index regclass)
+ * bt_index_check(index regclass, heapallindexed boolean)

Can we come up with a better name for heapallindexed? Maybe
"check_heapindexed"?

+
+ /*
+ * Register our own snapshot in !readonly case, rather than asking
+ * IndexBuildHeapScan() to do this for us later.  This needs to happen
+ * before index fingerprinting begins, so we can later be certain that
+ * index fingerprinting should have reached all tuples returned by
+ * IndexBuildHeapScan().
+ */
+ if (!state->readonly)
+ snapshot = RegisterSnapshot(GetTransactionSnapshot());
+ }
+

So this looks safe. In !readonly mode, we take the snapshot *before*
fingerprinting the index. Since we're using an MVCC snapshot, any tuple
which is visible to the heapscan must be reachable via an indexscan too. So
we must find the index entry for every heap tuple returned by the scan.
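
The two-phase flow under discussion can be modelled in a few lines
(illustrative names only, not the actual amcheck code; a Python set stands
in for the Bloom filter, so unlike the real filter this model has no false
positives -- the real filter can miss corruption through a false positive,
but never reports corruption that isn't there):

```python
def heapallindexed_check(index_tuples, heap_tuples, fingerprint):
    """Toy model of the heapallindexed check.  Phase 1 walks the index and
    fingerprints every index tuple; phase 2 scans the heap and probes the
    filter with the index tuple each heap tuple should have produced."""
    filt = set()                  # stands in for the Bloom filter bitset
    for itup in index_tuples:     # phase 1: fingerprint during index walk
        filt.add(fingerprint(itup))
    for htup in heap_tuples:      # phase 2: probe with heap tuples
        if fingerprint(htup) not in filt:
            raise ValueError("heap tuple has no matching index entry")
```

The probing runs in the heap-to-index direction described here: every heap
tuple must hit the filter, while extra index entries go unnoticed by this
phase (the structural checks cover the index side).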

What I am not sure about is whether we can really examine an index which is
valid but whose indcheckxmin hasn't crossed our xmin horizon. Notice that
this amcheck might be running in a transaction block, probably in a
repeatable-read isolation level, and hence GetTransactionSnapshot() might
actually return a snapshot which can't yet read the index consistently. In
practice, this is quite unlikely, but I think we should check for that case
if we agree that it could be a problem.

The case with readonly mode is also interesting. Since we're scanning the
heap with SnapshotAny, the heapscan may return tuples which are
RECENTLY_DEAD. So the question is: can this happen?

- some concurrent index scan sees a heaptuple as DEAD and marks the index
  entry as LP_DEAD
- our index fingerprinting sees index tuple as LP_DEAD
- our heap scan still sees the heaptuple as RECENTLY_DEAD

Now that looks improbable given that we compute OldestXmin after our index
fingerprinting was done, i.e. between steps 2 and 3, and hence if a tuple looked
DEAD to some OldestXmin/RecentGlobalXmin computed before we computed our
OldestXmin, then surely our OldestXmin should also see the tuple DEAD. Or is
there a corner case that we're missing?

Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
tuples, especially if those tuples were inserted/deleted by our own
transaction? It's probably worth thinking about.

Apart from that, comments in IndexBuildHeapRangeScan() claim that the
function is called only with ShareLock and above, which is no longer true.
We should check if that has any side-effects. I can't think of any, but
better to verify and update the comments to reflect the new reality.

The partial indexes look fine, since the non-interesting tuples never reach
the callback.

One thing worth documenting/noting is the fact that a !readonly check will
run with a long-duration registered snapshot, thus holding OldestXmin back.
Is there anything we can do to lessen that burden, like telling other
backends to ignore our xmin while computing OldestXmin (like vacuum does)?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] A design for amcheck heapam verification

2018-03-26 Thread Peter Geoghegan
On Fri, Mar 23, 2018 at 7:13 AM, Andrey Borodin  wrote:
> I've just flipped patch to WoA. But if above issues will be fixed I think 
> that patch is ready for committer.

Attached is v7, which has the small tweaks that you suggested.

Thank you for the review. I hope that this can be committed shortly.

-- 
Peter Geoghegan
From ede1ba731dc818172a94adbb6331323c1f2b1170 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH v7 1/2] Add Bloom filter data structure implementation.

A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership.  Callers will sometimes incur false
positives, but never false negatives.  The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.

Two classic applications of Bloom filters are cache filtering, and data
synchronization testing.  Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.

This commit adds a test harness extension module, test_bloomfilter.  It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
---
 src/backend/lib/Makefile   |   4 +-
 src/backend/lib/README |   2 +
 src/backend/lib/bloomfilter.c  | 304 +
 src/include/lib/bloomfilter.h  |  27 ++
 src/test/modules/Makefile  |   1 +
 src/test/modules/test_bloomfilter/.gitignore   |   4 +
 src/test/modules/test_bloomfilter/Makefile |  21 ++
 src/test/modules/test_bloomfilter/README   |  71 +
 .../test_bloomfilter/expected/test_bloomfilter.out |  25 ++
 .../test_bloomfilter/sql/test_bloomfilter.sql  |  22 ++
 .../test_bloomfilter/test_bloomfilter--1.0.sql |  10 +
 .../modules/test_bloomfilter/test_bloomfilter.c| 138 ++
 .../test_bloomfilter/test_bloomfilter.control  |   4 +
 src/tools/pgindent/typedefs.list   |   1 +
 14 files changed, 632 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/lib/bloomfilter.c
 create mode 100644 src/include/lib/bloomfilter.h
 create mode 100644 src/test/modules/test_bloomfilter/.gitignore
 create mode 100644 src/test/modules/test_bloomfilter/Makefile
 create mode 100644 src/test/modules/test_bloomfilter/README
 create mode 100644 src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
 create mode 100644 src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
 create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
 create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.c
 create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index d1fefe43f2..191ea9bca2 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = binaryheap.o bipartite_match.o dshash.o hyperloglog.o ilist.o \
-	   knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
+   ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index 5e5ba5e437..376ae273a9 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -3,6 +3,8 @@ in the backend:
 
 binaryheap.c - a binary heap
 
+bloomfilter.c - probabilistic, space-efficient set membership testing
+
 hyperloglog.c - a streaming cardinality estimator
 
 pairingheap.c - a pairing heap
diff --git a/src/backend/lib/bloomfilter.c b/src/backend/lib/bloomfilter.c
new file mode 100644
index 00..dcf32c015c
--- /dev/null
+++ b/src/backend/lib/bloomfilter.c
@@ -0,0 +1,304 @@
+/*-
+ *
+ * bloomfilter.c
+ *		Space-efficient set membership testing
+ *
+ * A Bloom filter is a probabilistic data structure that is used to test an
+ * element's membership of a set.  False positives are possible, but false
+ * negatives are not; a test of membership of the set returns either "possibly
+ * in set" or "definitely not in set".  This can be very space efficient when
+ * individual elements are larger than a few bytes, because elements are hashed
+ * in order to set bits in the Bloom filter bitset.
+ *
+ * Elements can be added to the set, but not removed.  The more elements that
+ * are added, the larger the probability of false positives.  Caller must hint
+ * an estimated total size of the set when its Bloom filter is initialized.
+ * This is used to balance the use of memory against the final false positive
+ * rate.
+ *
+ * 

Re: [HACKERS] A design for amcheck heapam verification

2018-03-23 Thread Andrey Borodin
Hi!

> On 8 Feb. 2018, at 22:45, Peter Geoghegan wrote:
> 
> On Thu, Feb 8, 2018 at 6:05 AM, Andrey Borodin  wrote:
>> I do not see a reason behind hashing the seed.
> 
> It made some sense when I was XOR'ing it to mix. A uniform
> distribution of bits seemed desirable then, since random() won't use
> the most significant bit -- it generates random numbers in the range
> of 0 to 2^31-1. It does seem unnecessary now.
> 
>> Also, I'd like to reformulate this paragraph. I understand what you want to 
>> say, but the sentence is incorrect.
>> + * The Bloom filter behaves non-deterministically when caller passes a 
>> random
>> + * seed value.  This ensures that the same false positives will not occur 
>> from
>> + * one run to the next, which is useful to some callers.
>> The Bloom filter behaves deterministically, just differently for each seed. This 
>> does not ensure anything; it only gives the desired behavior with high probability.
> 
> I agree that that's unclear. I should probably cut it down, and say
> something like "caller can pass a random seed to make it unlikely that
> the same false positives will occur from one run to the next".

I've just flipped the patch to WoA. But if the above issues are fixed, I think 
the patch is ready for committer.

Best regards, Andrey Borodin.


Re: [HACKERS] A design for amcheck heapam verification

2018-02-08 Thread Peter Geoghegan
On Thu, Feb 8, 2018 at 6:05 AM, Andrey Borodin  wrote:
> I do not see a reason behind hashing the seed.

It made some sense when I was XOR'ing it to mix. A uniform
distribution of bits seemed desirable then, since random() won't use
the most significant bit -- it generates random numbers in the range
of 0 to 2^31-1. It does seem unnecessary now.

> Also, I'd like to reformulate this paragraph. I understand what you want to 
> say, but the sentence is incorrect.
> + * The Bloom filter behaves non-deterministically when caller passes a random
> + * seed value.  This ensures that the same false positives will not occur 
> from
> + * one run to the next, which is useful to some callers.
> The Bloom filter behaves deterministically, just differently for each seed. This 
> does not ensure anything; it only gives the desired behavior with high probability.

I agree that that's unclear. I should probably cut it down, and say
something like "caller can pass a random seed to make it unlikely that
the same false positives will occur from one run to the next".

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-02-08 Thread Andrey Borodin
Hi, Peter!

> On 8 Feb. 2018, at 4:56, Peter Geoghegan wrote:
> 
> * Faster modulo operations.
> 
> * Removed sdbmhash().

Thanks! I definitely like how the Bloom filter is implemented now.

I could not understand the meaning of this, but apparently it will do no harm:
+   /*
+* Caller will probably use signed 32-bit pseudo-random number, so hash
+* caller's value to get 64-bit seed value
+*/
+   filter->seed = DatumGetUInt64(hash_uint32_extended(seed, 0));
I do not see a reason behind hashing the seed.

Also, I'd like to reformulate this paragraph. I understand what you want to 
say, but the sentence is incorrect.
+ * The Bloom filter behaves non-deterministically when caller passes a random
+ * seed value.  This ensures that the same false positives will not occur from
+ * one run to the next, which is useful to some callers.
The Bloom filter behaves deterministically, just differently for each seed. This 
does not ensure anything; it only gives the desired behavior with high probability.

Thanks!

Best regards, Andrey Borodin.


Re: [HACKERS] A design for amcheck heapam verification

2018-02-07 Thread Peter Geoghegan
On Mon, Feb 5, 2018 at 12:55 PM, Peter Geoghegan  wrote:
> Anyway, parallel CREATE INDEX added a new "scan" argument to
> IndexBuildHeapScan(), which caused this patch to bitrot. At a minimum,
> an additional NULL argument should be passed by amcheck. However, I
> have a better idea.
>
> ISTM that verify_nbtree.c should manage the heap scan itself, in the
> style of parallel CREATE INDEX. It can acquire its own MVCC snapshot
> for bt_index_check() (which pretends to be a CREATE INDEX
> CONCURRENTLY). There can be an MVCC snapshot acquired per index
> verified, a snapshot that is under the direct control of amcheck. The
> snapshot would be acquired at the start of verification on an index
> (when "heapallindex = true"), before the verification of the index
> structure even begins, and released at the very end of verification.

Attached patch fixes the parallel index build bitrot in this way. This
is version 6 of the patch.

This approach resulted in a nice reduction in complexity:
bt_index_check() and bt_index_parent_check() heapallindexed
verification operations both work in exactly the same way now, except
that bt_index_check() imitates a CREATE INDEX CONCURRENTLY (to match
the heavyweight relation locks acquired). This doesn't really need to
be explained as a special case anymore; bt_index_parent_check() is
like an ordinary CREATE INDEX, without any additional "TransactionXmin
heap tuple xmin recheck" complication.

A further benefit is that this makes running bt_index_check() checks
against many indexes more thorough, and easier to reason about. Users
won't have to worry about TransactionXmin becoming very stale when
many indexes are verified within a single command.

I made the following additional, unrelated changes based on various feedback:

* Faster modulo operations.

Andrey Borodin suggested that I make k_hashes() do fewer modulo
operations. While I don't want to change the algorithm to make this
happen, the overhead has been reduced. Modulo operations are now
performed through bitwise AND operations, which is possible only
because the bitset size is always a power of two. Apparently this is a
fairly common optimization for Bloom filters that use (enhanced)
double-hashing; we might as well do it this way.

I've really just transcribed the enhanced double hashing pseudo-code
from the Georgia Tech/Dillinger & Manolios paper into C code, so no
real change there; bloomfilter.c's k_hashes() is still closely based
on "5.2 Enhanced Double Hashing" from that same paper. Experience
suggests that we ought to be very conservative about developing novel
hashing techniques. Paranoid, even.

* New reference to the modulo bias effect.

Michael Paquier wondered why the Bloom filter was always a
power-of-two, which this addresses. (Of course, the "modulo bitwise
AND" optimization I just mentioned is another reason to limit
ourselves to power-of-two bitset sizes, albeit a new one.)

* Removed sdbmhash().

Michael also wanted to know more about sdbmhash(), due to some general
concern about its quality. I realized that it is best to avoid adding
a new general-purpose hash function, whose sole purpose is to be
different to hash_any(), when I could instead use
hash_uint32_extended() to get two 32-bit values all at once. Robert
suggested this approach at one point, actually, but for some reason I
didn't follow up until now.

-- 
Peter Geoghegan
From 2ff9dcace49ea590762701717235d87e13b03c6b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.

A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership.  Callers will sometimes incur false
positives, but never false negatives.  The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.

Two classic applications of Bloom filters are cache filtering, and data
synchronization testing.  Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.

This commit adds a test harness extension module, test_bloomfilter.  It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
---
 src/backend/lib/Makefile   |   4 +-
 src/backend/lib/README |   2 +
 src/backend/lib/bloomfilter.c  | 303 +
 src/include/lib/bloomfilter.h  |  27 ++
 src/test/modules/Makefile  |   1 +
 src/test/modules/test_bloomfilter/.gitignore   |   4 +
 src/test/modules/test_bloomfilter/Makefile |  21 ++
 src/test/modules/test_bloomfilter/README   |  71 +
 .../test_bloomfilter/expected/test_bloomfilter.out |  25 ++
 .../test_bloomfilter/sql/test_bloomfilter.sql  |  22 ++
 

Re: [HACKERS] A design for amcheck heapam verification

2018-02-05 Thread Peter Geoghegan
On Mon, Jan 22, 2018 at 6:07 PM, Michael Paquier
 wrote:
> Yep. I have provided the feedback I wanted for 0001 (no API change in
> the bloom facility by the way :( ), but I still wanted to look at 0002
> in depth.

I don't see a point in adding complexity that no caller will actually
use. It *might* become useful in the future, in which case it's no
trouble at all to come up with an alternative initialization routine.

Anyway, parallel CREATE INDEX added a new "scan" argument to
IndexBuildHeapScan(), which caused this patch to bitrot. At a minimum,
an additional NULL argument should be passed by amcheck. However, I
have a better idea.

ISTM that verify_nbtree.c should manage the heap scan itself, in the
style of parallel CREATE INDEX. It can acquire its own MVCC snapshot
for bt_index_check() (which pretends to be a CREATE INDEX
CONCURRENTLY). There can be an MVCC snapshot acquired per index
verified, a snapshot that is under the direct control of amcheck. The
snapshot would be acquired at the start of verification on an index
(when "heapallindex = true"), before the verification of the index
structure even begins, and released at the very end of verification.
Advantages of this include:

1. It simplifies the code, and in particular lets us remove the use of
TransactionXmin. Comments already say that this TransactionXmin
business is a way of approximating our own MVCC snapshot acquisition
-- why not *actually do* the MVCC snapshot acquisition, now that
that's possible?

2. It makes the code for bt_index_check() and bt_index_parent_check()
essentially identical, except for varying
IndexBuildHeapScan()/indexInfo parameters to match what CREATE INDEX
itself does. The more we can outsource to IndexBuildHeapScan(), the
better.

3. It ensures that transactions that heapallindexed verify many
indexes in one go are at no real disadvantage to transactions that
heapallindexed verify a single index, since TransactionXmin going
stale won't impact verification (we won't have to skip Bloom filter
probes due to the uncertainty about what should be in the Bloom
filter).

4. It will make parallel verification easier in the future, which is
something that we ought to make happen. Parallel verification would
target a table with multiple indexes, and use a parallel heap scan. It
actually looks like making this work would be fairly easy. We'd only
need to copy code from nbtsort.c, and arrange for parallel workers to
verify an index each ahead of the heap scan. (There would be multiple
Bloom filters in shared memory, all of which parallel workers end up
probing.)

Thoughts?

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-01-22 Thread Michael Paquier
On Mon, Jan 22, 2018 at 03:22:05PM -0800, Peter Geoghegan wrote:
> Michael said he'd do more review. I generally feel this is close, though.

Yep. I have provided the feedback I wanted for 0001 (no API change in
the bloom facility by the way :( ), but I still wanted to look at 0002
in depth.
--
Michael




Re: [HACKERS] A design for amcheck heapam verification

2018-01-22 Thread Peter Geoghegan
Hi,

On Fri, Jan 12, 2018 at 1:41 AM, Andrey Borodin  wrote:
> I've looked into the code and here's my review.
>
> The new functionality works and is useful right now. I believe it should be 
> shipped in the Postgres box (currently, I install it with package managers).
> The code is nice and well documented.

Thanks.

> I'd be happy if I could pass an argument like ErrorLevel, which would help me 
> estimate the scale of corruption. Or any other way to list more than one 
> error. But, clearly, this functionality is not necessary for this patch.

My previous remarks apply here, too. I don't know how to rate severity
among error messages.

> Also, I couldn't find the check that ensures that there is a heap tuple 
> for every B-tree leaf tuple. Is it there?

No, there isn't, because that's inherently race-prone. This is not so
bad. If you don't end up discovering a problem the other way around
(going from the heap to the index/Bloom filter), then the only way
that an index tuple pointing to the wrong place can go undetected is
because there is a duplicate heap TID in the index, or the heap TID
doesn't exist in any form.

I actually prototyped a patch that makes bt_index_parent_check()
detect duplicate heap TIDs in an index, by collecting heap TIDs from
the index, sorting them, and scanning the sorted array. That seems
like material for another patch, though.

> I've found a few nitpicks in the Bloom filter:
> 1. The choice of sdbmhash does not seem like the best option from my point of view.

I don't mind changing the second hash function, but I want to
emphasize that this shouldn't be expected to make anything more than a
very small difference. The bloom filter probes are slow because the
memory accesses are random.

There are many hash functions that would be suitable here, and not too
many reasons to prefer one over the other. My choice was based on a
desire for simplicity, and for something that seemed to be in
widespread usage for a long time.

> +   bitset_bytes = Max(1024L * 1024L, bitset_bytes);
> bloom_work_mem was supposed to be the limit? Why do we not honor this limit?

The minimum maintenance_work_mem setting is 1MB anyway.

> 3.
> +   filter = palloc0(offsetof(bloom_filter, bitset) +
> +sizeof(unsigned char) * 
> bitset_bytes);
> sizeof(unsigned char) == 1 by the C standard

I know. That's just what I prefer to do as a matter of style.

> 4. The function my_bloom_power() returns a bit number, and then its result is 
> raised back with INT64CONST(1) << bloom_power. I'd stick with removing bits in 
> a loop, e.g. while(target_bitset_bits & (target_bitset_bits-1)) { 
> target_bitset_bits &= target_bitset_bits-1; } or something like that. Or, if we 
> use code like the sdbm hash, fall back to bithacks :) 
> https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2

my_bloom_power() is only called once per Bloom filter. So again, I
think that this is not very important. Thomas Munro talked about using
popcount() in another function, which is from the book Hacker's
Delight. While I'm sure that these techniques have their use, they
just don't seem to make sense when the alternative is simpler code
that is only going to be executed at most once per verification
operation anyway.

> for (i = 0; i < filter->k_hash_funcs; i++)
> {
> /* Accumulate hash value for caller */
> hashes[i] = (hasha + i * hashb + i) % filter->bitset_bits;
> }
> }
> It produces almost the same result (hashes 1..k-1 are incremented by 1), but 
> uses far fewer % operations. Potential overflow is handled by the uint64 type.

I prefer to be conservative, and to stick to what is described by
Dillinger & Manolios as much as possible.

> That's all I've found. I do not know whether the patch has had all the 
> necessary reviewer attention. Please feel free to change the status if you 
> think the patch is ready. From my point of view, the patch is in good shape.

Michael said he'd do more review. I generally feel this is close, though.

Thanks for the review
-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2018-01-11 Thread Andrey Borodin
Hello!

I like the heapam verification functionality and am using it right now. So I'm 
planning to provide a review for this patch, probably this week.

From my current use I have some thoughts on the interface. Here's what I get: 

# select bt_index_check('messagefiltervalue_group_id_59490523e6ee451f',true);
ERROR:  XX001: heap tuple (45,21) from table "messagefiltervalue" lacks 
matching index tuple within index "messagefiltervalue_group_id_59490523e6ee451f"
HINT:  Retrying verification using the function bt_index_parent_check() might 
provide a more specific error.
LOCATION:  bt_tuple_present_callback, verify_nbtree.c:1316
Time: 45.668 ms

# select bt_index_check('messagefiltervalue_group_id_59490523e6ee451f');
 bt_index_check 
----------------
 
(1 row)
Time: 32.873 ms

# select bt_index_parent_check('messagefiltervalue_group_id_59490523e6ee451f');
ERROR:  XX002: down-link lower bound invariant violated for index 
"messagefiltervalue_group_id_59490523e6ee451f"
DETAIL:  Parent block=6259 child index tid=(1747,2) parent page 
lsn=4A0/728F5DA8.
LOCATION:  bt_downlink_check, verify_nbtree.c:1188
Time: 391194.113 ms


It seems like the new check runs four orders of magnitude faster than 
bt_index_parent_check() and still finds the specific error that bt_index_check() 
missed. 
From this output I can see that there is corruption, but I cannot tell:
1. What the scale of the corruption is
2. Whether these corruptions are related or not

I think an interface to list all errors, or the top N, could be useful.

> On 14 Dec. 2017, at 0:02, Peter Geoghegan wrote:
>> 
>> This could also test the reproducibility of the tests with a fixed
>> seed number and at least two rounds, a low number of elements could be
>> more appropriate to limit the run time.
> 
> The runtime is already dominated by pg_regress overhead. As it says in
> the README, using a fixed seed in the test harness is pointless,
> because it won't behave in a fixed way across platforms. As long as we
> cannot ensure deterministic behavior, we may as well fully embrace
> non-determinism.
I think that determinism across platforms is not as important as determinism 
across runs.


Thanks for the amcheck! It is very useful.

Best regards, Andrey Borodin.


Re: [HACKERS] A design for amcheck heapam verification

2017-12-13 Thread Peter Geoghegan
On Mon, Dec 11, 2017 at 9:35 PM, Michael Paquier
 wrote:
> +   /*
> +* Generate random seed, or use caller's.  Seed should always be a 
> positive
> +* value less than or equal to PG_INT32_MAX, to ensure that any random 
> seed
> +* can be recreated through callerseed if the need arises.  (Don't assume
> +* that RAND_MAX cannot exceed PG_INT32_MAX.)
> +*/
> +   seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
> This could use pg_backend_random() instead.

I don't see the need for a cryptographically secure source of
randomness for any current Bloom filter caller, including the test
harness. There are many uses of random() just like this throughout the
backend, and far more random() calls than pg_backend_random() calls.
srandom() seeds random per backend, ensuring non-deterministic
behavior across backends.

> +--
> +-- These tests don't produce any interesting output, unless they fail.  For 
> an
> +-- explanation of the arguments, and the values used here, see README.
> +--
> +SELECT test_bloomfilter(power => 23,
> +nelements => 838861,
> +seed => -1,
> +tests => 1);
> This could also test the reproducibility of the tests with a fixed
> seed number and at least two rounds, a low number of elements could be
> more appropriate to limit the run time.

The runtime is already dominated by pg_regress overhead. As it says in
the README, using a fixed seed in the test harness is pointless,
because it won't behave in a fixed way across platforms. As long as we
cannot ensure deterministic behavior, we may as well fully embrace
non-determinism.

> I would vote for simplifying the initialization routine and just
> directly let the caller decide it. Are there implementation
> complications if this is not a power of two? I cannot guess one by
> looking at the code.

Yes, there are. It's easier to reason about the implementation with
these restrictions.

> So I would expect people using this API
> are smart enough to do proper sizing. Note that some callers may
> accept even a larger false positive rate. In short, this basically
> brings us back to Thomas' point upthread with a size estimation
> routine, and an extra routine doing the initialization:
> https://www.postgresql.org/message-id/CAEepm=0dy53x1hg5dmyzmpv_kn99crxzqbto8gmiosxnyrx...@mail.gmail.com
> Did you consider it? Splitting the size estimation and the actual
> initialization has a lot of benefits in my opinion.

Callers don't actually need to worry about these details. I think it's
the wrong call to complicate the interface to simplify the
implementation.

Thomas was not arguing for the caller being able to specify a false
positive rate to work backwards from -- that's not an unreasonable
thing, but it seems unlikely to be of use to amcheck, or parallel hash
join. Thomas was arguing for making the Bloom filter suitable for use
with DSM. I ended up incorporating most of his feedback. The only
things that were not added were a DSM-oriented routine for getting the
amount of memory required, and a separate DSM-oriented
routine for initializing the caller's shared memory buffer. That's likely
inconvenient for most callers, and could be restrictive if and when
Bloom filters use resources other than memory, such as temp files.

> +/*
> + * What proportion of bits are currently set?
> + *
> + * Returns proportion, expressed as a multiplier of filter size.
> + *
> [...]
> + */
> +double
> +bloom_prop_bits_set(bloom_filter *filter)
> Instead of that, a function that prints direct information about the
> filter's state (the real number of elements and the number of bits set
> in the filter) would be enough. I am not sure that using rates is a
> good idea; sometimes rounding can cause confusion.

The bloom filter doesn't track the number of elements itself. Callers
that care can do it themselves. bloom_prop_bits_set() isn't very
important, even compared to other types of instrumentation. It's
moderately useful as a generic indicator of whether or not the Bloom
filter was totally overwhelmed. That's about it.

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2017-12-11 Thread Michael Paquier
On Fri, Dec 8, 2017 at 4:37 AM, Peter Geoghegan  wrote:
> Here, we repeat the same test 3 times, varying only the seed value
> used for each run.

Thanks for the updated version! Looking at 0001, I have run a coverage
test and can see all code paths covered, which is nice. It is also
easier to check the correctness of the library. There are many ways to
shape up such tests, you chose one we could live with.

> The last message is a WARNING because we exceed the 1% threshold
> (hard-coded into test_bloomfilter.c), though only by a tiny margin,
> due only to random variations in seed value. We round up to 10 bits
> per element for the regression tests. That's where the *actual*
> "nelements" argument comes from within the tests, so pg_regress tests
> should never see the WARNING (if they do, that counts as a failure).
>
> I've experimentally observed that we get the 1% false positive rate
> with any possible bitset when "nelements" works out at 9.6 bitset bits
> per element. Inter-run variation is tiny. With 50 tests, I didn't
> observe these same Bloom filter parameters produce a false positive
> rate that came near to 1.1%. 1.01% or 1.02% was about as bad as it
> got.

Nice. That's close to what I would expect with the sizing this is doing.

> There is a fairly extensive README, which I hope will clear up the
> theory behind the bloomfilter.c strategy on bitset size and false
> positives. Also, there was a regression that I had to fix in
> bloomfilter.c, in seeding. It didn't reliably cause variation in the
> false positives. And, there was bitrot with the documentation that I
> fixed up.

+   /*
+* Generate random seed, or use caller's.  Seed should always be a positive
+* value less than or equal to PG_INT32_MAX, to ensure that any random seed
+* can be recreated through callerseed if the need arises.  (Don't assume
+* that RAND_MAX cannot exceed PG_INT32_MAX.)
+*/
+   seed = callerseed < 0 ? random() % PG_INT32_MAX : callerseed;
This could use pg_backend_random() instead.

+--
+-- These tests don't produce any interesting output, unless they fail.  For an
+-- explanation of the arguments, and the values used here, see README.
+--
+SELECT test_bloomfilter(power => 23,
+nelements => 838861,
+seed => -1,
+tests => 1);
This could also test the reproducibility of the tests with a fixed
seed number and at least two rounds; a low number of elements could be
more appropriate to limit the run time.

+   /*
+* Aim for two bytes per element; this is sufficient to get a false
+* positive rate below 1%, independent of the size of the bitset or total
+* number of elements.  Also, if rounding down the size of the bitset to
+* the next lowest power of two turns out to be a significant drop, the
+* false positive rate still won't exceed 2% in almost all cases.
+*/
+   bitset_bytes = Min(bloom_work_mem * 1024L, total_elems * 2);
+   /* Minimum allowable size is 1MB */
+   bitset_bytes = Max(1024L * 1024L, bitset_bytes);
I would vote for simplifying the initialization routine and just
directly let the caller decide it. Are there implementation
complications if this is not a power of two? I cannot guess one by
looking at the code. I think that the key point is just to document
that a false positive rate of 1% can be reached with just
9.6 bits per element, and that this rate can be reduced by a factor of
10 by adding 4.8 bits per element. So I would expect people using this
API to be smart enough to do proper sizing. Note that some callers may
accept even a larger false positive rate. In short, this basically
brings us back to Thomas' point upthread with a size estimation
routine, and an extra routine doing the initialization:
https://www.postgresql.org/message-id/CAEepm=0dy53x1hg5dmyzmpv_kn99crxzqbto8gmiosxnyrx...@mail.gmail.com
Did you consider it? Splitting the size estimation and the actual
initialization has a lot of benefits in my opinion.

+/*
+ * What proportion of bits are currently set?
+ *
+ * Returns proportion, expressed as a multiplier of filter size.
+ *
[...]
+ */
+double
+bloom_prop_bits_set(bloom_filter *filter)
Instead of that, a function that prints direct information about the
filter's state (the real number of elements and the number of bits set
in the filter) would be enough. I am not sure that using rates is a
good idea; sometimes rounding can cause confusion.
-- 
Michael



Re: [HACKERS] A design for amcheck heapam verification

2017-12-07 Thread Peter Geoghegan
On Tue, Nov 28, 2017 at 9:54 PM, Peter Geoghegan  wrote:
> On Tue, Nov 28, 2017 at 9:50 PM, Michael Paquier
>  wrote:
>>> Would that address your concern? There would be an SQL interface, but
>>> it would be trivial.
>>
>> That's exactly what I think you should do, and mentioned so upthread.
>> A SQL interface can also show a good example of how developers can use
>> this API.

The attached revision, v5, adds a new test harness -- test_bloomfilter.

This can be used to experimentally verify that the implementation meets
the well-known "1% false positive rate with 9.6 bits per element"
standard. It manages to do exactly that:

postgres=# set client_min_messages = 'debug1';
SET
postgres=# SELECT test_bloomfilter(power => 23, nelements => 873813,
seed => -1, tests => 3);
DEBUG:  beginning test #1...
DEBUG:  bloom_work_mem (KB): 1024
DEBUG:  false positives: 8630 (rate: 0.009876, proportion bits set:
0.517625, seed: 1373191603)
DEBUG:  beginning test #2...
DEBUG:  bloom_work_mem (KB): 1024
DEBUG:  false positives: 8623 (rate: 0.009868, proportion bits set:
0.517623, seed: 406665822)
DEBUG:  beginning test #3...
DEBUG:  bloom_work_mem (KB): 1024
WARNING:  false positives: 8840 (rate: 0.010117, proportion bits set:
0.517748, seed: 398116374)
 test_bloomfilter 
------------------
 
(1 row)

Here, we repeat the same test 3 times, varying only the seed value
used for each run.

The last message is a WARNING because we exceed the 1% threshold
(hard-coded into test_bloomfilter.c), though only by a tiny margin,
due only to random variations in seed value. We round up to 10 bits
per element for the regression tests. That's where the *actual*
"nelements" argument comes from within the tests, so pg_regress tests
should never see the WARNING (if they do, that counts as a failure).

I've experimentally observed that we get the 1% false positive rate
with any possible bitset when "nelements" works out at 9.6 bitset bits
per element. Inter-run variation is tiny. With 50 tests, I didn't
observe these same Bloom filter parameters produce a false positive
rate that came near to 1.1%. 1.01% or 1.02% was about as bad as it
got.

There is a fairly extensive README, which I hope will clear up the
theory behind the bloomfilter.c strategy on bitset size and false
positives. Also, there was a regression that I had to fix in
bloomfilter.c, in seeding. It didn't reliably cause variation in the
false positives. And, there was bitrot with the documentation that I
fixed up.

-- 
Peter Geoghegan
From 47f2c6cd398244f11c3490b644cd225beac9ae31 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.

A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership.  Callers will sometimes incur false
positives, but never false negatives.  The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.

Two classic applications of Bloom filters are cache filtering, and data
synchronization testing.  Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.

This commit adds a test harness extension module, test_bloomfilter.  It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
---
 src/backend/lib/Makefile   |   4 +-
 src/backend/lib/README |   2 +
 src/backend/lib/bloomfilter.c  | 313 +
 src/include/lib/bloomfilter.h  |  27 ++
 src/test/modules/Makefile  |   1 +
 src/test/modules/test_bloomfilter/.gitignore   |   4 +
 src/test/modules/test_bloomfilter/Makefile |  21 ++
 src/test/modules/test_bloomfilter/README   |  72 +
 .../test_bloomfilter/expected/test_bloomfilter.out |  25 ++
 .../test_bloomfilter/sql/test_bloomfilter.sql  |  22 ++
 .../test_bloomfilter/test_bloomfilter--1.0.sql |  10 +
 .../modules/test_bloomfilter/test_bloomfilter.c| 138 +
 .../test_bloomfilter/test_bloomfilter.control  |   4 +
 13 files changed, 641 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/lib/bloomfilter.c
 create mode 100644 src/include/lib/bloomfilter.h
 create mode 100644 src/test/modules/test_bloomfilter/.gitignore
 create mode 100644 src/test/modules/test_bloomfilter/Makefile
 create mode 100644 src/test/modules/test_bloomfilter/README
 create mode 100644 src/test/modules/test_bloomfilter/expected/test_bloomfilter.out
 create mode 100644 src/test/modules/test_bloomfilter/sql/test_bloomfilter.sql
 create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter--1.0.sql
 create mode 100644 src/test/modules/test_bloomfilter/test_bloomfilter.c
 create mode 100644 

Re: [HACKERS] A design for amcheck heapam verification

2017-11-28 Thread Michael Paquier
On Wed, Nov 29, 2017 at 2:54 PM, Peter Geoghegan  wrote:
> My understanding of your earlier remarks, rightly or wrongly, was that
> you wanted me to adopt the Bloom filter to actually be usable from SQL
> in some kind of general way. As opposed to what I just said -- adding
> a stub SQL interface that simply invokes the test harness, with all
> the heavy lifting taking place in C code.
>
> Obviously these are two very different things. I'm quite happy to add
> the test harness.

Quote from this email:
https://www.postgresql.org/message-id/CAB7nPqSUKppzvNSHY1OM_TdSj0UE18xNFCrOwPC3E8svq7Mb_Q%40mail.gmail.com

> One first thing striking me is that there is no test for this
> implementation, which would be a base stone for other things, it would
> be nice to validate that things are working properly before moving on
> with 0002, and 0001 is a feature on its own. I don't think that it
> would be complicated to have a small module in src/test/modules which
> plugs in a couple of SQL functions on top of bloomfilter.h.

My apologies if this sounded like having a set of SQL functions in
core, I meant a test suite from the beginning with an extension
creating the interface or such.
-- 
Michael



Re: [HACKERS] A design for amcheck heapam verification

2017-11-28 Thread Peter Geoghegan
On Tue, Nov 28, 2017 at 9:50 PM, Michael Paquier
 wrote:
>> Would that address your concern? There would be an SQL interface, but
>> it would be trivial.
>
> That's exactly what I think you should do, and mentioned so upthread.
> A SQL interface can also show a good example of how developers can use
> this API.

My understanding of your earlier remarks, rightly or wrongly, was that
you wanted me to adopt the Bloom filter to actually be usable from SQL
in some kind of general way. As opposed to what I just said -- adding
a stub SQL interface that simply invokes the test harness, with all
the heavy lifting taking place in C code.

Obviously these are two very different things. I'm quite happy to add
the test harness.

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2017-11-28 Thread Peter Geoghegan
On Tue, Nov 28, 2017 at 9:38 PM, Michael Paquier
 wrote:
> My apologies for slacking here. I would still welcome some regression
> tests to stress the bloom API you are proposing! For now I am moving
> this patch to next CF.

I still don't think that regression tests as such make sense. However,
it seems like it might be a good idea to add a test harness for the
Bloom filter code. I actually wrote code like this for myself during
development, which could be cleaned up. The harness can live in
src/test/modules/test_bloom_filter. We already do this for the
red-black tree library code, for example, and it seems like good
practice.

Would that address your concern? There would be an SQL interface, but
it would be trivial.

-- 
Peter Geoghegan



Re: [HACKERS] A design for amcheck heapam verification

2017-11-28 Thread Michael Paquier
On Sat, Oct 21, 2017 at 9:34 AM, Peter Geoghegan  wrote:
> I should point out that I shipped virtually the same code yesterday,
> as v1.1 of the Github version of amcheck (also known as amcheck_next).
> Early adopters will be able to use this new "heapallindexed"
> functionality in the next few days, once packages become available for
> the apt and yum community repos. Just as before, the Github version
> will work on versions of Postgres >= 9.4.
>
> This seems like good timing on my part, because we know that this new
> "heapallindexed" verification will detect the "freeze the dead" bugs
> that the next point release is set to have fixes for -- that is
> actually kind of how one of the bugs was found [1]. We may even want
> to advertise the availability of this check within amcheck_next, in the
> release notes for the next Postgres point release.

My apologies for slacking here. I would still welcome some regression
tests to stress the bloom API you are proposing! For now I am moving
this patch to next CF.
-- 
Michael