Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-10-02 Thread Claudio Freire
On Sun, Oct 1, 2017 at 8:36 PM, Daniel Gustafsson  wrote:
>> On 18 Aug 2017, at 13:39, Claudio Freire  wrote:
>>
>> On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire  
>> wrote:
>>> Indeed they do, and that's what motivated this patch. But I'd need
>>> TB-sized tables to set up something like that. I don't have the
>>> hardware or time available to do that (vacuum on bloated TB-sized
>>> tables can take days in my experience). Scale 4000 is as big as I can
>>> get without running out of space for the tests in my test hardware.
>>>
>>> If anybody else has the ability, I'd be thankful if they did test it
>>> under those conditions, but I cannot. I think Anastasia's test is
>>> closer to such a test, that's probably why it shows a bigger
>>> improvement in total elapsed time.
>>>
>>> Our production database could possibly be used, but it can take about
>>> a week to clone it, upgrade it (it's 9.5 currently), and run the
>>> relevant vacuum.
>>
>> It looks like I won't be able to do that test with a production
>> snapshot anytime soon.
>>
>> Getting approval for the budget required to do that looks like it's
>> going to take far longer than I thought.
>>
>> Regardless of that, I think the patch can move forward. I'm still
>> planning to do the test at some point, but this patch shouldn't block
>> on it.
>
> This patch has been marked Ready for committer after review, but wasn’t
> committed in the current commitfest so it will be moved to the next.  Since it
> no longer applies cleanly, it’s being reset to Waiting for author though.
>
> cheers ./daniel

Rebased version of the patches attached
From 8d4ea3bc7b67bc535ac26109a65932044e4a0ab7 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 2 Oct 2017 10:56:40 -0300
Subject: [PATCH 1/2] [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 404 ---
 src/test/regress/expected/vacuum.out |  26 +++
 src/test/regress/sql/vacuum.sql  |  19 ++
 3 files changed, 375 insertions(+), 74 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 30b1c08c6c..ae1e9afbf3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -92,6 +105,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -103,6 +124,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-10-01 Thread Daniel Gustafsson
> On 18 Aug 2017, at 13:39, Claudio Freire  wrote:
> 
> On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire  
> wrote:
>> Indeed they do, and that's what motivated this patch. But I'd need
>> TB-sized tables to set up something like that. I don't have the
>> hardware or time available to do that (vacuum on bloated TB-sized
>> tables can take days in my experience). Scale 4000 is as big as I can
>> get without running out of space for the tests in my test hardware.
>> 
>> If anybody else has the ability, I'd be thankful if they did test it
>> under those conditions, but I cannot. I think Anastasia's test is
>> closer to such a test, that's probably why it shows a bigger
>> improvement in total elapsed time.
>> 
>> Our production database could possibly be used, but it can take about
>> a week to clone it, upgrade it (it's 9.5 currently), and run the
>> relevant vacuum.
> 
> It looks like I won't be able to do that test with a production
> snapshot anytime soon.
> 
> Getting approval for the budget required to do that looks like it's
> going to take far longer than I thought.
> 
> Regardless of that, I think the patch can move forward. I'm still
> planning to do the test at some point, but this patch shouldn't block
> on it.

This patch has been marked Ready for committer after review, but wasn’t
committed in the current commitfest so it will be moved to the next.  Since it
no longer applies cleanly, it’s being reset to Waiting for author though.

cheers ./daniel



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-08-18 Thread Claudio Freire
On Fri, Apr 7, 2017 at 10:51 PM, Claudio Freire  wrote:
> Indeed they do, and that's what motivated this patch. But I'd need
> TB-sized tables to set up something like that. I don't have the
> hardware or time available to do that (vacuum on bloated TB-sized
> tables can take days in my experience). Scale 4000 is as big as I can
> get without running out of space for the tests in my test hardware.
>
> If anybody else has the ability, I'd be thankful if they did test it
> under those conditions, but I cannot. I think Anastasia's test is
> closer to such a test, that's probably why it shows a bigger
> improvement in total elapsed time.
>
> Our production database could possibly be used, but it can take about
> a week to clone it, upgrade it (it's 9.5 currently), and run the
> relevant vacuum.

It looks like I won't be able to do that test with a production
snapshot anytime soon.

Getting approval for the budget required to do that looks like it's
going to take far longer than I thought.

Regardless of that, I think the patch can move forward. I'm still
planning to do the test at some point, but this patch shouldn't block
on it.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-27 Thread Masahiko Sawada
On Fri, Apr 21, 2017 at 6:24 AM, Claudio Freire  wrote:
> On Wed, Apr 12, 2017 at 4:35 PM, Robert Haas  wrote:
>> On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire  
>> wrote:
>>> In essence, the patch as it is proposed, doesn't *need* a binary
>>> search, because the segment list can only grow up to 15 segments at
>>> its biggest, and that's a size small enough that linear search will
>>> outperform (or at least perform as well as) binary search. Reducing
>>> the initial segment size wouldn't change that. If the 12GB limit is
>>> lifted, or the maximum segment size reduced (from 1GB to 128MB for
>>> example), however, that would change.
>>>
>>> I'd be more in favor of lifting the 12GB limit than of reducing the
>>> maximum segment size, for the reasons above. Raising the 12GB limit
>>> has concrete and readily apparent benefits, whereas using bigger (or
>>> smaller) segments is far more debatable. Yes, that will need a binary
>>> search. But, I was hoping that could be a second (or third) patch, to
>>> keep things simple, and benefits measurable.
>>
>> To me, it seems a bit short-sighted to say, OK, let's use a linear
>> search because there's this 12GB limit so we can limit ourselves to 15
>> segments.  Because somebody will want to remove that 12GB limit, and
>> then we'll have to revisit the whole thing anyway.  I think, anyway.
>
> Ok, attached an updated patch that implements the binary search
>
>> What's not clear to me is how sensitive the performance of vacuum is
>> to the number of cycles used here.  For a large index, the number of
>> searches will presumably be quite large, so it does seem worth
>> worrying about performance.  But if we just always used a binary
>> search, would that lose enough performance with small numbers of
>> segments that anyone would care?  If so, maybe we need to use linear
>> search for small numbers of segments and switch to binary search with
>> larger numbers of segments.
>
> I just went and tested.
>
> I implemented the hybrid binary search attached, and ran a few tests
> with and without the sequential code enabled, at small scales.
>
> The difference is statistically significant, but small (less than 3%).
> With proper optimization of the binary search, however, the difference
> flips:
>
> claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80
> fullbinary.s100.times
> vacuum_bench_s100.1.shufp80.log:CPU: user: 6.20 s, system: 1.42 s,
> elapsed: 18.34 s.
> vacuum_bench_s100.2.shufp80.log:CPU: user: 6.44 s, system: 1.40 s,
> elapsed: 19.75 s.
> vacuum_bench_s100.3.shufp80.log:CPU: user: 6.28 s, system: 1.41 s,
> elapsed: 18.48 s.
> vacuum_bench_s100.4.shufp80.log:CPU: user: 6.39 s, system: 1.51 s,
> elapsed: 20.60 s.
> vacuum_bench_s100.5.shufp80.log:CPU: user: 6.26 s, system: 1.42 s,
> elapsed: 19.16 s.
>
> claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80
> hybridbinary.s100.times
> vacuum_bench_s100.1.shufp80.log:CPU: user: 6.49 s, system: 1.39 s,
> elapsed: 19.15 s.
> vacuum_bench_s100.2.shufp80.log:CPU: user: 6.36 s, system: 1.33 s,
> elapsed: 18.40 s.
> vacuum_bench_s100.3.shufp80.log:CPU: user: 6.36 s, system: 1.31 s,
> elapsed: 18.87 s.
> vacuum_bench_s100.4.shufp80.log:CPU: user: 6.59 s, system: 1.35 s,
> elapsed: 26.43 s.
> vacuum_bench_s100.5.shufp80.log:CPU: user: 6.54 s, system: 1.28 s,
> elapsed: 20.02 s.
>
> That's after inlining the compare on both the linear and sequential
> code, and it seems it lets the compiler optimize the binary search to
> the point where it outperforms the sequential search.
>
> That's not the case when the compare isn't inlined.
>
> That seems in line with [1], that show the impact of various
> optimizations on both algorithms. It's clearly a close enough race
> that optimizations play a huge role.
>
> Since we're not likely to go and implement SSE2-optimized versions, I
> believe I'll leave the binary search only. That's the attached patch
> set.
>
> I'm running the full test suite, but that takes a very long while.
> I'll post the results when they're done.
>
> [1] https://schani.wordpress.com/2010/04/30/linear-vs-binary-search/

Thank you for updating the patch.

I've read this patch again and here are some review comments.

+ * Lookup in that structure proceeds sequentially in the list of segments,
+ * and with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity (2 log N to be
+ * precise).

IIUC we now do binary search even over the list of segments.
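
For reference, a minimal sketch of that two-level lookup (binary search over
segments by their last stored TID, then binary search within the segment).
This is only an illustration using the patch's field names; the helper and
its exact code are not taken from the patch:

static bool
dead_tuple_lookup(LVRelStats *vacrelstats, ItemPointer itemptr)
{
    DeadTuplesSegment *segs = vacrelstats->dead_tuples.dt_segments;
    int         lo = 0;
    int         hi = vacrelstats->dead_tuples.last_seg;

    /* First level: find the first segment whose last stored TID >= itemptr */
    while (lo < hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (ItemPointerCompare(itemptr, &segs[mid].last_dead_tuple) > 0)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Second level: binary search within that segment's sorted TID array */
    return bsearch(itemptr, segs[lo].dt_tids, segs[lo].num_dead_tuples,
                   sizeof(ItemPointerData), vac_cmp_itemptr) != NULL;
}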

-

We often fetch a particular dead tuple segment. How about providing a
macro for easier understanding?
For example,

 #define GetDeadTuplesSegment(lvrelstats, seg) \
  (&(lvrelstats)->dead_tuples.dt_segments[(seg)])
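
Call sites could then read something like this (illustrative only):

    /* fetch the segment currently being filled */
    DeadTuplesSegment *seg = GetDeadTuplesSegment(vacrelstats,
                                                  vacrelstats->dead_tuples.last_seg);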

-

+   if (vacrelstats->dead_tuples.num_segs == 0)
+   return;
+

+   /* If uninitialized, we have no tuples to delete from the indexes */
+   if 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-25 Thread Robert Haas
On Mon, Apr 24, 2017 at 3:57 PM, Claudio Freire  wrote:
> I wouldn't fret over the slight slowdown vs the old patch, it could be
> noise (the script only completed a single run at scale 400).

Yeah, seems fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-24 Thread Claudio Freire
On Sun, Apr 23, 2017 at 12:41 PM, Robert Haas  wrote:
>> That's after inlining the compare on both the linear and sequential
>> code, and it seems it lets the compiler optimize the binary search to
>> the point where it outperforms the sequential search.
>>
>> That's not the case when the compare isn't inlined.
>>
>> That seems in line with [1], that show the impact of various
>> optimizations on both algorithms. It's clearly a close enough race
>> that optimizations play a huge role.
>>
>> Since we're not likely to go and implement SSE2-optimized versions, I
>> believe I'll leave the binary search only. That's the attached patch
>> set.
>
> That sounds reasonable based on your test results.  I guess part of
> what I was wondering is whether a vacuum on a table large enough to
> require multiple gigabytes of work_mem isn't likely to be I/O-bound
> anyway.  If so, a few cycles one way or the other isn't likely
> to matter much.  If not, where exactly are all of those CPU cycles
> going?

I haven't been able to produce a table large enough to get a CPU-bound
vacuum; such a case would likely require huge storage and a very
powerful I/O system. Mine can only get about 100MB/s tops, and at that
speed, vacuum is I/O bound even for multi-GB work_mem. That's why I've
been using the reported CPU time as the benchmark.

BTW, I left the benchmark script running all weekend at the office,
and when I got back a power outage had aborted it. In a few days I'll
be out on vacation, so I'm not sure I'll get the benchmark results
anytime soon. But since this patch moved to 11.0, I guess there's no rush.

Just FTR, in case I leave before the script is done, the script got to
scale 400 before the outage:

INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 4000 row versions
DETAIL:  CPU: user: 5.94 s, system: 1.26 s, elapsed: 26.77 s.
INFO:  "pgbench_accounts": removed 4000 row versions in 655739 pages
DETAIL:  CPU: user: 3.36 s, system: 2.57 s, elapsed: 61.67 s.
INFO:  index "pgbench_accounts_pkey" now contains 0 row versions in 109679 pages
DETAIL:  4000 index row versions were removed.
109289 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.06 s.
INFO:  "pgbench_accounts": found 38925546 removable, 0 nonremovable
row versions in 655738 out of 655738 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 1098
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 15.34 s, system: 6.95 s, elapsed: 126.21 s.
INFO:  "pgbench_accounts": truncated 655738 to 0 pages
DETAIL:  CPU: user: 0.22 s, system: 2.10 s, elapsed: 8.10 s.

In summary:

binsrch v10:

s100: CPU: user: 3.02 s, system: 1.51 s, elapsed: 16.43 s.
s400: CPU: user: 15.34 s, system: 6.95 s, elapsed: 126.21 s.

The old results:

Old Patched (sequential search):

s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.

Unpatched:

s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.

I wouldn't fret over the slight slowdown vs the old patch; it could be
noise (the script only completed a single run at scale 400).




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-23 Thread Robert Haas
On Thu, Apr 20, 2017 at 5:24 PM, Claudio Freire  wrote:
>> What's not clear to me is how sensitive the performance of vacuum is
>> to the number of cycles used here.  For a large index, the number of
>> searches will presumably be quite large, so it does seem worth
>> worrying about performance.  But if we just always used a binary
>> search, would that lose enough performance with small numbers of
>> segments that anyone would care?  If so, maybe we need to use linear
>> search for small numbers of segments and switch to binary search with
>> larger numbers of segments.
>
> I just went and tested.

Thanks!

> That's after inlining the compare on both the linear and sequential
> code, and it seems it lets the compiler optimize the binary search to
> the point where it outperforms the sequential search.
>
> That's not the case when the compare isn't inlined.
>
> That seems in line with [1], that show the impact of various
> optimizations on both algorithms. It's clearly a close enough race
> that optimizations play a huge role.
>
> Since we're not likely to go and implement SSE2-optimized versions, I
> believe I'll leave the binary search only. That's the attached patch
> set.

That sounds reasonable based on your test results.  I guess part of
what I was wondering is whether a vacuum on a table large enough to
require multiple gigabytes of work_mem isn't likely to be I/O-bound
anyway.  If so, a few cycles one way or the other isn't likely
to matter much.  If not, where exactly are all of those CPU cycles
going?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-20 Thread Claudio Freire
On Wed, Apr 12, 2017 at 4:35 PM, Robert Haas  wrote:
> On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire  
> wrote:
>> In essence, the patch as it is proposed, doesn't *need* a binary
>> search, because the segment list can only grow up to 15 segments at
>> its biggest, and that's a size small enough that linear search will
>> outperform (or at least perform as well as) binary search. Reducing
>> the initial segment size wouldn't change that. If the 12GB limit is
>> lifted, or the maximum segment size reduced (from 1GB to 128MB for
>> example), however, that would change.
>>
>> I'd be more in favor of lifting the 12GB limit than of reducing the
>> maximum segment size, for the reasons above. Raising the 12GB limit
>> has concrete and readily apparent benefits, whereas using bigger (or
>> smaller) segments is far more debatable. Yes, that will need a binary
>> search. But, I was hoping that could be a second (or third) patch, to
>> keep things simple, and benefits measurable.
>
> To me, it seems a bit short-sighted to say, OK, let's use a linear
> search because there's this 12GB limit so we can limit ourselves to 15
> segments.  Because somebody will want to remove that 12GB limit, and
> then we'll have to revisit the whole thing anyway.  I think, anyway.

Ok, attached an updated patch that implements the binary search

> What's not clear to me is how sensitive the performance of vacuum is
> to the number of cycles used here.  For a large index, the number of
> searches will presumably be quite large, so it does seem worth
> worrying about performance.  But if we just always used a binary
> search, would that lose enough performance with small numbers of
> segments that anyone would care?  If so, maybe we need to use linear
> search for small numbers of segments and switch to binary search with
> larger numbers of segments.

I just went and tested.

I implemented the hybrid binary search attached, and ran a few tests
with and without the sequential code enabled, at small scales.

The difference is statistically significant, but small (less than 3%).
With proper optimization of the binary search, however, the difference
flips:

claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80
fullbinary.s100.times
vacuum_bench_s100.1.shufp80.log:CPU: user: 6.20 s, system: 1.42 s,
elapsed: 18.34 s.
vacuum_bench_s100.2.shufp80.log:CPU: user: 6.44 s, system: 1.40 s,
elapsed: 19.75 s.
vacuum_bench_s100.3.shufp80.log:CPU: user: 6.28 s, system: 1.41 s,
elapsed: 18.48 s.
vacuum_bench_s100.4.shufp80.log:CPU: user: 6.39 s, system: 1.51 s,
elapsed: 20.60 s.
vacuum_bench_s100.5.shufp80.log:CPU: user: 6.26 s, system: 1.42 s,
elapsed: 19.16 s.

claudiofreire@klaumpp:~/src/postgresql.vacuum> fgrep shufp80
hybridbinary.s100.times
vacuum_bench_s100.1.shufp80.log:CPU: user: 6.49 s, system: 1.39 s,
elapsed: 19.15 s.
vacuum_bench_s100.2.shufp80.log:CPU: user: 6.36 s, system: 1.33 s,
elapsed: 18.40 s.
vacuum_bench_s100.3.shufp80.log:CPU: user: 6.36 s, system: 1.31 s,
elapsed: 18.87 s.
vacuum_bench_s100.4.shufp80.log:CPU: user: 6.59 s, system: 1.35 s,
elapsed: 26.43 s.
vacuum_bench_s100.5.shufp80.log:CPU: user: 6.54 s, system: 1.28 s,
elapsed: 20.02 s.

That's after inlining the compare on both the binary and sequential
code, and it seems that lets the compiler optimize the binary search to
the point where it outperforms the sequential search.

That's not the case when the compare isn't inlined.

That seems in line with [1], which shows the impact of various
optimizations on both algorithms. It's clearly a close enough race
that optimizations play a huge role.

Since we're not likely to go and implement SSE2-optimized versions, I
believe I'll leave the binary search only. That's the attached patch
set.
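
For the record, the hybrid variant benchmarked above was essentially of this
shape - a rough sketch, not the attached code; the threshold and helper name
are illustrative:

#define SEGMENT_LINEAR_THRESHOLD 8      /* illustrative cutoff */

static int
find_segment(DeadTuplesSegment *segs, int nsegs, ItemPointer itemptr)
{
    if (nsegs <= SEGMENT_LINEAR_THRESHOLD)
    {
        /* Few segments: plain linear scan of the segment list */
        int         i;

        for (i = 0; i < nsegs; i++)
            if (ItemPointerCompare(itemptr, &segs[i].last_dead_tuple) <= 0)
                break;
        return Min(i, nsegs - 1);
    }
    else
    {
        /* Many segments: binary search on each segment's last stored TID */
        int         lo = 0,
                    hi = nsegs - 1;

        while (lo < hi)
        {
            int         mid = lo + (hi - lo) / 2;

            if (ItemPointerCompare(itemptr, &segs[mid].last_dead_tuple) > 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }
}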

I'm running the full test suite, but that takes a very long while.
I'll post the results when they're done.

[1] https://schani.wordpress.com/2010/04/30/linear-vs-binary-search/
From 709428f67960dd20eaf34846a0b579766e0701f6 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 408 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 +
 3 files changed, 350 insertions(+), 76 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5b43a66..ddf19d7 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -12,11 +12,25 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-12 Thread Robert Haas
On Tue, Apr 11, 2017 at 4:38 PM, Claudio Freire  wrote:
> In essence, the patch as it is proposed, doesn't *need* a binary
> search, because the segment list can only grow up to 15 segments at
> its biggest, and that's a size small enough that linear search will
> outperform (or at least perform as well as) binary search. Reducing
> the initial segment size wouldn't change that. If the 12GB limit is
> lifted, or the maximum segment size reduced (from 1GB to 128MB for
> example), however, that would change.
>
> I'd be more in favor of lifting the 12GB limit than of reducing the
> maximum segment size, for the reasons above. Raising the 12GB limit
> has concrete and readily apparent benefits, whereas using bigger (or
> smaller) segments is far more debatable. Yes, that will need a binary
> search. But, I was hoping that could be a second (or third) patch, to
> keep things simple, and benefits measurable.

To me, it seems a bit short-sighted to say, OK, let's use a linear
search because there's this 12GB limit so we can limit ourselves to 15
segments.  Because somebody will want to remove that 12GB limit, and
then we'll have to revisit the whole thing anyway.  I think, anyway.

What's not clear to me is how sensitive the performance of vacuum is
to the number of cycles used here.  For a large index, the number of
searches will presumably be quite large, so it does seem worth
worrying about performance.  But if we just always used a binary
search, would that lose enough performance with small numbers of
segments that anyone would care?  If so, maybe we need to use linear
search for small numbers of segments and switch to binary search with
larger numbers of segments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-11 Thread Claudio Freire
On Tue, Apr 11, 2017 at 4:17 PM, Robert Haas  wrote:
> On Tue, Apr 11, 2017 at 2:59 PM, Claudio Freire  
> wrote:
>> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas  wrote:
>>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
>>> maintenance_work_mem
>>>
>>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
>>> enough so we'll allocate another 8GB, for a total of 16256MB, but more
>>> than three-quarters of that last allocation ends up being wasted.
>>> I've been told on this list before that doubling is the one true way
>>> of increasing the size of an allocated chunk of memory, but I'm still
>>> a bit unconvinced.
>>
>> There you're wrong. The allocation is capped to 1GB, so wastage has an
>> upper bound of 1GB.
>
> Ah, OK.  Sorry, didn't really look at the code.  I stand corrected,
> but then it seems a bit strange to me that the largest and smallest
> allocations are only 8x different. I still don't really understand
> what that buys us.

Basically, attacking the problem (that, I think, you mentioned) of
very small systems in which overallocation for small vacuums was an
issue.

The "slow start" behavior of starting with smaller segments tries to
improve the situation for small vacuums, not big ones.

By starting at 128M and growing up to 1GB, overallocation is bounded to
the range 128M-1GB and is proportional to the amount of dead tuples,
not table size, as it was before. Starting at 128M helps the initial
segment search, but I could readily go for starting at 64M, I don't
think it would make a huge difference. Removing exponential growth,
however, would.

As the patch stands, small systems (say 32-bit systems) without
overcommit and with slowly-changing data can now set high m_w_m
without running into overallocation issues with autovacuum reserving
too much virtual space, as it will reserve memory only proportional to
the amount of dead tuples. Previously, it would reserve all of m_w_m
regardless of whether it was needed or not, with the only exception
being really small tables, so m_w_m=1GB was unworkable in those cases.
Now it should be fine.

> What would we lose if we just made 'em all 128MB?

TBH, not that much. We'd need 8x the compares to find the segment, which
forces a switch to binary search of the segments, and that is less
cache-friendly. So it's more complex code and less cache locality. I'm
just not sure what the benefit is given current limits.

The only aim of this multiarray approach was making *virtual address
space reservations* proportional to the amount of actual memory
needed, as opposed to configured limits. It doesn't need to be a tight
fit, because calling palloc on its own doesn't actually use that
memory, at least on big allocations like these - the OS will not map
the memory pages until they're first touched. That's true in most
modern systems, and many ancient ones too.

In essence, the patch as it is proposed, doesn't *need* a binary
search, because the segment list can only grow up to 15 segments at
its biggest, and that's a size small enough that linear search will
outperform (or at least perform as well as) binary search. Reducing
the initial segment size wouldn't change that. If the 12GB limit is
lifted, or the maximum segment size reduced (from 1GB to 128MB for
example), however, that would change.

I'd be more in favor of lifting the 12GB limit than of reducing the
maximum segment size, for the reasons above. Raising the 12GB limit
has concrete and readily apparent benefits, whereas using bigger (or
smaller) segments is far more debatable. Yes, that will need a binary
search. But, I was hoping that could be a second (or third) patch, to
keep things simple, and benefits measurable.

Also, the plan, as discussed in this very long thread, was to
eventually try to turn segments into bitmaps if dead tuple density was
high enough. That benefits considerably from big segments, since lookup
on a bitmap is O(1) - the bigger the segments, the faster the lookup,
as the search on the segment list would be dominant.
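
To illustrate the idea (a sketch only, nothing like this is in the posted
patches): a bitmap segment covering a contiguous block range could store one
bit per possible item pointer, making the in-segment check a constant-time
bit test:

static bool
bitmap_seg_lookup(const uint8 *bits, BlockNumber start_block,
                  ItemPointer itemptr)
{
    /* One bit per possible TID in blocks [start_block, start_block + n) */
    BlockNumber blk = ItemPointerGetBlockNumber(itemptr);
    OffsetNumber off = ItemPointerGetOffsetNumber(itemptr);
    uint64      bitno = (uint64) (blk - start_block) * MaxHeapTuplesPerPage +
                        (off - 1);

    return (bits[bitno / 8] & (1 << (bitno % 8))) != 0;
}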

So... what shall we do?

At this point, I've given all my arguments for the current design. If
the more senior developers don't agree, I'll be happy to try your way.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-11 Thread Robert Haas
On Tue, Apr 11, 2017 at 2:59 PM, Claudio Freire  wrote:
> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas  wrote:
>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
>> maintenance_work_mem
>>
>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
>> enough so we'll allocate another 8GB, for a total of 16256MB, but more
>> than three-quarters of that last allocation ends up being wasted.
>> I've been told on this list before that doubling is the one true way
>> of increasing the size of an allocated chunk of memory, but I'm still
>> a bit unconvinced.
>
> There you're wrong. The allocation is capped to 1GB, so wastage has an
> upper bound of 1GB.

Ah, OK.  Sorry, didn't really look at the code.  I stand corrected,
but then it seems a bit strange to me that the largest and smallest
allocations are only 8x different.  I still don't really understand
what that buys us.  What would we lose if we just made 'em all 128MB?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-11 Thread Claudio Freire
On Tue, Apr 11, 2017 at 3:59 PM, Claudio Freire  wrote:
> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas  wrote:
>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
>> maintenance_work_mem
>>
>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
>> enough so we'll allocate another 8GB, for a total of 16256MB, but more
>> than three-quarters of that last allocation ends up being wasted.
>> I've been told on this list before that doubling is the one true way
>> of increasing the size of an allocated chunk of memory, but I'm still
>> a bit unconvinced.
>
> There you're wrong. The allocation is capped to 1GB, so wastage has an
> upper bound of 1GB.

And total m_w_m for vacuum is still capped to 12GB (as big as you can get
with 32-bit integer indices).

So you can get at most 15 segments (a binary search is thus not worth
it), and overallocate by at most 1GB (the maximum segment size).

At least that's my rationale.
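
Spelling out that arithmetic (just an illustrative standalone sketch, not
patch code):

#include <stdio.h>

int
main(void)
{
    long        seg_mb = 128;       /* initial segment size */
    long        used_mb = 0;
    int         nsegs = 0;

    while (used_mb < 12 * 1024L)    /* stop once the 12GB cap is reached */
    {
        used_mb += seg_mb;          /* allocate the next segment */
        nsegs++;
        if (seg_mb < 1024)
            seg_mb *= 2;            /* sizes double, capped at 1GB */
    }
    /* prints: 15 segments, 13184 MB reserved (last 1GB segment overshoots) */
    printf("%d segments, %ld MB reserved\n", nsegs, used_mb);
    return 0;
}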

Removing the 12GB limit requires a bit of care (there are some 32-bit
counters still around I believe).




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-11 Thread Claudio Freire
On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas  wrote:
> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
> maintenance_work_mem
>
> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
> enough so we'll allocate another 8GB, for a total of 16256MB, but more
> than three-quarters of that last allocation ends up being wasted.
> I've been told on this list before that doubling is the one true way
> of increasing the size of an allocated chunk of memory, but I'm still
> a bit unconvinced.

There you're wrong. The allocation is capped to 1GB, so wastage has an
upper bound of 1GB.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-11 Thread Robert Haas
On Fri, Apr 7, 2017 at 9:12 PM, Andres Freund  wrote:
>> Why do you say exponential growth fragments memory? AFAIK, all those
>> allocations are well beyond the point where malloc starts mmaping
>> memory, so each of those segments should be a mmap segment,
>> independently freeable.
>
> Not all platforms have that, and even on platforms with it, frequent,
> unevenly sized, very large allocations can lead to enough fragmentation
> that further allocations are harder and fragment / enlarge the
> pagetable.

Such a thing is completely outside my personal experience.  I've never
heard of a case where a 64-bit platform fails to allocate memory
because something (what?) is fragmented.  Page table memory usage is a
concern at some level, but probably less so for autovacuum workers
than for most backends, because autovacuum workers (where most
vacuuming is done) exit after one pass through pg_class.  Although I
think our memory footprint is a topic that could use more energy, I
don't really see any reason to think that pagetable bloat caused by
unevenly sized allocations in short-lived processes is the place to
start worrying.

That having been said, IIRC, I did propose quite a ways upthread that
we use a fixed chunk size, just because it would use less actual
memory, never mind the size of the page table.  I mean, if you
allocate in chunks of 64MB, which I think is what I proposed, you'll
never waste more than 64MB.  If you allocate in
exponentially-increasing chunk sizes starting at 128MB, you could
easily waste much more.  Let's imagine a 1TB table where 20% of the
tuples are dead due to some large bulk operation (a bulk load failed,
or a bulk delete succeeded, or a bulk update happened).  Back of the
envelope calculation:

1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
maintenance_work_mem

So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
enough so we'll allocate another 8GB, for a total of 16256MB, but more
than three-quarters of that last allocation ends up being wasted.
I've been told on this list before that doubling is the one true way
of increasing the size of an allocated chunk of memory, but I'm still
a bit unconvinced.

On the other hand, if we did allocate fixed chunks of, say, 64MB, we
could end up with an awful lot of them.  For example, in the example
above, 9216MB/64MB = 144 chunks.  Is that number of mappings going to
make the VM system unhappy on any of the platforms we care about?  Is
that a bigger or smaller problem than what you (Andres) are worrying
about?  I don't know.
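
Spelling that comparison out (a standalone sketch of the scenario above -
note that, as pointed out in the replies, the actual patch caps segments at
1GB, which changes the doubling numbers):

#include <stdio.h>

int
main(void)
{
    long        need_mb = 9216;     /* dead-tuple memory from the example */
    long        chunk_mb = 128;
    long        total_mb = 0;
    int         nchunks = 0;

    /* Doubling with no cap, as assumed in this example */
    while (total_mb < need_mb)
    {
        total_mb += chunk_mb;
        nchunks++;
        chunk_mb *= 2;
    }
    /* prints: doubling: 7 chunks, 16256 MB allocated, 7040 MB wasted */
    printf("doubling: %d chunks, %ld MB allocated, %ld MB wasted\n",
           nchunks, total_mb, total_mb - need_mb);

    /* prints: fixed 64MB: 144 chunks */
    printf("fixed 64MB: %ld chunks\n", (need_mb + 63) / 64);
    return 0;
}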

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-08 Thread David Steele
On 4/7/17 10:19 PM, Claudio Freire wrote:
> 
> I rebased the early free patch (patch 3) to apply on top of the v9
> patch 2 (it needed some changes). I recognize the early free patch
> didn't get nearly as much scrutiny, so I'm fine with committing only 2
> if that one's ready to go but 3 isn't.
> 
> If it's decided to go for fixed 128M segments and a binary search of
> segments, I don't think I can get that ready and tested before the
> commitfest ends.

This submission has been moved to CF 2017-07.

-- 
-David
da...@pgmasters.net




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 10:06 PM, Claudio Freire  wrote:
>>> >> + if (seg->num_dead_tuples >= seg->max_dead_tuples)
>>> >> + {
>>> >> + /*
>>> >> +  * The segment is overflowing, so we must allocate 
>>> >> a new segment.
>>> >> +  * We could have a preallocated segment descriptor 
>>> >> already, in
>>> >> +  * which case we just reinitialize it, or we may 
>>> >> need to repalloc
>>> >> +  * the vacrelstats->dead_tuples array. In that 
>>> >> case, seg will no
>>> >> +  * longer be valid, so we must be careful about 
>>> >> that. In any case,
>>> >> +  * we must update the last_dead_tuple copy in the 
>>> >> overflowing
>>> >> +  * segment descriptor.
>>> >> +  */
>>> >> + Assert(seg->num_dead_tuples == 
>>> >> seg->max_dead_tuples);
>>> >> + seg->last_dead_tuple = 
>>> >> seg->dt_tids[seg->num_dead_tuples - 1];
>>> >> + if (vacrelstats->dead_tuples.last_seg + 1 >= 
>>> >> vacrelstats->dead_tuples.num_segs)
>>> >> + {
>>> >> + int new_num_segs = 
>>> >> vacrelstats->dead_tuples.num_segs * 2;
>>> >> +
>>> >> + vacrelstats->dead_tuples.dt_segments = 
>>> >> (DeadTuplesSegment *) repalloc(
>>> >> +(void *) 
>>> >> vacrelstats->dead_tuples.dt_segments,
>>> >> +
>>> >> new_num_segs * sizeof(DeadTuplesSegment));
>>> >
>>> > Might be worth breaking this into some sub-statements, it's quite hard
>>> > to read.
>>>
>>> Breaking what precisely? The comment?
>>
>> No, the three-line statement computing the new value of
>> dead_tuples.dt_segments.  I'd at least assign dead_tuples to a local
>> variable, to cut the length of the statement down.
>
> Ah, alright. Will try to do that.

Attached is an updated patch set with the requested changes.

Segment allocation still follows the exponential strategy, and segment
lookup is still linear.

I rebased the early free patch (patch 3) to apply on top of the v9
patch 2 (it needed some changes). I recognize the early free patch
didn't get nearly as much scrutiny, so I'm fine with committing only 2
if that one's ready to go but 3 isn't.

If it's decided to go for fixed 128M segments and a binary search of
segments, I don't think I can get that ready and tested before the
commitfest ends.
From 9b8f90f19d558a7e6a32cb253d89819f7c300598 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 346 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 +
 3 files changed, 299 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..4f0cf1b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -12,11 +12,25 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure proceeds sequentially in the list of segments,
+ * and with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity (2 log N to be
+ * precise).
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 10:12 PM, Andres Freund  wrote:
> On 2017-04-07 22:06:13 -0300, Claudio Freire wrote:
>> On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund  wrote:
>> > Hi,
>> >
>> >
>> > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
>> >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund  wrote:
>> >> > Hi,
>> >> >
>> >> > I've *not* read the history of this thread.  So I really might be
>> >> > missing some context.
>> >> >
>> >> >
>> >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> >> >> From: Claudio Freire 
>> >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>> >> >>
>> >> >> Turn the dead_tuples array into a structure composed of several
>> >> >> exponentially bigger arrays, to enable usage of more than 1GB
>> >> >> of work mem during vacuum and thus reduce the number of full
>> >> >> index scans necessary to remove all dead tids when the memory is
>> >> >> available.
>> >> >
>> >> >>   * We are willing to use at most maintenance_work_mem (or perhaps
>> >> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> >> >> - * initially allocate an array of TIDs of that size, with an upper 
>> >> >> limit that
>> >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit 
>> >> >> that
>> >> >>   * depends on table size (this limit ensures we don't allocate a huge 
>> >> >> area
>> >> >> - * uselessly for vacuuming small tables).  If the array threatens to 
>> >> >> overflow,
>> >> >> - * we suspend the heap scan phase and perform a pass of index cleanup 
>> >> >> and page
>> >> >> - * compaction, then resume the heap scan with an empty TID array.
>> >> >> + * uselessly for vacuuming small tables). Additional arrays of 
>> >> >> increasingly
>> >> >> + * large sizes are allocated as they become necessary.
>> >> >> + *
>> >> >> + * The TID array is thus represented as a list of multiple segments of
>> >> >> + * varying size, beginning with the initial size of up to 128MB, and 
>> >> >> growing
>> >> >> + * exponentially until the whole budget of 
>> >> >> (autovacuum_)maintenance_work_mem
>> >> >> + * is used up.
>> >> >
>> >> > When the chunk size is 128MB, I'm a bit unconvinced that using
>> >> > exponential growth is worth it. The allocator overhead can't be
>> >> > meaningful in comparison to collecting 128MB dead tuples, the potential
>> >> > waste is pretty big, and it increases memory fragmentation.
>> >>
>> >> The exponential strategy is mainly to improve lookup time (ie: to
>> >> avoid large segment lists).
>> >
>> > Well, if we were to do binary search on the segment list, that'd not be
>> > necessary.
>>
>> True, but the initial lookup might be slower in the end, since the
>> array would be bigger and cache locality worse.
>>
>> Why do you say exponential growth fragments memory? AFAIK, all those
>> allocations are well beyond the point where malloc starts mmaping
>> memory, so each of those segments should be a mmap segment,
>> independently freeable.
>
> Not all platforms have that, and even on platforms with it, frequent,
> unevenly sized, very large allocations can lead to enough fragmentation
> that further allocations are harder and fragment / enlarge the
> pagetable.

I wouldn't call this frequent. You can get at most slightly more than
a dozen such allocations given the current limits.
And allocation sizes are quite regular - you get 128M or multiples of
128M, so each free block can be reused for N smaller allocations if
needed. I don't think it has much potential to fragment memory.

This isn't significantly different from tuplesort or any other code
that can do big allocations, and the differences favor less
fragmentation than those, so I don't see why this would need special
treatment.

My point being that it hasn't been simple to get to a point where this
beats the original single binary search even in CPU time.
If we're to scrap this implementation and go for a double binary
search, I'd like to have a clear, measurable benefit to chase by doing
so. Fragmentation is hard to measure, and I cannot get CPU-bound
vacuums on the test hardware I have with which to test lookup
performance at big scales.

>> >> Yes, the benchmarks are upthread. The earlier runs were run on my
>> >> laptop and made little sense, so I'd ignore them as inaccurate. The
>> >> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
>> >> time (ie: faster) of about 20%. Anastasia did another one[2] and saw
>> >> improvements as well, roughly 30%, though it's not measuring CPU time
>> >> but rather elapsed time.
>> >
>> > I'd be more concerned about cases that'd already fit into memory, not ones
>> > where we avoid doing another scan - and I think you mostly measured that?
>> >
>> > - Andres
>>
>> Well, scale 400 is pretty much as big as you can get with the old 1GB
>> limit, and also suffered no 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Andres Freund
On 2017-04-07 22:06:13 -0300, Claudio Freire wrote:
> On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund  wrote:
> > Hi,
> >
> >
> > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
> >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund  wrote:
> >> > Hi,
> >> >
> >> > I've *not* read the history of this thread.  So I really might be
> >> > missing some context.
> >> >
> >> >
> >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
> >> >> From: Claudio Freire 
> >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
> >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
> >> >>
> >> >> Turn the dead_tuples array into a structure composed of several
> >> >> exponentially bigger arrays, to enable usage of more than 1GB
> >> >> of work mem during vacuum and thus reduce the number of full
> >> >> index scans necessary to remove all dead tids when the memory is
> >> >> available.
> >> >
> >> >>   * We are willing to use at most maintenance_work_mem (or perhaps
> >> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
> >> >> - * initially allocate an array of TIDs of that size, with an upper 
> >> >> limit that
> >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
> >> >>   * depends on table size (this limit ensures we don't allocate a huge 
> >> >> area
> >> >> - * uselessly for vacuuming small tables).  If the array threatens to 
> >> >> overflow,
> >> >> - * we suspend the heap scan phase and perform a pass of index cleanup 
> >> >> and page
> >> >> - * compaction, then resume the heap scan with an empty TID array.
> >> >> + * uselessly for vacuuming small tables). Additional arrays of 
> >> >> increasingly
> >> >> + * large sizes are allocated as they become necessary.
> >> >> + *
> >> >> + * The TID array is thus represented as a list of multiple segments of
> >> >> + * varying size, beginning with the initial size of up to 128MB, and 
> >> >> growing
> >> >> + * exponentially until the whole budget of 
> >> >> (autovacuum_)maintenance_work_mem
> >> >> + * is used up.
> >> >
> >> > When the chunk size is 128MB, I'm a bit unconvinced that using
> >> > exponential growth is worth it. The allocator overhead can't be
> >> > meaningful in comparison to collecting 128MB dead tuples, the potential
> >> > waste is pretty big, and it increases memory fragmentation.
> >>
> >> The exponential strategy is mainly to improve lookup time (ie: to
> >> avoid large segment lists).
> >
> > Well, if we were to do binary search on the segment list, that'd not be
> > necessary.
> 
> True, but the initial lookup might be slower in the end, since the
> array would be bigger and cache locality worse.
> 
> Why do you say exponential growth fragments memory? AFAIK, all those
> allocations are well beyond the point where malloc starts mmaping
> memory, so each of those segments should be a mmap segment,
> independently freeable.

Not all platforms have that, and even on platforms with it, frequent,
unevenly sized, very large allocations can lead to enough fragmentation
that further allocations are harder and fragment / enlarge the
pagetable.


> >> Yes, the benchmarks are upthread. The earlier runs were run on my
> >> laptop and made little sense, so I'd ignore them as inaccurate. The
> >> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
> >> time (ie: faster) of about 20%. Anastasia did another one[2] and saw
> >> improvements as well, roughly 30%, though it's not measuring CPU time
> >> but rather elapsed time.
> >
> > I'd be more concerned about cases that'd already fit into memory, not ones
> > where we avoid doing another scan - and I think you mostly measured that?
> >
> > - Andres
> 
> Well, scale 400 is pretty much as big as you can get with the old 1GB
> limit, and also suffered no significant regression. Although, true, it
> didn't significantly improve either.

Aren't more interesting cases those where not that many dead tuples are
found, but the indexes are pretty large?  IIRC the index vacuum scans
still visit every leaf index tuple, no?

Greetings,

Andres Freund




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund  wrote:
> Hi,
>
>
> On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
>> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund  wrote:
>> > Hi,
>> >
>> > I've *not* read the history of this thread.  So I really might be
>> > missing some context.
>> >
>> >
>> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> >> From: Claudio Freire 
>> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>> >>
>> >> Turn the dead_tuples array into a structure composed of several
>> >> exponentially bigger arrays, to enable usage of more than 1GB
>> >> of work mem during vacuum and thus reduce the number of full
>> >> index scans necessary to remove all dead tids when the memory is
>> >> available.
>> >
>> >>   * We are willing to use at most maintenance_work_mem (or perhaps
>> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> >> - * initially allocate an array of TIDs of that size, with an upper limit 
>> >> that
>> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>> >>   * depends on table size (this limit ensures we don't allocate a huge 
>> >> area
>> >> - * uselessly for vacuuming small tables).  If the array threatens to 
>> >> overflow,
>> >> - * we suspend the heap scan phase and perform a pass of index cleanup 
>> >> and page
>> >> - * compaction, then resume the heap scan with an empty TID array.
>> >> + * uselessly for vacuuming small tables). Additional arrays of 
>> >> increasingly
>> >> + * large sizes are allocated as they become necessary.
>> >> + *
>> >> + * The TID array is thus represented as a list of multiple segments of
>> >> + * varying size, beginning with the initial size of up to 128MB, and 
>> >> growing
>> >> + * exponentially until the whole budget of 
>> >> (autovacuum_)maintenance_work_mem
>> >> + * is used up.
>> >
>> > When the chunk size is 128MB, I'm a bit unconvinced that using
>> > exponential growth is worth it. The allocator overhead can't be
>> > meaningful in comparison to collecting 128MB dead tuples, the potential
>> > waste is pretty big, and it increases memory fragmentation.
>>
>> The exponential strategy is mainly to improve lookup time (ie: to
>> avoid large segment lists).
>
> Well, if we were to do binary search on the segment list, that'd not be
> necessary.

True, but the initial lookup might be slower in the end, since the
array would be bigger and cache locality worse.

Why do you say exponential growth fragments memory? AFAIK, all those
allocations are well beyond the point where malloc starts mmaping
memory, so each of those segments should be a mmap segment,
independently freeable.

>> >> + if (seg->num_dead_tuples >= seg->max_dead_tuples)
>> >> + {
>> >> + /*
>> >> +  * The segment is overflowing, so we must allocate 
>> >> a new segment.
>> >> +  * We could have a preallocated segment descriptor 
>> >> already, in
>> >> +  * which case we just reinitialize it, or we may 
>> >> need to repalloc
>> >> +  * the vacrelstats->dead_tuples array. In that 
>> >> case, seg will no
>> >> +  * longer be valid, so we must be careful about 
>> >> that. In any case,
>> >> +  * we must update the last_dead_tuple copy in the 
>> >> overflowing
>> >> +  * segment descriptor.
>> >> +  */
>> >> + Assert(seg->num_dead_tuples == 
>> >> seg->max_dead_tuples);
>> >> + seg->last_dead_tuple = 
>> >> seg->dt_tids[seg->num_dead_tuples - 1];
>> >> + if (vacrelstats->dead_tuples.last_seg + 1 >= 
>> >> vacrelstats->dead_tuples.num_segs)
>> >> + {
>> >> + int new_num_segs = 
>> >> vacrelstats->dead_tuples.num_segs * 2;
>> >> +
>> >> + vacrelstats->dead_tuples.dt_segments = 
>> >> (DeadTuplesSegment *) repalloc(
>> >> +(void *) 
>> >> vacrelstats->dead_tuples.dt_segments,
>> >> +
>> >> new_num_segs * sizeof(DeadTuplesSegment));
>> >
>> > Might be worth breaking this into some sub-statements, it's quite hard
>> > to read.
>>
>> Breaking what precisely? The comment?
>
> No, the three-line statement computing the new value of
> dead_tuples.dt_segments.  I'd at least assign dead_tuples to a local
> variable, to cut the length of the statement down.

Ah, alright. Will try to do that.

>> >> +/*
>> >>   *   lazy_tid_reaped() -- is a particular tid deletable?
>> >>   *
>> >>   *   This has the right signature to be an 
>> >> IndexBulkDeleteCallback.
>> >>   *
>> >> - *   Assumes 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Andres Freund
Hi,


On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund  wrote:
> > Hi,
> >
> > I've *not* read the history of this thread.  So I really might be
> > missing some context.
> >
> >
> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
> >> From: Claudio Freire 
> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
> >>
> >> Turn the dead_tuples array into a structure composed of several
> >> exponentially bigger arrays, to enable usage of more than 1GB
> >> of work mem during vacuum and thus reduce the number of full
> >> index scans necessary to remove all dead tids when the memory is
> >> available.
> >
> >>   * We are willing to use at most maintenance_work_mem (or perhaps
> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
> >> - * initially allocate an array of TIDs of that size, with an upper limit 
> >> that
> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
> >>   * depends on table size (this limit ensures we don't allocate a huge area
> >> - * uselessly for vacuuming small tables).  If the array threatens to 
> >> overflow,
> >> - * we suspend the heap scan phase and perform a pass of index cleanup and 
> >> page
> >> - * compaction, then resume the heap scan with an empty TID array.
> >> + * uselessly for vacuuming small tables). Additional arrays of 
> >> increasingly
> >> + * large sizes are allocated as they become necessary.
> >> + *
> >> + * The TID array is thus represented as a list of multiple segments of
> >> + * varying size, beginning with the initial size of up to 128MB, and 
> >> growing
> >> + * exponentially until the whole budget of 
> >> (autovacuum_)maintenance_work_mem
> >> + * is used up.
> >
> > When the chunk size is 128MB, I'm a bit unconvinced that using
> > exponential growth is worth it. The allocator overhead can't be
> > meaningful in comparison to collecting 128MB dead tuples, the potential
> > waste is pretty big, and it increases memory fragmentation.
> 
> The exponential strategy is mainly to improve lookup time (ie: to
> avoid large segment lists).

Well, if we were to do binary search on the segment list, that'd not be
necessary.

> >> + if (seg->num_dead_tuples >= seg->max_dead_tuples)
> >> + {
> >> + /*
> >> +  * The segment is overflowing, so we must allocate a 
> >> new segment.
> >> +  * We could have a preallocated segment descriptor 
> >> already, in
> >> +  * which case we just reinitialize it, or we may 
> >> need to repalloc
> >> +  * the vacrelstats->dead_tuples array. In that case, 
> >> seg will no
> >> +  * longer be valid, so we must be careful about 
> >> that. In any case,
> >> +  * we must update the last_dead_tuple copy in the 
> >> overflowing
> >> +  * segment descriptor.
> >> +  */
> >> + Assert(seg->num_dead_tuples == seg->max_dead_tuples);
> >> + seg->last_dead_tuple = 
> >> seg->dt_tids[seg->num_dead_tuples - 1];
> >> + if (vacrelstats->dead_tuples.last_seg + 1 >= 
> >> vacrelstats->dead_tuples.num_segs)
> >> + {
> >> + int new_num_segs = 
> >> vacrelstats->dead_tuples.num_segs * 2;
> >> +
> >> + vacrelstats->dead_tuples.dt_segments = 
> >> (DeadTuplesSegment *) repalloc(
> >> +(void *) 
> >> vacrelstats->dead_tuples.dt_segments,
> >> +
> >> new_num_segs * sizeof(DeadTuplesSegment));
> >
> > Might be worth breaking this into some sub-statements, it's quite hard
> > to read.
> 
> Breaking what precisely? The comment?

No, the three-line statement computing the new value of
dead_tuples.dt_segments.  I'd at least assign dead_tuples to a local
variable, to cut the length of the statement down.


> >> + while (vacrelstats->dead_tuples.num_segs < 
> >> new_num_segs)
> >> + {
> >> + /* Initialize as "unallocated" */
> >> + DeadTuplesSegment *nseg = 
> >> &(vacrelstats->dead_tuples.dt_segments[
> >> +  
> >> vacrelstats->dead_tuples.num_segs]);
> >
> > dito.
> 
> I don't really get what you're asking here.

Trying to simplify/shorten the statement.


> >> +/*
> >>   *   lazy_tid_reaped() -- is a particular tid deletable?
> >>   *
> >>   *   This has the right signature to be an 
> >> IndexBulkDeleteCallback.
> >>   *
> >> - *   Assumes 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund  wrote:
> Hi,
>
> I've *not* read the history of this thread.  So I really might be
> missing some context.
>
>
>> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> From: Claudio Freire 
>> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>>
>> Turn the dead_tuples array into a structure composed of several
>> exponentially bigger arrays, to enable usage of more than 1GB
>> of work mem during vacuum and thus reduce the number of full
>> index scans necessary to remove all dead tids when the memory is
>> available.
>
>>   * We are willing to use at most maintenance_work_mem (or perhaps
>>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> - * initially allocate an array of TIDs of that size, with an upper limit 
>> that
>> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>>   * depends on table size (this limit ensures we don't allocate a huge area
>> - * uselessly for vacuuming small tables).  If the array threatens to 
>> overflow,
>> - * we suspend the heap scan phase and perform a pass of index cleanup and 
>> page
>> - * compaction, then resume the heap scan with an empty TID array.
>> + * uselessly for vacuuming small tables). Additional arrays of increasingly
>> + * large sizes are allocated as they become necessary.
>> + *
>> + * The TID array is thus represented as a list of multiple segments of
>> + * varying size, beginning with the initial size of up to 128MB, and growing
>> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
>> + * is used up.
>
> When the chunk size is 128MB, I'm a bit unconvinced that using
> exponential growth is worth it. The allocator overhead can't be
> meaningful in comparison to collecting 128MB dead tuples, the potential
> waste is pretty big, and it increases memory fragmentation.

The exponential strategy is mainly to improve lookup time (ie: to
avoid large segment lists).

>> + * Lookup in that structure proceeds sequentially in the list of segments,
>> + * and with a binary search within each segment. Since segment's size grows
>> + * exponentially, this retains O(N log N) lookup complexity.
>
> N log N is a horrible lookup complexity.  That's the complexity of
> *sorting* an entire array.  I think you might be trying to argue that
> it's log(N) * log(N)? Once log(n) for the exponentially growing size of
> segments, one for the binary search?
>
> Afaics you could quite easily make it O(2 log(N)) by simply also doing
> binary search over the segments.  Might not be worth it due to the small
> constant involved normally.

It's a typo, yes, I meant O(log N) (which is equivalent to O(2 log N))

>> + * If the array threatens to overflow, we suspend the heap scan phase and
>> + * perform a pass of index cleanup and page compaction, then resume the heap
>> + * scan with an array of logically empty but already preallocated TID 
>> segments
>> + * to be refilled with more dead tuple TIDs.
>
> Hm, it's not really the array that overflows, it's m_w_m that'd be
> exceeded, right?

Yes, will rephrase. Although that's how the original comment expressed
the same concept.

>>  /*
>> + * Minimum (starting) size of the dead_tuples array segments. Will allocate
>> + * space for 128MB worth of tid pointers in the first segment, further 
>> segments
>> + * will grow in size exponentially. Don't make it too small or the segment 
>> list
>> + * will grow bigger than the sweetspot for search efficiency on big vacuums.
>> + */
>> +#define LAZY_MIN_TUPLES  Max(MaxHeapTuplesPerPage, (128<<20) / 
>> sizeof(ItemPointerData))
>
> That's not really the minimum, no? s/MIN/INIT/?

Ok

>> +typedef struct DeadTuplesSegment
>> +{
>> + int num_dead_tuples;/* # of entries in the 
>> segment */
>> + int max_dead_tuples;/* # of entries 
>> allocated in the segment */
>> + ItemPointerData last_dead_tuple;/* Copy of the last dead tuple 
>> (unset
>> +
>>   * until the segment is fully
>> +
>>   * populated) */
>> + unsigned short padding;
>> + ItemPointer dt_tids;/* Array of dead tuples */
>> +}DeadTuplesSegment;
>
> Whenever padding is needed, it should have an explanatory comment.  It's
> certainly not obvious to me why it's needed here.

Ok

>> @@ -1598,6 +1657,11 @@ lazy_vacuum_index(Relation indrel,
>>   ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples;
>>   ivinfo.strategy = vac_strategy;
>>
>> + /* Finalize the current segment by setting its upper bound dead tuple 
>> */
>> + seg = DeadTuplesCurrentSegment(vacrelstats);
>> + if (seg->num_dead_tuples > 0)
>> + 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Claudio Freire
On Fri, Apr 7, 2017 at 7:43 PM, Claudio Freire  wrote:
>>> + * Lookup in that structure proceeds sequentially in the list of segments,
>>> + * and with a binary search within each segment. Since segment's size grows
>>> + * exponentially, this retains O(N log N) lookup complexity.
>>
>> N log N is a horrible lookup complexity.  That's the complexity of
>> *sorting* an entire array.  I think you might be trying to argue that
>> it's log(N) * log(N)? Once log(n) for the exponentially growing size of
>> segments, one for the binary search?
>>
>> Afaics you could quite easily make it O(2 log(N)) by simply also doing
>> binary search over the segments.  Might not be worth it due to the small
>> constant involved normally.
>
> It's a typo, yes, I meant O(log N) (which is equivalent to O(2 log N))


To clarify, lookup over the segments is linear, so it's O(M) with M
the number of segments, then the binary search is O(log N) with N the
number of dead tuples.

So lookup is O(M + log N), but M < log N because of the segments'
exponential growth, therefore the lookup is O(2 log N).
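
For concreteness, a sketch of such a two-level lookup (reusing the struct
and field names from the patch; the real lazy_tid_reaped differs in detail,
e.g. it also rules out candidates by each segment's lower bound):

    static bool
    lazy_tid_reaped_sketch(ItemPointer itemptr, LVRelStats *vacrelstats)
    {
        DeadTuplesMultiArray *dt = &vacrelstats->dead_tuples;
        int         i;

        /*
         * Linear scan over the segments; switching this to a binary search
         * over last_dead_tuple is what would give the O(2 log N) bound.
         */
        for (i = 0; i <= dt->last_seg; i++)
        {
            DeadTuplesSegment *seg = &dt->dt_segments[i];

            /*
             * The in-progress last segment may not have last_dead_tuple set
             * yet, so search it unconditionally; filled segments are ruled
             * in or out by their upper bound.
             */
            if (i == dt->last_seg ||
                vac_cmp_itemptr(itemptr, &seg->last_dead_tuple) <= 0)
                return bsearch(itemptr, seg->dt_tids, seg->num_dead_tuples,
                               sizeof(ItemPointerData), vac_cmp_itemptr) != NULL;
        }
        return false;
    }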




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-04-07 Thread Andres Freund
Hi,

I've *not* read the history of this thread.  So I really might be
missing some context.


> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
> From: Claudio Freire 
> Date: Mon, 12 Sep 2016 23:36:42 -0300
> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
> 
> Turn the dead_tuples array into a structure composed of several
> exponentially bigger arrays, to enable usage of more than 1GB
> of work mem during vacuum and thus reduce the number of full
> index scans necessary to remove all dead tids when the memory is
> available.

>   * We are willing to use at most maintenance_work_mem (or perhaps
>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
> - * initially allocate an array of TIDs of that size, with an upper limit that
> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>   * depends on table size (this limit ensures we don't allocate a huge area
> - * uselessly for vacuuming small tables).  If the array threatens to 
> overflow,
> - * we suspend the heap scan phase and perform a pass of index cleanup and 
> page
> - * compaction, then resume the heap scan with an empty TID array.
> + * uselessly for vacuuming small tables). Additional arrays of increasingly
> + * large sizes are allocated as they become necessary.
> + *
> + * The TID array is thus represented as a list of multiple segments of
> + * varying size, beginning with the initial size of up to 128MB, and growing
> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
> + * is used up.

When the chunk size is 128MB, I'm a bit unconvinced that using
exponential growth is worth it. The allocator overhead can't be
meaningful in comparison to collecting 128MB dead tuples, the potential
waste is pretty big, and it increases memory fragmentation.


> + * Lookup in that structure proceeds sequentially in the list of segments,
> + * and with a binary search within each segment. Since segment's size grows
> + * exponentially, this retains O(N log N) lookup complexity.

N log N is a horrible lookup complexity.  That's the complexity of
*sorting* an entire array.  I think you might be trying to argue that
it's log(N) * log(N)? Once log(n) for the exponentially growing size of
segments, one for the binary search?

Afaics you could quite easily make it O(2 log(N)) by simply also doing
binary search over the segments.  Might not be worth it due to the small
constant involved normally.


> + * If the array threatens to overflow, we suspend the heap scan phase and
> + * perform a pass of index cleanup and page compaction, then resume the heap
> + * scan with an array of logically empty but already preallocated TID 
> segments
> + * to be refilled with more dead tuple TIDs.

Hm, it's not really the array that overflows, it's m_w_m that'd be
exceeded, right?


>  /*
> + * Minimum (starting) size of the dead_tuples array segments. Will allocate
> + * space for 128MB worth of tid pointers in the first segment, further 
> segments
> + * will grow in size exponentially. Don't make it too small or the segment 
> list
> + * will grow bigger than the sweetspot for search efficiency on big vacuums.
> + */
> +#define LAZY_MIN_TUPLES  Max(MaxHeapTuplesPerPage, (128<<20) / 
> sizeof(ItemPointerData))

That's not really the minimum, no? s/MIN/INIT/?


> +typedef struct DeadTuplesSegment
> +{
> + int num_dead_tuples;/* # of entries in the 
> segment */
> + int max_dead_tuples;/* # of entries 
> allocated in the segment */
> + ItemPointerData last_dead_tuple;/* Copy of the last dead tuple 
> (unset
> + 
>  * until the segment is fully
> + 
>  * populated) */
> + unsigned short padding;
> + ItemPointer dt_tids;/* Array of dead tuples */
> +}DeadTuplesSegment;

Whenever padding is needed, it should have an explanatory comment.  It's
certainly not obvious to me why it's needed here.


> @@ -1598,6 +1657,11 @@ lazy_vacuum_index(Relation indrel,
>   ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples;
>   ivinfo.strategy = vac_strategy;
>  
> + /* Finalize the current segment by setting its upper bound dead tuple */
> + seg = DeadTuplesCurrentSegment(vacrelstats);
> + if (seg->num_dead_tuples > 0)
> + seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];

Why don't we just maintain this here, for all of the segments?  Seems a
bit easier.
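
I.e., something along these lines (just a sketch using the patch's names),
setting the upper bound for every segment with data in one place:

    for (i = 0; i <= vacrelstats->dead_tuples.last_seg; i++)
    {
        DeadTuplesSegment *seg = &vacrelstats->dead_tuples.dt_segments[i];

        if (seg->num_dead_tuples > 0)
            seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
    }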


> @@ -1973,7 +2037,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats 
> *vacrelstats)
>  static void
>  lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>  {
> - longmaxtuples;
> + longmaxtuples,
> + mintuples;
>   int   

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-03-28 Thread Claudio Freire
On Wed, Feb 1, 2017 at 7:55 PM, Claudio Freire  wrote:
> On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada  wrote:
>> On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire  
>> wrote:
>>> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada  
>>> wrote:
 Thank you for updating the patch.

 Whole patch looks good to me except for the following one comment.
 This is the final comment from me.

 /*
  *  lazy_tid_reaped() -- is a particular tid deletable?
  *
  *  This has the right signature to be an IndexBulkDeleteCallback.
  *
  *  Assumes dead_tuples array is in sorted order.
  */
 static bool
 lazy_tid_reaped(ItemPointer itemptr, void *state)
 {
 LVRelStats *vacrelstats = (LVRelStats *) state;

 You might want to update the comment of lazy_tid_reaped() as well.
>>>
>>> I don't see the mismatch with reality there (if you consider
>>> "dead_tuples array" in the proper context, that is, the multiarray).
>>>
>>> What in particular do you find out of sync there?
>>
>> The current lazy_tid_reaped just finds a tid in a tid array using
>> bsearch, but in your patch lazy_tid_reaped handles multiple tid arrays
>> and the processing becomes more complicated. So I thought it's better to
>> add a description of this function.
>
> Alright, updated with some more remarks that seemed relevant

I just realized I never updated the early free patch after the
multiarray version.

So attached is a patch that frees the multiarray as early as possible
(just after finishing with index bulk deletes, right before doing
index cleanup and attempting truncation).

This should make the possibly big amount of memory available to other
processes for the duration of those tasks, which could be a long time
in some cases.
From f080f8377b7200ae9c2a02abfb0ef0bf6e2d5d69 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Tue, 28 Mar 2017 22:40:39 -0300
Subject: [PATCH] Vacuum: free dead tuples array as early as possible

Allow other parts of the system to benefit from the possibly
large amount of memory reserved for dead tuples after they're
no longer necessary.

While the memory would be freed when the command finishes, it
can sometimes be some considerable time between the time vacuum
is done with the array until the command finishes - mostly due
to truncate taking a long time.
---
 src/backend/commands/vacuumlazy.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 4a6895c..60a6c18 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -206,6 +206,7 @@ static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 	   ItemPointer itemptr);
 static void lazy_clear_dead_tuples(LVRelStats *vacrelstats);
+static void lazy_free_dead_tuples(LVRelStats *vacrelstats);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static inline int vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(Relation rel, Buffer buf,
@@ -1358,6 +1359,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
  PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
 
+	/* dead tuples no longer needed past this point */
+	lazy_free_dead_tuples(vacrelstats);
+
 	/* Do post-vacuum cleanup and statistics update for each index */
 	for (i = 0; i < nindexes; i++)
 		lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
@@ -2165,6 +2169,24 @@ lazy_clear_dead_tuples(LVRelStats *vacrelstats)
 }
 
 /*
+ * lazy_free_dead_tuples - reset all dead tuples segments
+ * and free all allocated memory
+ */
+static void
+lazy_free_dead_tuples(LVRelStats *vacrelstats)
+{
+	int			nseg;
+
+	for (nseg = 0; nseg < vacrelstats->dead_tuples.num_segs; nseg++)
+		pfree(vacrelstats->dead_tuples.dt_segments[nseg].dt_tids);
+	pfree(vacrelstats->dead_tuples.dt_segments);
+	vacrelstats->dead_tuples.last_seg = 0;
+	vacrelstats->dead_tuples.num_segs = 0;
+	vacrelstats->dead_tuples.num_entries = 0;
+	vacrelstats->dead_tuples.dt_segments = NULL;
+}
+
+/*
  *	lazy_tid_reaped() -- is a particular tid deletable?
  *
  *		This has the right signature to be an IndexBulkDeleteCallback.
-- 
1.8.4.5




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-02-04 Thread Masahiko Sawada
On Wed, Feb 1, 2017 at 11:55 PM, Claudio Freire  wrote:
> On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada  wrote:
>> On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire  
>> wrote:
>>> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada  
>>> wrote:
 Thank you for updating the patch.

 Whole patch looks good to me except for the following one comment.
 This is the final comment from me.

 /*
  *  lazy_tid_reaped() -- is a particular tid deletable?
  *
  *  This has the right signature to be an IndexBulkDeleteCallback.
  *
  *  Assumes dead_tuples array is in sorted order.
  */
 static bool
 lazy_tid_reaped(ItemPointer itemptr, void *state)
 {
 LVRelStats *vacrelstats = (LVRelStats *) state;

 You might want to update the comment of lazy_tid_reaped() as well.
>>>
>>> I don't see the mismatch with reality there (if you consider
>>> "dead_tuples array" in the proper context, that is, the multiarray).
>>>
>>> What in particular do you find out of sync there?
>>
>> The current lazy_tid_reaped just finds a tid in a tid array using
>> bsearch, but in your patch lazy_tid_reaped handles multiple tid arrays
>> and the processing becomes more complicated. So I thought it's better to
>> add a description of this function.
>
> Alright, updated with some more remarks that seemed relevant

Thank you for updating the patch.

The patch looks good to me. There is no review comment from me.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-02-01 Thread Claudio Freire
On Wed, Feb 1, 2017 at 6:13 PM, Masahiko Sawada  wrote:
> On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire  
> wrote:
>> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada  
>> wrote:
>>> Thank you for updating the patch.
>>>
>>> Whole patch looks good to me except for the following one comment.
>>> This is the final comment from me.
>>>
>>> /*
>>>  *  lazy_tid_reaped() -- is a particular tid deletable?
>>>  *
>>>  *  This has the right signature to be an IndexBulkDeleteCallback.
>>>  *
>>>  *  Assumes dead_tuples array is in sorted order.
>>>  */
>>> static bool
>>> lazy_tid_reaped(ItemPointer itemptr, void *state)
>>> {
>>> LVRelStats *vacrelstats = (LVRelStats *) state;
>>>
>>> You might want to update the comment of lazy_tid_reaped() as well.
>>
>> I don't see the mismatch with reality there (if you consider
>> "dead_tples array" in the proper context, that is, the multiarray).
>>
>> What in particular do you find out of sync there?
>
> The current lazy_tid_reaped just find a tid from a tid array using
> bsearch but in your patch lazy_tid_reaped handles multiple tid arrays
> and processing method become complicated. So I thought it's better to
> add the description of this function.

Alright, updated with some more remarks that seemed relevant
From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 318 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 ++
 3 files changed, 271 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..4a6895c 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -12,11 +12,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
- * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure proceeds sequentially in the list of segments,
+ * and with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(N log N) lookup complexity.
+ *
+ * If the array threatens to overflow, we suspend the heap scan phase and
+ * perform a pass of index cleanup and page compaction, then resume the heap
+ * scan with an array of logically empty but already preallocated TID segments
+ * to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -93,6 +106,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -104,6 +125,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated) */
+	unsigned short padding;
+	ItemPointer dt_tids;	/* Array of dead tuples */

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-02-01 Thread Masahiko Sawada
On Wed, Feb 1, 2017 at 10:02 PM, Claudio Freire  wrote:
> On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada  wrote:
>> Thank you for updating the patch.
>>
>> Whole patch looks good to me except for the following one comment.
>> This is the final comment from me.
>>
>> /*
>>  *  lazy_tid_reaped() -- is a particular tid deletable?
>>  *
>>  *  This has the right signature to be an IndexBulkDeleteCallback.
>>  *
>>  *  Assumes dead_tuples array is in sorted order.
>>  */
>> static bool
>> lazy_tid_reaped(ItemPointer itemptr, void *state)
>> {
>> LVRelStats *vacrelstats = (LVRelStats *) state;
>>
>> You might want to update the comment of lazy_tid_reaped() as well.
>
> I don't see the mismatch with reality there (if you consider
> "dead_tuples array" in the proper context, that is, the multiarray).
>
> What in particular do you find out of sync there?

The current lazy_tid_reaped just finds a tid in a tid array using
bsearch, but in your patch lazy_tid_reaped handles multiple tid arrays
and the processing becomes more complicated. So I thought it's better to
add a description of this function.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-02-01 Thread Claudio Freire
On Wed, Feb 1, 2017 at 5:47 PM, Masahiko Sawada  wrote:
> Thank you for updating the patch.
>
> Whole patch looks good to me except for the following one comment.
> This is the final comment from me.
>
> /*
>  *  lazy_tid_reaped() -- is a particular tid deletable?
>  *
>  *  This has the right signature to be an IndexBulkDeleteCallback.
>  *
>  *  Assumes dead_tuples array is in sorted order.
>  */
> static bool
> lazy_tid_reaped(ItemPointer itemptr, void *state)
> {
> LVRelStats *vacrelstats = (LVRelStats *) state;
>
> You might want to update the comment of lazy_tid_reaped() as well.

I don't see the mismatch with reality there (if you consider
"dead_tuples array" in the proper context, that is, the multiarray).

What in particular do you find out of sync there?




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-02-01 Thread Masahiko Sawada
On Tue, Jan 31, 2017 at 3:05 AM, Claudio Freire  wrote:
> On Mon, Jan 30, 2017 at 5:51 AM, Masahiko Sawada  
> wrote:
>> 
>>  * We are willing to use at most maintenance_work_mem (or perhaps
>>  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>>  * initially allocate an array of TIDs of that size, with an upper limit that
>>  * depends on table size (this limit ensures we don't allocate a huge area
>>  * uselessly for vacuuming small tables).  If the array threatens to 
>> overflow,
>>
>> I think that we need to update the above paragraph comment at top of
>> vacuumlazy.c file.
>
> Indeed, I missed that one. Fixing.
>
>>
>> 
>> +   numtuples = Max(numtuples,
>> MaxHeapTuplesPerPage);
>> +   numtuples = Min(numtuples, INT_MAX / 2);
>> +   numtuples = Min(numtuples, 2 *
>> pseg->max_dead_tuples);
>> +   numtuples = Min(numtuples,
>> MaxAllocSize / sizeof(ItemPointerData));
>> +   seg->dt_tids = (ItemPointer)
>> palloc(sizeof(ItemPointerData) * numtuples);
>>
>> Why numtuples is limited to "INT_MAX / 2" but not INT_MAX?
>
> I forgot to mention this one in the OP.
>
> Googling around, I found that some implementations of bsearch break with
> array sizes beyond INT_MAX/2 [1] (they'd overflow when computing the
> midpoint).
>
> Before this patch, this bsearch call had no way of reaching that size.
> An initial version of the patch (the one that allocated a big array
> with huge allocation) could reach that point, though, so I reduced the
> limit to play it safe. This latest version is back to the starting
> point, since it cannot allocate segments bigger than 1GB, but I opted
> to keep playing it safe and leave the reduced limit just in case.
>

Thanks, I understood.

>> 
>> @@ -1376,35 +1411,43 @@ lazy_vacuum_heap(Relation onerel, LVRelStats
>> *vacrelstats)
>> pg_rusage_init();
>> npages = 0;
>>
>> -   tupindex = 0;
>> -   while (tupindex < vacrelstats->num_dead_tuples)
>> +   segindex = 0;
>> +   tottuples = 0;
>> +   for (segindex = tupindex = 0; segindex <=
>> vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++)
>> {
>> -   BlockNumber tblk;
>> -   Buffer  buf;
>> -   Pagepage;
>> -   Sizefreespace;
>>
>> This is a minute thing but tupindex can be define inside of for loop.
>
> Right, changing.
>
>>
>> 
>> @@ -1129,10 +1159,13 @@ lazy_scan_heap(Relation onerel, int options,
>> LVRelStats *vacrelstats,
>>   * instead of doing a second scan.
>>   */
>>  if (nindexes == 0 &&
>> -vacrelstats->num_dead_tuples > 0)
>> +vacrelstats->dead_tuples.num_entries > 0)
>>  {
>>  /* Remove tuples from heap */
>> -lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, );
>> +Assert(vacrelstats->dead_tuples.last_seg == 0);/*
>> Should not need more
>> + *
>> than one segment per
>> + * page */
>>
>> I'm not sure we need to add Assert() here, but it seems to me that the
>> comment and code do not properly correspond, and the comment for
>> Assert() should be written above the Assert() line.
>
> Well, that assert is the one that found the second bug in
> lazy_clear_dead_tuples, so clearly it's not without merit.
>
> I'll rearrange the comments as you ask though.
>
>
> Updated and rebased v7 attached.
>
>
> [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=776671

Thank you for updating the patch.

Whole patch looks good to me except for the following one comment.
This is the final comment from me.

/*
 *  lazy_tid_reaped() -- is a particular tid deletable?
 *
 *  This has the right signature to be an IndexBulkDeleteCallback.
 *
 *  Assumes dead_tuples array is in sorted order.
 */
static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
LVRelStats *vacrelstats = (LVRelStats *) state;

You might want to update the comment of lazy_tid_reaped() as well.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-31 Thread Michael Paquier
On Tue, Jan 31, 2017 at 11:05 AM, Claudio Freire  wrote:
> Updated and rebased v7 attached.

Moved to CF 2017-03.
-- 
Michael




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-30 Thread Claudio Freire
On Mon, Jan 30, 2017 at 5:51 AM, Masahiko Sawada  wrote:
> 
>  * We are willing to use at most maintenance_work_mem (or perhaps
>  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>  * initially allocate an array of TIDs of that size, with an upper limit that
>  * depends on table size (this limit ensures we don't allocate a huge area
>  * uselessly for vacuuming small tables).  If the array threatens to overflow,
>
> I think that we need to update the above paragraph comment at top of
> vacuumlazy.c file.

Indeed, I missed that one. Fixing.

>
> 
> +   numtuples = Max(numtuples,
> MaxHeapTuplesPerPage);
> +   numtuples = Min(numtuples, INT_MAX / 2);
> +   numtuples = Min(numtuples, 2 *
> pseg->max_dead_tuples);
> +   numtuples = Min(numtuples,
> MaxAllocSize / sizeof(ItemPointerData));
> +   seg->dt_tids = (ItemPointer)
> palloc(sizeof(ItemPointerData) * numtuples);
>
> Why numtuples is limited to "INT_MAX / 2" but not INT_MAX?

I forgot to mention this one in the OP.

Googling around, I found that some implementations of bsearch break with
array sizes beyond INT_MAX/2 [1] (they'd overflow when computing the
midpoint).

Before this patch, this bsearch call had no way of reaching that size.
An initial version of the patch (the one that allocated a big array
with huge allocation) could reach that point, though, so I reduced the
limit to play it safe. This latest version is back to the starting
point, since it cannot allocate segments bigger than 1GB, but I opted
to keep playing it safe and leave the reduced limit just in case.
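
To illustrate the failure mode the reduced limit guards against (a sketch,
not code from any particular bsearch implementation):

    static int
    midpoint(int low, int high)
    {
        /* (low + high) / 2 can wrap around once low + high exceeds INT_MAX */
        return low + (high - low) / 2;  /* this form stays within [low, high] */
    }

Implementations that compute the midpoint the naive way misbehave once the
array has more than INT_MAX/2 elements, hence the cap at INT_MAX/2 even
though the segmented layout can no longer reach it.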

> 
> @@ -1376,35 +1411,43 @@ lazy_vacuum_heap(Relation onerel, LVRelStats
> *vacrelstats)
> pg_rusage_init();
> npages = 0;
>
> -   tupindex = 0;
> -   while (tupindex < vacrelstats->num_dead_tuples)
> +   segindex = 0;
> +   tottuples = 0;
> +   for (segindex = tupindex = 0; segindex <=
> vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++)
> {
> -   BlockNumber tblk;
> -   Buffer  buf;
> -   Pagepage;
> -   Sizefreespace;
>
> This is a minute thing but tupindex can be define inside of for loop.

Right, changing.

>
> 
> @@ -1129,10 +1159,13 @@ lazy_scan_heap(Relation onerel, int options,
> LVRelStats *vacrelstats,
>   * instead of doing a second scan.
>   */
>  if (nindexes == 0 &&
> -vacrelstats->num_dead_tuples > 0)
> +vacrelstats->dead_tuples.num_entries > 0)
>  {
>  /* Remove tuples from heap */
> -lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, );
> +Assert(vacrelstats->dead_tuples.last_seg == 0);/*
> Should not need more
> + *
> than one segment per
> + * page */
>
> I'm not sure we need to add Assert() here, but it seems to me that the
> comment and code do not properly correspond, and the comment for
> Assert() should be written above the Assert() line.

Well, that assert is the one that found the second bug in
lazy_clear_dead_tuples, so clearly it's not without merit.

I'll rearrange the comments as you ask though.


Updated and rebased v7 attached.


[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=776671
From d32610b0ad6b9413aa4b2d808019d3c67d578f3c Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 314 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 ++
 3 files changed, 268 insertions(+), 64 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..8cb614f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -12,11 +12,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
- * we suspend the heap scan phase and perform a pass of index 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-30 Thread Masahiko Sawada
On Thu, Jan 26, 2017 at 5:11 AM, Claudio Freire  wrote:
> On Wed, Jan 25, 2017 at 1:54 PM, Masahiko Sawada  
> wrote:
>> Thank you for updating the patch!
>>
>> +   /*
>> +* Quickly rule out by lower bound (should happen a lot) Upper bound 
>> was
>> +* already checked by segment search
>> +*/
>> +   if (vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples) < 
>> 0)
>> +   return false;
>>
>> I think that if the above result is 0, we can return true as itemptr
>> matched lower bound item pointer in rseg->dead_tuples.
>
> That's right. Possibly not a great speedup but... why not?
>
>>
>>  +typedef struct DeadTuplesSegment
>> +{
>> +   int num_dead_tuples;/* # of
>> entries in the segment */
>> +   int max_dead_tuples;/* # of
>> entries allocated in the segment */
>> +   ItemPointerData last_dead_tuple;/* Copy of the last
>> dead tuple (unset
>> +
>>   * until the segment is fully
>> +
>>   * populated) */
>> +   unsigned short padding;
>> +   ItemPointer dead_tuples;/* Array of dead tuples */
>> +}  DeadTuplesSegment;
>> +
>> +typedef struct DeadTuplesMultiArray
>> +{
>> +   int num_entries;/* current # of entries */
>> +   int max_entries;/* total # of slots
>> that can be allocated in
>> +* array */
>> +   int num_segs;   /* number of
>> dead tuple segments allocated */
>> +   int last_seg;   /* last dead
>> tuple segment with data (or 0) */
>> +   DeadTuplesSegment *dead_tuples; /* array of num_segs 
>> segments */
>> +}  DeadTuplesMultiArray;
>>
>> It's a matter of personal preference, but several variables all named
>> dead_tuples with different meanings confused me.
>> If we want to access the first dead tuple location of the first
>> segment, we need to write
>> 'vacrelstats->dead_tuples.dead_tuples.dead_tuples'. For example,
>> 'vacrelstats->dead_tuples.dt_segment.dt_array' would be clearer to me.
>
> Yes, I can see how that could be confusing.
>
> I went for vacrelstats->dead_tuples.dt_segments[i].dt_tids[j]

Thank you for updating.
Looks good to me.

>> +   nseg->num_dead_tuples = 0;
>> +   nseg->max_dead_tuples = 0;
>> +   nseg->dead_tuples = NULL;
>> +   vacrelstats->dead_tuples.num_segs++;
>> +   }
>> +   seg = DeadTuplesCurrentSegment(vacrelstats);
>> +   }
>> +   vacrelstats->dead_tuples.last_seg++;
>> +   seg = DeadTuplesCurrentSegment(vacrelstats);
>>
>> Because seg is always set later, I think the first line starting with
>> "seg = ..." is not necessary. Thoughts?
>
> That's correct.
>
> Attached a v6 with those changes (and rebased).
>
> Make check still passes.

Here is review comment of v6 patch.


 * We are willing to use at most maintenance_work_mem (or perhaps
 * autovacuum_work_mem) memory space to keep track of dead tuples.  We
 * initially allocate an array of TIDs of that size, with an upper limit that
 * depends on table size (this limit ensures we don't allocate a huge area
 * uselessly for vacuuming small tables).  If the array threatens to overflow,

I think that we need to update the above paragraph comment at top of
vacuumlazy.c file.


+   numtuples = Max(numtuples,
MaxHeapTuplesPerPage);
+   numtuples = Min(numtuples, INT_MAX / 2);
+   numtuples = Min(numtuples, 2 *
pseg->max_dead_tuples);
+   numtuples = Min(numtuples,
MaxAllocSize / sizeof(ItemPointerData));
+   seg->dt_tids = (ItemPointer)
palloc(sizeof(ItemPointerData) * numtuples);

Why numtuples is limited to "INT_MAX / 2" but not INT_MAX?


@@ -1376,35 +1411,43 @@ lazy_vacuum_heap(Relation onerel, LVRelStats
*vacrelstats)
pg_rusage_init();
npages = 0;

-   tupindex = 0;
-   while (tupindex < vacrelstats->num_dead_tuples)
+   segindex = 0;
+   tottuples = 0;
+   for (segindex = tupindex = 0; segindex <=
vacrelstats->dead_tuples.last_seg; tupindex = 0, segindex++)
{
-   BlockNumber tblk;
-   Buffer  buf;
-   Pagepage;
-   Sizefreespace;

This is a minute thing but tupindex can be define inside of for loop.


@@ -1129,10 +1159,13 @@ lazy_scan_heap(Relation onerel, int options,
LVRelStats *vacrelstats,
  * instead of doing a second scan.
  */
 if (nindexes == 0 &&
-

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-25 Thread Claudio Freire
On Wed, Jan 25, 2017 at 1:54 PM, Masahiko Sawada  wrote:
> Thank you for updating the patch!
>
> +   /*
> +* Quickly rule out by lower bound (should happen a lot) Upper bound 
> was
> +* already checked by segment search
> +*/
> +   if (vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples) < 0)
> +   return false;
>
> I think that if the above result is 0, we can return true as itemptr
> matched lower bound item pointer in rseg->dead_tuples.

That's right. Possibly not a great speedup but... why not?
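
Concretely, the check becomes something like this (a sketch; in this
version of the patch the segment's TID array is still called dead_tuples):

    int     cmp = vac_cmp_itemptr((void *) itemptr,
                                  (void *) rseg->dead_tuples);

    if (cmp < 0)
        return false;   /* below the segment's lower bound: not reaped */
    if (cmp == 0)
        return true;    /* exact hit on the lower bound itself */
    /* otherwise fall through to the binary search within the segment */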

>
>  +typedef struct DeadTuplesSegment
> +{
> +   int num_dead_tuples;/* # of
> entries in the segment */
> +   int max_dead_tuples;/* # of
> entries allocated in the segment */
> +   ItemPointerData last_dead_tuple;/* Copy of the last
> dead tuple (unset
> +
>   * until the segment is fully
> +
>   * populated) */
> +   unsigned short padding;
> +   ItemPointer dead_tuples;/* Array of dead tuples */
> +}  DeadTuplesSegment;
> +
> +typedef struct DeadTuplesMultiArray
> +{
> +   int num_entries;/* current # of entries */
> +   int max_entries;/* total # of slots
> that can be allocated in
> +* array */
> +   int num_segs;   /* number of
> dead tuple segments allocated */
> +   int last_seg;   /* last dead
> tuple segment with data (or 0) */
> +   DeadTuplesSegment *dead_tuples; /* array of num_segs segments 
> */
> +}  DeadTuplesMultiArray;
>
> It's a matter of personal preference, but several variables all named
> dead_tuples with different meanings confused me.
> If we want to access the first dead tuple location of the first
> segment, we need to write
> 'vacrelstats->dead_tuples.dead_tuples.dead_tuples'. For example,
> 'vacrelstats->dead_tuples.dt_segment.dt_array' would be clearer to me.

Yes, I can see how that could be confusing.

I went for vacrelstats->dead_tuples.dt_segments[i].dt_tids[j]

> +   nseg->num_dead_tuples = 0;
> +   nseg->max_dead_tuples = 0;
> +   nseg->dead_tuples = NULL;
> +   vacrelstats->dead_tuples.num_segs++;
> +   }
> +   seg = DeadTuplesCurrentSegment(vacrelstats);
> +   }
> +   vacrelstats->dead_tuples.last_seg++;
> +   seg = DeadTuplesCurrentSegment(vacrelstats);
>
> Because seg is always set later, I think the first line starting with
> "seg = ..." is not necessary. Thoughts?

That's correct.

Attached a v6 with those changes (and rebased).

Make check still passes.
From c89019089a71517befac0920f22ed75577dda6c6 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 291 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 ++
 3 files changed, 250 insertions(+), 59 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..1d2441f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -93,6 +93,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -104,6 +112,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated) */
+	unsigned short padding;
+	ItemPointer dt_tids;	/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-25 Thread Masahiko Sawada
On Tue, Jan 24, 2017 at 1:49 AM, Claudio Freire  wrote:
> On Fri, Jan 20, 2017 at 6:24 AM, Masahiko Sawada  
> wrote:
>> On Thu, Jan 19, 2017 at 8:31 PM, Claudio Freire  
>> wrote:
>>> On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova
>>>  wrote:
 28.12.2016 23:43, Claudio Freire:

 Attached v4 patches with the requested fixes.


 Sorry for being late, but the tests took a lot of time.
>>>
>>> I know. Takes me several days to run my test scripts once.
>>>
 create table t1 as select i, md5(random()::text) from
 generate_series(0,4) as i;
 create index md5_idx ON  t1(md5);
 update t1 set md5 = md5((random() * (100 + 500))::text);
 vacuum;

 Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
 while for old version it took three passes (1GB+1GB+0.9GB).
 Vacuum duration results:

 vanilla:
 LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
 patched:
 LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;

 We can see 30% vacuum speedup. I should note that this case can be
 considered
 as favorable to vanilla vacuum: the table is not that big, it has just one
 index
 and disk used is a fast fusionIO. We can expect even more gain on slower
 disks.

 Thank you again for the patch. Hope to see it in 10.0.
>>>
>>> Cool. Thanks for the review and the tests.
>>>
>>
>> I encountered a bug with following scenario.
>> 1. Create table and disable autovacuum on that table.
>> 2. Make about 20 dead tuples on the table.
>> 3. SET maintenance_work_mem TO 1024
>> 4. VACUUM
>>
>> @@ -729,7 +759,7 @@ lazy_scan_heap(Relation onerel, int options,
>> LVRelStats *vacrelstats,
>>  * not to reset latestRemovedXid since we want
>> that value to be
>>  * valid.
>>  */
>> -   vacrelstats->num_dead_tuples = 0;
>> +   lazy_clear_dead_tuples(vacrelstats);
>> vacrelstats->num_index_scans++;
>>
>> /* Report that we are once again scanning the heap */
>>
>> I think that we should do vacrelstats->dead_tuples.num_entries = 0 as
>> well in lazy_clear_dead_tuples(). Once the number of dead tuples
>> reaches maintenance_work_mem, lazy_scan_heap can never finish.
>
> That's right.
>
> I added a test for it in the attached patch set, which uncovered
> another bug in lazy_clear_dead_tuples, and took the opportunity to
> rebase.
>
> On Mon, Jan 23, 2017 at 1:06 PM, Alvaro Herrera
>  wrote:
>> I pushed this patch after rewriting it rather completely.  I added
>> tracing notices to inspect the blocks it was prefetching and observed
>> that the original coding was failing to prefetch the final streak of
>> blocks in the table, which is an important oversight considering that it
>> may very well be that those are the only blocks to read at all.
>>
>> I timed vacuuming a 4000-block table in my laptop (single SSD disk;
>> dropped FS caches after deleting all rows in table, so that vacuum has
>> to read all blocks from disk); it changes from 387ms without patch to
>> 155ms with patch.  I didn't measure how much it takes to run the other
>> steps in the vacuum, but it's clear that the speedup for the truncation
>> phase is considerable.
>>
> ¡Thanks, Claudio!
>
> Cool.
>
> Though it wasn't the first time this idea has been floating around, I
> can't take all the credit.
>
>
> On Fri, Jan 20, 2017 at 6:25 PM, Alvaro Herrera
>  wrote:
>> FWIW, I think this patch is completely separate from the maint_work_mem
>> patch and should have had its own thread and its own commitfest entry.
>> I intend to get a look at the other patch next week, after pushing this
>> one.
>
> That's because it did have it, and was left in limbo due to lack of
> testing on SSDs. I just had to adopt it here because otherwise tests
> took way too long.

Thank you for updating the patch!

+   /*
+* Quickly rule out by lower bound (should happen a lot) Upper bound was
+* already checked by segment search
+*/
+   if (vac_cmp_itemptr((void *) itemptr, (void *) rseg->dead_tuples) < 0)
+   return false;

I think that if the above result is 0, we can return true as itemptr
matched lower bound item pointer in rseg->dead_tuples.

 +typedef struct DeadTuplesSegment
+{
+   int num_dead_tuples;/* # of
entries in the segment */
+   int max_dead_tuples;/* # of
entries allocated in the segment */
+   ItemPointerData last_dead_tuple;/* Copy of the last
dead tuple (unset
+
  * until the segment is fully
+
  * populated) */
+   unsigned short padding;
+   ItemPointer dead_tuples;

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-23 Thread Claudio Freire
On Fri, Jan 20, 2017 at 6:24 AM, Masahiko Sawada  wrote:
> On Thu, Jan 19, 2017 at 8:31 PM, Claudio Freire  
> wrote:
>> On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova
>>  wrote:
>>> 28.12.2016 23:43, Claudio Freire:
>>>
>>> Attached v4 patches with the requested fixes.
>>>
>>>
>>> Sorry for being late, but the tests took a lot of time.
>>
>> I know. Takes me several days to run my test scripts once.
>>
>>> create table t1 as select i, md5(random()::text) from
>>> generate_series(0,4) as i;
>>> create index md5_idx ON  t1(md5);
>>> update t1 set md5 = md5((random() * (100 + 500))::text);
>>> vacuum;
>>>
>>> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
>>> while for old version it took three passes (1GB+1GB+0.9GB).
>>> Vacuum duration results:
>>>
>>> vanilla:
>>> LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
>>> patched:
>>> LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;
>>>
>>> We can see 30% vacuum speedup. I should note that this case can be
>>> considered
>>> as favorable to vanilla vacuum: the table is not that big, it has just one
>>> index
>>> and disk used is a fast fusionIO. We can expect even more gain on slower
>>> disks.
>>>
>>> Thank you again for the patch. Hope to see it in 10.0.
>>
>> Cool. Thanks for the review and the tests.
>>
>
> I encountered a bug with following scenario.
> 1. Create table and disable autovacuum on that table.
> 2. Make about 20 dead tuples on the table.
> 3. SET maintenance_work_mem TO 1024
> 4. VACUUM
>
> @@ -729,7 +759,7 @@ lazy_scan_heap(Relation onerel, int options,
> LVRelStats *vacrelstats,
>  * not to reset latestRemovedXid since we want
> that value to be
>  * valid.
>  */
> -   vacrelstats->num_dead_tuples = 0;
> +   lazy_clear_dead_tuples(vacrelstats);
> vacrelstats->num_index_scans++;
>
> /* Report that we are once again scanning the heap */
>
> I think that we should do vacrelstats->dead_tuples.num_entries = 0 as
> well in lazy_clear_dead_tuples(). Once the number of dead tuples
> reaches maintenance_work_mem, lazy_scan_heap can never finish.

That's right.

I added a test for it in the attached patch set, which uncovered
another bug in lazy_clear_dead_tuples, and took the opportunity to
rebase.
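
The shape of the fix is roughly this (a sketch; the exact body of
lazy_clear_dead_tuples in the attached patch may differ):

    static void
    lazy_clear_dead_tuples(LVRelStats *vacrelstats)
    {
        int     nseg;

        /* logically empty every segment, keeping the allocations around */
        for (nseg = 0; nseg < vacrelstats->dead_tuples.num_segs; nseg++)
            vacrelstats->dead_tuples.dt_segments[nseg].num_dead_tuples = 0;
        vacrelstats->dead_tuples.last_seg = 0;
        vacrelstats->dead_tuples.num_entries = 0;   /* the missing reset */
    }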

On Mon, Jan 23, 2017 at 1:06 PM, Alvaro Herrera
 wrote:
> I pushed this patch after rewriting it rather completely.  I added
> tracing notices to inspect the blocks it was prefetching and observed
> that the original coding was failing to prefetch the final streak of
> blocks in the table, which is an important oversight considering that it
> may very well be that those are the only blocks to read at all.
>
> I timed vacuuming a 4000-block table in my laptop (single SSD disk;
> dropped FS caches after deleting all rows in table, so that vacuum has
> to read all blocks from disk); it changes from 387ms without patch to
> 155ms with patch.  I didn't measure how much it takes to run the other
> steps in the vacuum, but it's clear that the speedup for the truncation
> phase is considerable.
>
¡Thanks, Claudio!

Cool.

Though it wasn't the first time this idea has been floating around, I
can't take all the credit.


On Fri, Jan 20, 2017 at 6:25 PM, Alvaro Herrera
 wrote:
> FWIW, I think this patch is completely separate from the maint_work_mem
> patch and should have had its own thread and its own commitfest entry.
> I intend to get a look at the other patch next week, after pushing this
> one.

That's because it did have it, and was left in limbo due to lack of
testing on SSDs. I just had to adopt it here because otherwise tests
took way too long.
From 92c610b4ed064afba0f914efd78a03c61f4742c6 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c| 288 ---
 src/test/regress/expected/vacuum.out |   8 +
 src/test/regress/sql/vacuum.sql  |  10 ++
 3 files changed, 247 insertions(+), 59 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 005440e..0b1d347 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -93,6 +93,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-23 Thread Alvaro Herrera
I think this patch no longer applies because of conflicts with the one I
just pushed.  Please rebase.

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-23 Thread Alvaro Herrera
I pushed this patch after rewriting it rather completely.  I added
tracing notices to inspect the blocks it was prefetching and observed
that the original coding was failing to prefetch the final streak of
blocks in the table, which is an important oversight considering that it
may very well be that those are the only blocks to read at all.

I timed vacuuming a 4000-block table in my laptop (single SSD disk;
dropped FS caches after deleting all rows in table, so that vacuum has
to read all blocks from disk); it changes from 387ms without patch to
155ms with patch.  I didn't measure how much it takes to run the other
steps in the vacuum, but it's clear that the speedup for the truncation
phase is considerable.

¡Thanks, Claudio!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-20 Thread Alvaro Herrera
Alvaro Herrera wrote:

> There was no discussion whatsoever of the "prefetch" patch in this
> thread; and as far as I can see, nobody even mentioned such an idea in
> the thread.  This prefetch patch appeared out of the blue and there was
> no discussion about it that I can see.  Now I was about to push it after
> some minor tweaks, and went to search where was its justification, only
> to see that there was none.  Did anybody run tests with this patch?
> 
> I attach it now one more time.  My version is based on the latest
> Claudio posted at
> https://postgr.es/m/CAGTBQpa464RugxYwxLTtDi=syv9gngfcjk8uzb2fr6nddqu...@mail.gmail.com
> I don't know if there are differences to the version first posted.
> I only changed the magic number 32 to a #define, and added a
> CHECK_FOR_INTERRUPTS in the prefetching loop.

Ah, I found the justification here:
https://www.postgresql.org/message-id/flat/CAGTBQpa464RugxYwxLTtDi%3DSyv9GnGFcJK8uZb2fR6NDDqULaw%40mail.gmail.com#CAGTBQpbayY-t5-ySW19yQs1dBqvV6dm8dmdpTv_FWXmDC0A0cQ%40mail.gmail.com
apparently the truncate scan is 4x-6x faster with this prefetching.
Nice!

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-20 Thread Alvaro Herrera
You posted two patches with this preamble:

Claudio Freire wrote:

> Attached is the raw output of the test, the script used to create it,
> and just in case the patch set used. I believe it's the same as the
> last one I posted, just rebased.

There was no discussion whatsoever of the "prefetch" patch in this
thread; and as far as I can see, nobody even mentioned such an idea in
the thread.  This prefetch patch appeared out of the blue and there was
no discussion about it that I can see.  Now I was about to push it after
some minor tweaks, and went to search where was its justification, only
to see that there was none.  Did anybody run tests with this patch?

I attach it now one more time.  My version is based on the latest
Claudio posted at
https://postgr.es/m/CAGTBQpa464RugxYwxLTtDi=syv9gngfcjk8uzb2fr6nddqu...@mail.gmail.com
I don't know if there are differences to the version first posted.
I only changed the magic number 32 to a #define, and added a
CHECK_FOR_INTERRUPTS in the prefetching loop.

FWIW, I think this patch is completely separate from the maint_work_mem
patch and should have had its own thread and its own commitfest entry.
I intend to get a look at the other patch next week, after pushing this
one.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From bd10f56beb365e3cbfadb87d27f8aeb4b33f880e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 20 Jan 2017 18:04:51 -0300
Subject: [PATCH] Prefetch blocks during lazy vacuum's truncation scan

The truncation scan can be sped up on rotating media by prefetching
blocks in forward direction, because this is a cue for the operating
system's readahead to kick in, so that by the time we request those
blocks, they are already in memory.

Author: Claudio Freire
Discussion: 
https://postgr.es/m/cagtbqpa6nfgo_6g_y_7zqx8l9gchdsqkydo1tguh791z6py...@mail.gmail.com
---
 src/backend/commands/vacuumlazy.c | 36 ++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a2999b3..e676072 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -98,6 +98,12 @@
  */
 #define SKIP_PAGES_THRESHOLD   ((BlockNumber) 32)
 
+/*
+ * Size of the prefetch window for lazy vacuum backwards truncation scan.
+ * Needs to be a power of 2.
+ */
+#define PREFETCH_SIZE  32
+
 typedef struct LVRelStats
 {
/* hasindex = true means two-pass strategy; false means one-pass */
@@ -1825,14 +1831,24 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 static BlockNumber
 count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 {
-   BlockNumber blkno;
+   BlockNumber blkno,
+   prefetchBlkno;
instr_time  starttime;
 
/* Initialize the starttime if we check for conflicting lock requests */
INSTR_TIME_SET_CURRENT(starttime);
 
-   /* Strange coding of loop control is needed because blkno is unsigned */
+   /*
+* Start checking blocks at what we believe relation end to be and move
+* backwards.  (Strange coding of loop control is needed because blkno is
+* unsigned.)  To make it a bit faster, we prefetch a bunch of blocks at a
+* time in forward direction, so that OS-level readahead can kick in to
+* speed this up.
+*/
blkno = vacrelstats->rel_pages;
+   prefetchBlkno = blkno & ~(PREFETCH_SIZE - 1);
+   prefetchBlkno =
+   (prefetchBlkno > PREFETCH_SIZE) ? prefetchBlkno - PREFETCH_SIZE : 0;
while (blkno > vacrelstats->nonempty_pages)
{
Buffer  buf;
@@ -1882,6 +1898,22 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
blkno--;
 
+   /* If we haven't prefetched this lot yet, do so now. */
+   if (blkno <= prefetchBlkno)
+   {
+   BlockNumber pblkno;
+
+   if (prefetchBlkno >= PREFETCH_SIZE)
+   prefetchBlkno -= PREFETCH_SIZE;
+   else
+   prefetchBlkno = 0;
+   for (pblkno = prefetchBlkno; pblkno < blkno; pblkno++)
+   {
+   PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
+   CHECK_FOR_INTERRUPTS();
+   }
+   }
+
buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
								 RBM_NORMAL, vac_strategy);
 
-- 
2.1.4




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-20 Thread Masahiko Sawada
On Thu, Jan 19, 2017 at 8:31 PM, Claudio Freire  wrote:
> On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova
>  wrote:
>> 28.12.2016 23:43, Claudio Freire:
>>
>> Attached v4 patches with the requested fixes.
>>
>>
>> Sorry for being late, but the tests took a lot of time.
>
> I know. Takes me several days to run my test scripts once.
>
>> create table t1 as select i, md5(random()::text) from
>> generate_series(0,4) as i;
>> create index md5_idx ON  t1(md5);
>> update t1 set md5 = md5((random() * (100 + 500))::text);
>> vacuum;
>>
>> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
>> while for old version it took three passes (1GB+1GB+0.9GB).
>> Vacuum duration results:
>>
>> vanilla:
>> LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
>> patched:
>> LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;
>>
>> We can see 30% vacuum speedup. I should note that this case can be
>> considered
>> as favorable to vanilla vacuum: the table is not that big, it has just one
>> index
>> and disk used is a fast fusionIO. We can expect even more gain on slower
>> disks.
>>
>> Thank you again for the patch. Hope to see it in 10.0.
>
> Cool. Thanks for the review and the tests.
>

I encountered a bug with following scenario.
1. Create table and disable autovacuum on that table.
2. Make about 20 dead tuples on the table.
3. SET maintenance_work_mem TO 1024
4. VACUUM

@@ -729,7 +759,7 @@ lazy_scan_heap(Relation onerel, int options,
LVRelStats *vacrelstats,
 * not to reset latestRemovedXid since we want
that value to be
 * valid.
 */
-   vacrelstats->num_dead_tuples = 0;
+   lazy_clear_dead_tuples(vacrelstats);
vacrelstats->num_index_scans++;

/* Report that we are once again scanning the heap */

I think that we should do vacrelstats->dead_tuples.num_entries = 0 as
well in lazy_clear_dead_tuples(). Once the amount of dead tuples
reaches maintenance_work_mem, lazy_scan_heap can never finish.
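
To make the failure mode concrete, here is a minimal standalone sketch
(simplified struct layout and invented field sizes, not the patch's actual
code) of a clear routine that resets both levels, per the suggestion above:

#include <stdio.h>

typedef struct DeadTuplesSegment
{
    int     num_dead_tuples;    /* entries used in this segment */
    int     max_dead_tuples;    /* entries allocated in this segment */
} DeadTuplesSegment;

typedef struct DeadTuplesMultiArray
{
    int     num_entries;        /* entries used across all segments */
    int     max_entries;        /* total entries the budget allows */
    int     last_seg;           /* last segment holding data */
    DeadTuplesSegment segs[8];  /* stand-in for the repalloc'd array */
} DeadTuplesMultiArray;

static void
lazy_clear_dead_tuples(DeadTuplesMultiArray *dt)
{
    int     i;

    for (i = 0; i <= dt->last_seg; i++)
        dt->segs[i].num_dead_tuples = 0;
    dt->last_seg = 0;
    dt->num_entries = 0;        /* the reset pointed out above */
}

int
main(void)
{
    DeadTuplesMultiArray dt = {0};

    dt.num_entries = 5;
    dt.last_seg = 1;
    dt.segs[0].num_dead_tuples = 3;
    dt.segs[1].num_dead_tuples = 2;
    lazy_clear_dead_tuples(&dt);
    printf("entries after clear: %d\n", dt.num_entries);   /* prints 0 */
    return 0;
}

Without the num_entries reset, the "array full" check keeps firing after the
first round of index vacuuming and the outer loop never terminates.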

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-19 Thread Claudio Freire
On Thu, Jan 19, 2017 at 6:33 AM, Anastasia Lubennikova
 wrote:
> 28.12.2016 23:43, Claudio Freire:
>
> Attached v4 patches with the requested fixes.
>
>
> Sorry for being late, but the tests took a lot of time.

I know. Takes me several days to run my test scripts once.

> create table t1 as select i, md5(random()::text) from
> generate_series(0,4) as i;
> create index md5_idx ON  t1(md5);
> update t1 set md5 = md5((random() * (100 + 500))::text);
> vacuum;
>
> Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
> while for old version it took three passes (1GB+1GB+0.9GB).
> Vacuum duration results:
>
> vanilla:
> LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
> patched:
> LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;
>
> We can see 30% vacuum speedup. I should note that this case can be
> considered
> as favorable to vanilla vacuum: the table is not that big, it has just one
> index
> and disk used is a fast fusionIO. We can expect even more gain on slower
> disks.
>
> Thank you again for the patch. Hope to see it in 10.0.

Cool. Thanks for the review and the tests.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-01-19 Thread Anastasia Lubennikova

28.12.2016 23:43, Claudio Freire:

Attached v4 patches with the requested fixes.


Sorry for being late, but the tests took a lot of time.

create table t1 as select i, md5(random()::text) from 
generate_series(0,4) as i;
create index md5_idx ON  t1(md5);
update t1 set md5 = md5((random() * (100 + 500))::text);
vacuum;

Patched vacuum used 2.9Gb of memory and vacuumed the index in one pass,
while for old version it took three passes (1GB+1GB+0.9GB).
Vacuum duration results:

vanilla:
LOG: duration: 4359006.327 ms  statement: vacuum verbose t1;
patched:
LOG: duration: 3076827.378 ms  statement: vacuum verbose t1;

We can see 30% vacuum speedup. I should note that this case can be 
considered
as favorable to vanilla vacuum: the table is not that big, it has just 
one index
and disk used is a fast fusionIO. We can expect even more gain on slower 
disks.
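
(As a rough cross-check, assuming the usual 6-byte ItemPointerData: 2.9GB of
dead-tuple memory is about 2.9 * 2^30 / 6 ≈ 520 million TIDs, while a 1GB cap
holds about 179 million, so the same workload needs three index-vacuum passes
on vanilla, which matches the 1GB+1GB+0.9GB split above.)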


Thank you again for the patch. Hope to see it in 10.0.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-28 Thread Claudio Freire
On Wed, Dec 28, 2016 at 3:41 PM, Claudio Freire  wrote:
>> Anyway, I found the problem that had caused segfault.
>>
>> for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex =
>> 0, segindex++)
>> {
>> DeadTuplesSegment *seg =
>> &(vacrelstats->dead_tuples.dead_tuples[segindex]);
>> int num_dead_tuples = seg->num_dead_tuples;
>>
>> while (tupindex < num_dead_tuples)
>> ...
>>
>> You rely on the value of tupindex here, while during the very first pass the
>> 'tupindex' variable
>> may contain any garbage. And it happend that on my system there was negative
>> value
>> as I found inspecting core dump:
>>
>> (gdb) info locals
>> num_dead_tuples = 5
>> tottuples = 0
>> tupindex = -1819017215
>>
>> Which leads to failure in the next line
>> tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
>>
>> The solution is to move this assignment inside the cycle.
>
> Good catch. I read that line suspecting that very same thing but
> somehow I was blind to it.

Attached v4 patches with the requested fixes.
From 988527b49ec0f48ea7fc85c7533e8eaa07d6fc3b Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Fri, 2 Sep 2016 23:33:48 -0300
Subject: [PATCH 1/2] Vacuum: prefetch buffers on backward scan

This patch speeds up truncation by prefetching buffers during
the backward scan phase just prior to truncation. This optimization
helps rotating media because disk rotation induces a buffer-to-buffer
latency of roughly a whole rotation when reading backwards; such
disks are usually optimized for forward scans. Prefetching buffers
in larger chunks ameliorates the issue considerably and is otherwise
harmless. The prefetch amount should also not be an issue on SSD,
which won't benefit from the optimization, but won't be hurt either.
---
 src/backend/commands/vacuumlazy.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..854bce3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1825,7 +1825,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 static BlockNumber
 count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 {
-	BlockNumber blkno;
+	BlockNumber blkno, prefetchBlkno;
 	instr_time	starttime;
 
 	/* Initialize the starttime if we check for conflicting lock requests */
@@ -1833,6 +1833,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 	/* Strange coding of loop control is needed because blkno is unsigned */
 	blkno = vacrelstats->rel_pages;
+	prefetchBlkno = blkno & ~0x1f;
+	prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
 	while (blkno > vacrelstats->nonempty_pages)
 	{
 		Buffer		buf;
@@ -1882,6 +1884,23 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 		blkno--;
 
+		/*
+		 * Speed up reads on rotating rust by prefetching a few blocks ahead.
+		 * Backward scans on rotary disks are slow if done one buffer at a time
+		 * because readahead won't kick in on most OSes, and each buffer will
+		 * have to wait for the platter to do a full rotation.
+		 * Should be harmless on SSD.
+		 */
+		if (blkno <= prefetchBlkno) {
+			BlockNumber pblkno;
+			if (prefetchBlkno >= 32)
+prefetchBlkno -= 32;
+			else
+prefetchBlkno = 0;
+			for (pblkno = prefetchBlkno; pblkno < blkno; pblkno++)
+PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
+		}
+
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
  RBM_NORMAL, vac_strategy);
 
-- 
1.8.4.5

From d360720415a2db5d5285a757c9a06e7ed854c5c5 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 2/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c | 302 ++
 1 file changed, 237 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 854bce3..8b714e0 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -93,11 +93,40 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-28 Thread Claudio Freire
On Wed, Dec 28, 2016 at 10:26 AM, Anastasia Lubennikova
 wrote:
> 27.12.2016 20:14, Claudio Freire:
>
> On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
>  wrote:
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> 1417    tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
> (gdb) bt
> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> #1  0x00693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9,
> vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')
> at vacuumlazy.c:1337
> #2  0x00691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9,
> params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
> #3  0x0069191f in vacuum_rel (relid=1247, relation=0x0, options=9,
> params=0x7ffe0f866310) at vacuum.c:1418
>
> Those line numbers don't match my code.
>
> Which commit are you based on?
>
> My tree is (currently) based on 71f996d22125eb6cfdbee6094f44370aa8dec610
>
>
> Hm, my branch is based on 71f996d22125eb6cfdbee6094f44370aa8dec610 as well.
> I merely applied patches
> 0001-Vacuum-prefetch-buffers-on-backward-scan-v3.patch
> and 0002-Vacuum-allow-using-more-than-1GB-work-mem-v3.patch
> then ran configure and make as usual.
> Am I doing something wrong?

Doesn't sound wrong. Maybe it's my tree the unclean one. I'll have to
do a clean checkout to verify.

> Anyway, I found the problem that had caused segfault.
>
> for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; tupindex =
> 0, segindex++)
> {
> DeadTuplesSegment *seg =
> &(vacrelstats->dead_tuples.dead_tuples[segindex]);
> int num_dead_tuples = seg->num_dead_tuples;
>
> while (tupindex < num_dead_tuples)
> ...
>
> You rely on the value of tupindex here, while during the very first pass the
> 'tupindex' variable
> may contain any garbage. And it happend that on my system there was negative
> value
> as I found inspecting core dump:
>
> (gdb) info locals
> num_dead_tuples = 5
> tottuples = 0
> tupindex = -1819017215
>
> Which leads to failure in the next line
> tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
>
> The solution is to move this assignment inside the cycle.

Good catch. I read that line suspecting that very same thing but
somehow I was blind to it.

> I've read the second patch.
>
> 1. What is the reason to inline vac_cmp_itemptr() ?

Performance, mostly. By inlining some transformations can be applied
that wouldn't be possible otherwise. During the binary search, this
matters performance-wise.
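
For readers not deep in the patch, the lookup this affects is essentially a
two-level binary search: first find the segment whose upper bound covers the
TID being probed, then binary-search inside that segment. A rough standalone
model of that shape (plain integers instead of ItemPointerData, invented
names, not the patch code):

#include <stdbool.h>
#include <stdio.h>

typedef struct Segment
{
    const int  *keys;       /* sorted ascending */
    int         nkeys;
    int         last_key;   /* copy of keys[nkeys - 1], the segment's bound */
} Segment;

static bool
seg_contains(const Segment *seg, int key)
{
    int     lo = 0, hi = seg->nkeys - 1;

    while (lo <= hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (seg->keys[mid] == key)
            return true;
        if (seg->keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}

static bool
multi_contains(const Segment *segs, int nsegs, int key)
{
    int     lo = 0, hi = nsegs - 1;

    /* outer search: first segment whose last_key is >= key */
    while (lo < hi)
    {
        int     mid = lo + (hi - lo) / 2;

        if (segs[mid].last_key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return nsegs > 0 && seg_contains(&segs[lo], key);
}

int
main(void)
{
    static const int k0[] = {1, 4, 9};
    static const int k1[] = {12, 20, 21, 30};
    Segment     segs[] = {{k0, 3, 9}, {k1, 4, 30}};

    printf("%d %d\n", multi_contains(segs, 2, 20), multi_contains(segs, 2, 5));
    /* prints "1 0" */
    return 0;
}

The comparator sits in the innermost loop of both levels, which is why
inlining it pays off.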

> 2. +#define LAZY_MIN_TUPLESMax(MaxHeapTuplesPerPage, (128<<20) /
> sizeof(ItemPointerData))
> What does 128<<20 mean? Why not 1<<27? I'd ask you to replace it with named
> constant,
> or at least add a comment.

I tought it was more readable like that, since 1<<20 is known to be
"MB", that reads as "128 MB".

I guess I can add a comment saying so.
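
(For reference: 128<<20 is 134,217,728 bytes, and divided by the 6-byte
ItemPointerData that is roughly 22.4 million TID slots in the first segment.)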

> I'll share my results of performance testing it in a few days.

Thanks




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-28 Thread Anastasia Lubennikova

27.12.2016 20:14, Claudio Freire:

On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
  wrote:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
1417    tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
(gdb) bt
#0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
#1  0x00693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9,
vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')
 at vacuumlazy.c:1337
#2  0x00691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9,
params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
#3  0x0069191f in vacuum_rel (relid=1247, relation=0x0, options=9,
params=0x7ffe0f866310) at vacuum.c:1418

Those line numbers don't match my code.

Which commit are you based on?

My tree is (currently) based on 71f996d22125eb6cfdbee6094f44370aa8dec610


Hm, my branch is based on 71f996d22125eb6cfdbee6094f44370aa8dec610 as well.
I merely applied patches 
0001-Vacuum-prefetch-buffers-on-backward-scan-v3.patch

and 0002-Vacuum-allow-using-more-than-1GB-work-mem-v3.patch
then ran configure and make as usual.
Am I doing something wrong?

Anyway, I found the problem that had caused segfault.

for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; 
tupindex = 0, segindex++)

{
DeadTuplesSegment *seg = 
&(vacrelstats->dead_tuples.dead_tuples[segindex]);

int num_dead_tuples = seg->num_dead_tuples;

while (tupindex < num_dead_tuples)
...

You rely on the value of tupindex here, while during the very first pass 
the 'tupindex' variable
may contain any garbage. And it happend that on my system there was 
negative value

as I found inspecting core dump:

(gdb) info locals
num_dead_tuples = 5
tottuples = 0
tupindex = -1819017215

Which leads to failure in the next line
tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);

The solution is to move this assignment inside the cycle.
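
A stripped-down model of the corrected loop shape (plain C, not the patch
code): the point is simply that the per-segment index has to be reset at the
top of the body, so that even the first segment starts from zero.

#include <stdio.h>

int
main(void)
{
    int     seg_sizes[] = {3, 5, 2};    /* pretend per-segment tuple counts */
    int     segindex;
    int     tupindex;
    int     total = 0;

    for (segindex = 0; segindex < 3; segindex++)
    {
        tupindex = 0;       /* reset moved inside the cycle, as suggested */
        while (tupindex < seg_sizes[segindex])
        {
            total++;        /* the real code removes one dead tuple here */
            tupindex++;
        }
    }
    printf("visited %d tuples\n", total);   /* prints 10 */
    return 0;
}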

I've read the second patch.

1. What is the reason to inline vac_cmp_itemptr() ?

2. +#define LAZY_MIN_TUPLESMax(MaxHeapTuplesPerPage, (128<<20) / 
sizeof(ItemPointerData))
What does 128<<20 mean? Why not 1<<27? I'd ask you to replace it with 
named constant,

or at least add a comment.

I'll share my results of performance testing it in a few days.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-28 Thread Anastasia Lubennikova

27.12.2016 16:54, Alvaro Herrera:

Anastasia Lubennikova wrote:


I ran configure using following set of flags:
  ./configure --enable-tap-tests --enable-cassert --enable-debug
--enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
And then ran make check. Here is the stacktrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
1417    tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);

This doesn't make sense, since the patch removes the "tupindex"
variable in that function.



Sorry, I didn't get what you mean.

In  0002-Vacuum-allow-using-more-than-1GB-work-mem-v3.patch

-tupindex = 0;
-while (tupindex < vacrelstats->num_dead_tuples)
+segindex = 0;
+tottuples = 0;
+for (segindex = 0; segindex <= vacrelstats->dead_tuples.last_seg; 
tupindex = 0, segindex++)

 {

variable is still there, it was just moved to the loop.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-27 Thread Claudio Freire
On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
 wrote:
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> 1417    tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
> (gdb) bt
> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> #1  0x00693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9,
> vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')
> at vacuumlazy.c:1337
> #2  0x00691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9,
> params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
> #3  0x0069191f in vacuum_rel (relid=1247, relation=0x0, options=9,
> params=0x7ffe0f866310) at vacuum.c:1418


Those line numbers don't match my code.

Which commit are you based on?

My tree is (currently) based on 71f996d22125eb6cfdbee6094f44370aa8dec610




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-27 Thread Claudio Freire
On Tue, Dec 27, 2016 at 10:54 AM, Alvaro Herrera
 wrote:
> Anastasia Lubennikova wrote:
>
>> I ran configure using following set of flags:
>>  ./configure --enable-tap-tests --enable-cassert --enable-debug
>> --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
>> And then ran make check. Here is the stacktrace:
>>
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
>> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
>> 1417    tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);
>
> This doesn't make sense, since the patch removes the "tupindex"
> variable in that function.

The variable is still there. It just has a slightly different meaning
(index within the current segment, rather than global index).


On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
 wrote:
> 23.12.2016 22:54, Claudio Freire:
>
> On Fri, Dec 23, 2016 at 1:39 PM, Anastasia Lubennikova
>  wrote:
>
> I found the reason. I configure postgres with CFLAGS="-O0" and it causes
> Segfault on initdb.
> It works fine and passes tests with default configure flags, but I'm pretty
> sure that we should fix segfault before testing the feature.
> If you need it, I'll send a core dump.
>
> I just ran it with CFLAGS="-O0" and it passes all checks too:
>
> CFLAGS='-O0' ./configure --enable-debug --enable-cassert
> make clean && make -j8 && make check-world
>
> A stacktrace and a thorough description of your build environment
> would be helpful to understand why it breaks on your system.
>
>
> I ran configure using following set of flags:
>  ./configure --enable-tap-tests --enable-cassert --enable-debug
> --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
> And then ran make check. Here is the stacktrace:

Same procedure runs fine on my end.

> core file is quite big, so I didn't attach it to the mail. You can download 
> it here: core dump file.

Can you attach your binary as well? (it needs to be identical to be
able to inspect the core dump, and quite clearly my build is
different)

I'll keep looking for ways it could crash there, but being unable to
reproduce the crash is a big hindrance, so if you can send the binary
that could help speed things up.


On Tue, Dec 27, 2016 at 10:41 AM, Anastasia Lubennikova
 wrote:
> 1. prefetchBlkno = blkno & ~0x1f;
> prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
>
> I didn't get it what for we need these tricks. How does it differ from:
> prefetchBlkno = (blkno > 32) ? blkno - 32 : 0;

It makes all prefetches ranges of 32 blocks aligned to 32-block boundaries.

It helps since it's at 32 block boundaries that the truncate stops to
check for locking conflicts and abort, guaranteeing no prefetch will
be needless (if it goes into that code it *will* read the next 32
blocks).

> 2. Why do we decrease prefetchBlckno twice?
>
> Here:
> +prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
> And here:
> if (prefetchBlkno >= 32)
> +prefetchBlkno -= 32;

The first one is outside the loop, it's finding the first prefetch
range that is boundary aligned, taking care not to cause underflow.

The second one is inside the loop, it's moving the prefetch window
down as the truncate moves along. Since it's already aligned, it
doesn't need to be realigned, just clamped to zero if it happens to
reach the bottom.
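
A small standalone model of that window arithmetic (BlockNumber stubbed out,
names invented, not the patch code); it prints each 32-block prefetch range as
the backward scan crosses an aligned boundary:

#include <stdio.h>

typedef unsigned int BlockNumber;

#define PREFETCH_SIZE   ((BlockNumber) 32)

int
main(void)
{
    BlockNumber rel_pages = 100;        /* pretend relation size */
    BlockNumber nonempty_pages = 0;
    BlockNumber blkno = rel_pages;
    BlockNumber prefetchBlkno;

    /* outside the loop: first aligned window below the relation end */
    prefetchBlkno = blkno & ~(PREFETCH_SIZE - 1);   /* 100 -> 96 */
    prefetchBlkno = (prefetchBlkno > PREFETCH_SIZE) ?
        prefetchBlkno - PREFETCH_SIZE : 0;          /* 96 -> 64 */

    while (blkno > nonempty_pages)
    {
        blkno--;
        if (blkno <= prefetchBlkno)
        {
            /* inside the loop: slide the window down, clamping at zero */
            if (prefetchBlkno >= PREFETCH_SIZE)
                prefetchBlkno -= PREFETCH_SIZE;
            else
                prefetchBlkno = 0;
            if (prefetchBlkno < blkno)
                printf("at block %3u: prefetch [%u, %u)\n",
                       blkno, prefetchBlkno, blkno);
            /* the real code calls PrefetchBuffer() over that range */
        }
        /* the real code reads block blkno here */
    }
    return 0;
}

With rel_pages = 100 this prints the windows [32, 64) and [0, 32), always
aligned to the same 32-block boundaries at which the truncation loop checks
for conflicting lock requests.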




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-27 Thread Alvaro Herrera
Anastasia Lubennikova wrote:

> I ran configure using following set of flags:
>  ./configure --enable-tap-tests --enable-cassert --enable-debug
> --enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"
> And then ran make check. Here is the stacktrace:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360,
> vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
> 1417    tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);

This doesn't make sense, since the patch removes the "tupindex"
variable in that function.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-27 Thread Anastasia Lubennikova

23.12.2016 22:54, Claudio Freire:

On Fri, Dec 23, 2016 at 1:39 PM, Anastasia Lubennikova
  wrote:

I found the reason. I configure postgres with CFLAGS="-O0" and it causes
Segfault on initdb.
It works fine and passes tests with default configure flags, but I'm pretty
sure that we should fix segfault before testing the feature.
If you need it, I'll send a core dump.

I just ran it with CFLAGS="-O0" and it passes all checks too:

CFLAGS='-O0' ./configure --enable-debug --enable-cassert
make clean && make -j8 && make check-world

A stacktrace and a thorough description of your build environment
would be helpful to understand why it breaks on your system.


I ran configure using following set of flags:
 ./configure --enable-tap-tests --enable-cassert --enable-debug 
--enable-depend CFLAGS="-O0 -g3 -fno-omit-frame-pointer"

And then ran make check. Here is the stacktrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, 
vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
1417    tblk = ItemPointerGetBlockNumber(&seg->dead_tuples[tupindex]);

(gdb) bt
#0  0x006941e7 in lazy_vacuum_heap (onerel=0x1ec2360, 
vacrelstats=0x1ef6e00) at vacuumlazy.c:1417
#1  0x00693dfe in lazy_scan_heap (onerel=0x1ec2360, options=9, 
vacrelstats=0x1ef6e00, Irel=0x1ef7168, nindexes=2, aggressive=1 '\001')

at vacuumlazy.c:1337
#2  0x00691e66 in lazy_vacuum_rel (onerel=0x1ec2360, options=9, 
params=0x7ffe0f866310, bstrategy=0x1f1c4a8) at vacuumlazy.c:290
#3  0x0069191f in vacuum_rel (relid=1247, relation=0x0, 
options=9, params=0x7ffe0f866310) at vacuum.c:1418
#4  0x00690122 in vacuum (options=9, relation=0x0, relid=0, 
params=0x7ffe0f866310, va_cols=0x0, bstrategy=0x1f1c4a8,

isTopLevel=1 '\001') at vacuum.c:320
#5  0x0068fd0b in vacuum (options=-1652367447, relation=0x0, 
relid=3324614038, params=0x1f11bf0, va_cols=0xb59f63,

bstrategy=0x1f1c620, isTopLevel=0 '\000') at vacuum.c:150
#6  0x00852993 in standard_ProcessUtility (parsetree=0x1f07e60, 
queryString=0x1f07468 "VACUUM FREEZE;\n",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0xea5cc0 
, completionTag=0x7ffe0f866750 "") at utility.c:669
#7  0x008520da in standard_ProcessUtility 
(parsetree=0x401ef6cd8, queryString=0x18 address 0x18>,
context=PROCESS_UTILITY_TOPLEVEL, params=0x68, dest=0x9e5d62 
, completionTag=0x7ffe0f8663f0 "`~\360\001")

at utility.c:360
#8  0x00851161 in PortalRunMulti (portal=0x7ffe0f866750, 
isTopLevel=0 '\000', setHoldSnapshot=-39 '\331',
dest=0x851161 , altdest=0x7ffe0f8664f0, 
completionTag=0x1f07e60 "\341\002") at pquery.c:1219
#9  0x00851374 in PortalRunMulti (portal=0x1f0a488, isTopLevel=1 
'\001', setHoldSnapshot=0 '\000', dest=0xea5cc0 ,
altdest=0xea5cc0 , completionTag=0x7ffe0f866750 "") at 
pquery.c:1345
#10 0x00850889 in PortalRun (portal=0x1f0a488, 
count=9223372036854775807, isTopLevel=1 '\001', dest=0xea5cc0 ,
altdest=0xea5cc0 , completionTag=0x7ffe0f866750 "") at 
pquery.c:824
#11 0x0084a4dc in exec_simple_query (query_string=0x1f07468 
"VACUUM FREEZE;\n") at postgres.c:1113
#12 0x0084e960 in PostgresMain (argc=10, argv=0x1e60a50, 
dbname=0x1e823b0 "template1", username=0x1e672a0 "anastasia")

at postgres.c:4091
#13 0x006f967e in init_locale (categoryname=0x100 
,
category=32766, locale=0xa004692f0 address 0xa004692f0>) at main.c:310
#14 0x7f1e5f463830 in __libc_start_main (main=0x6f93e1 , 
argc=10, argv=0x7ffe0f866a78, init=,
fini=, rtld_fini=, 
stack_end=0x7ffe0f866a68) at ../csu/libc-start.c:291

#15 0x00469319 in _start ()

core file is quite big, so I didn't attach it to the mail. You can 
download it here: core dump file.


Here are some notes about the first patch:

1. prefetchBlkno = blkno & ~0x1f;
prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;

I didn't get it what for we need these tricks. How does it differ from:
prefetchBlkno = (blkno > 32) ? blkno - 32 : 0;

2. Why do we decrease prefetchBlckno twice?

Here:
+prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
And here:
if (prefetchBlkno >= 32)
+prefetchBlkno -= 32;


I'll inspect second patch in a few days and write questions about it.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-23 Thread Claudio Freire
On Fri, Dec 23, 2016 at 1:39 PM, Anastasia Lubennikova
 wrote:
>> On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire 
>> wrote:
>>>
>>> On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
>>>  wrote:

 The following review has been posted through the commitfest application:
 make installcheck-world:  tested, failed
 Implements feature:   not tested
 Spec compliant:   not tested
 Documentation:not tested

 Hi,
 I haven't read through the thread yet, just tried to apply the patch and
 run tests.
 And it seems that the last attached version is outdated now. It doesn't
 apply to the master
 and after I've tried to fix merge conflict, it segfaults at initdb.
>>>
>>>
>>> I'll rebase when I get some time to do it and post an updated version
>>
>> Attached rebased patches. I called them both v3 to be consistent.
>>
>> I'm not sure how you ran it, but this works fine for me:
>>
>> ./configure --enable-debug --enable-cassert
>> make clean
>> make check
>>
>> ... after a while ...
>>
>> ===
>>   All 168 tests passed.
>> ===
>>
>> I reckon the above is equivalent to installcheck, but doesn't require
>> sudo or actually installing the server, so installcheck should work
>> assuming the install went ok.
>
>
> I found the reason. I configure postgres with CFLAGS="-O0" and it causes
> Segfault on initdb.
> It works fine and passes tests with default configure flags, but I'm pretty
> sure that we should fix segfault before testing the feature.
> If you need it, I'll send a core dump.

I just ran it with CFLAGS="-O0" and it passes all checks too:

CFLAGS='-O0' ./configure --enable-debug --enable-cassert
make clean && make -j8 && make check-world

A stacktrace and a thorough description of your build environment
would be helpful to understand why it breaks on your system.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-23 Thread Anastasia Lubennikova

22.12.2016 21:18, Claudio Freire:

On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire  wrote:

On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
 wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hi,
I haven't read through the thread yet, just tried to apply the patch and run 
tests.
And it seems that the last attached version is outdated now. It doesn't apply 
to the master
and after I've tried to fix merge conflict, it segfaults at initdb.


I'll rebase when I get some time to do it and post an updated version

Attached rebased patches. I called them both v3 to be consistent.

I'm not sure how you ran it, but this works fine for me:

./configure --enable-debug --enable-cassert
make clean
make check

... after a while ...

===
  All 168 tests passed.
===

I reckon the above is equivalent to installcheck, but doesn't require
sudo or actually installing the server, so installcheck should work
assuming the install went ok.


I found the reason. I configure postgres with CFLAGS="-O0" and it causes 
Segfault on initdb.
It works fine and passes tests with default configure flags, but I'm 
pretty sure that we should fix segfault before testing the feature.

If you need it, I'll send a core dump.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-22 Thread Claudio Freire
On Thu, Dec 22, 2016 at 12:22 PM, Claudio Freire  wrote:
> On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
>  wrote:
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, failed
>> Implements feature:   not tested
>> Spec compliant:   not tested
>> Documentation:not tested
>>
>> Hi,
>> I haven't read through the thread yet, just tried to apply the patch and run 
>> tests.
>> And it seems that the last attached version is outdated now. It doesn't 
>> apply to the master
>> and after I've tried to fix merge conflict, it segfaults at initdb.
>
>
> I'll rebase when I get some time to do it and post an updated version

Attached rebased patches. I called them both v3 to be consistent.

I'm not sure how you ran it, but this works fine for me:

./configure --enable-debug --enable-cassert
make clean
make check

... after a while ...

===
 All 168 tests passed.
===

I reckon the above is equivalent to installcheck, but doesn't require
sudo or actually installing the server, so installcheck should work
assuming the install went ok.
From f7c6a46ea2d721e472237e00fd2404081a4ddc75 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Fri, 2 Sep 2016 23:33:48 -0300
Subject: [PATCH 1/2] Vacuum: prefetch buffers on backward scan

This patch speeds up truncation by prefetching buffers during
the backward scan phase just prior to truncation. This optimization
helps rotating media because disk rotation induces a buffer-to-buffer
latency of roughly a whole rotation when reading backwards; such
disks are usually optimized for forward scans. Prefetching buffers
in larger chunks ameliorates the issue considerably and is otherwise
harmless. The prefetch amount should also not be an issue on SSD,
which won't benefit from the optimization, but won't be hurt either.
---
 src/backend/commands/vacuumlazy.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..854bce3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1825,7 +1825,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 static BlockNumber
 count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 {
-	BlockNumber blkno;
+	BlockNumber blkno, prefetchBlkno;
 	instr_time	starttime;
 
 	/* Initialize the starttime if we check for conflicting lock requests */
@@ -1833,6 +1833,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 	/* Strange coding of loop control is needed because blkno is unsigned */
 	blkno = vacrelstats->rel_pages;
+	prefetchBlkno = blkno & ~0x1f;
+	prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
 	while (blkno > vacrelstats->nonempty_pages)
 	{
 		Buffer		buf;
@@ -1882,6 +1884,23 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 		blkno--;
 
+		/*
+		 * Speed up reads on rotating rust by prefetching a few blocks ahead.
+		 * Backward scans on rotary disks are slow if done one buffer at a time
+		 * because readahead won't kick in on most OSes, and each buffer will
+		 * have to wait for the platter to do a full rotation.
+		 * Should be harmless on SSD.
+		 */
+		if (blkno <= prefetchBlkno) {
+			BlockNumber pblkno;
+			if (prefetchBlkno >= 32)
+prefetchBlkno -= 32;
+			else
+prefetchBlkno = 0;
+			for (pblkno = prefetchBlkno; pblkno < blkno; pblkno++)
+PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
+		}
+
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
  RBM_NORMAL, vac_strategy);
 
-- 
1.8.4.5

From 084805d565dced136903262cfad4fd395468a9bb Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 2/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c | 295 +-
 1 file changed, 230 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 854bce3..dfb0612 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -91,6 +91,7 @@
  * tables.
  */
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
 
 /*
  * Before we consider skipping a page that's marked as clean in
@@ -98,6 +99,27 @@
  */
 #define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	int			num_dead_tuples;	/* # of entries in the 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-22 Thread Claudio Freire
On Thu, Dec 22, 2016 at 12:15 PM, Anastasia Lubennikova
 wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, failed
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:not tested
>
> Hi,
> I haven't read through the thread yet, just tried to apply the patch and run 
> tests.
> And it seems that the last attached version is outdated now. It doesn't apply 
> to the master
> and after I've tried to fix merge conflict, it segfaults at initdb.


I'll rebase when I get some time to do it and post an updated version




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-22 Thread Anastasia Lubennikova
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hi,
I haven't read through the thread yet, just tried to apply the patch and run 
tests.
And it seems that the last attached version is outdated now. It doesn't apply 
to the master
and after I've tried to fix merge conflict, it segfaults at initdb. 

So, looking forward to a new version for review.

The new status of this patch is: Waiting on Author



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-12-04 Thread Haribabu Kommi
On Tue, Nov 22, 2016 at 4:53 AM, Claudio Freire 
wrote:

> On Mon, Nov 21, 2016 at 2:15 PM, Masahiko Sawada 
> wrote:
> > On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire 
> wrote:
> >> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas 
> wrote:
> >>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire <
> klaussfre...@gmail.com> wrote:
>  Attached is patch 0002 with pgindent applied over it
> 
>  I don't think there's any other formatting issue, but feel free to
>  point a finger to it if I missed any
> >>>
> >>> Hmm, I had imagined making all of the segments the same size rather
> >>> than having the size grow exponentially.  The whole point of this is
> >>> to save memory, and even in the worst case you don't end up with that
> >>> many segments as long as you pick a reasonable base size (e.g. 64MB).
> >>
> >> Wastage is bound by a fraction of the total required RAM, that is,
> >> it's proportional to the amount of required RAM, not the amount
> >> allocated. So it should still be fine, and the exponential strategy
> >> should improve lookup performance considerably.
> >
> > I'm concerned that it could use repalloc for large memory area when
> > vacrelstats->dead_tuples.dead_tuple is bloated. It would be overhead
> > and slow.
>
> How large?
>
> That array cannot be very large. It contains pointers to
> exponentially-growing arrays, but the repalloc'd array itself is
> small: one struct per segment, each segment starts at 128MB and grows
> exponentially.
>
> In fact, IIRC, it can be proven that such a repalloc strategy has an
> amortized cost of O(log log n) per tuple. If it repallocd the whole
> tid array it would be O(log n), but since it handles only pointers to
> segments of exponentially growing tuples it becomes O(log log n).
>
> Furthermore, n there is still limited to MAX_INT, which means the cost
> per tuple is bound by O(log log 2^32) = 5. And that's an absolute
> worst case that's ignoring the 128MB constant factor which is indeed
> relevant.
>
> > What about using semi-fixed memory space without repalloc;
> > Allocate the array of ItemPointerData array, and each ItemPointerData
> > array stores the dead tuple locations. The size of ItemPointerData
> > array starts with small size (e.g. 16MB or 32MB). After we used an
> > array up, we then allocate next segment with twice size as previous
> > segment.
>
> That's what the patch does.
>
> > But to prevent over allocating memory, it would be better to
> > set a limit of allocating size of ItemPointerData array to 512MB or
> > 1GB.
>
> There already is a limit to 1GB (the maximum amount palloc can allocate)
>
>
Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-21 Thread Claudio Freire
On Mon, Nov 21, 2016 at 2:15 PM, Masahiko Sawada  wrote:
> On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire  
> wrote:
>> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas  wrote:
>>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire  
>>> wrote:
 Attached is patch 0002 with pgindent applied over it

 I don't think there's any other formatting issue, but feel free to
 point a finger to it if I missed any
>>>
>>> Hmm, I had imagined making all of the segments the same size rather
>>> than having the size grow exponentially.  The whole point of this is
>>> to save memory, and even in the worst case you don't end up with that
>>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>>
>> Wastage is bound by a fraction of the total required RAM, that is,
>> it's proportional to the amount of required RAM, not the amount
>> allocated. So it should still be fine, and the exponential strategy
>> should improve lookup performance considerably.
>
> I'm concerned that it could use repalloc for large memory area when
> vacrelstats->dead_tuples.dead_tuple is bloated. It would be overhead
> and slow.

How large?

That array cannot be very large. It contains pointers to
exponentially-growing arrays, but the repalloc'd array itself is
small: one struct per segment, each segment starts at 128MB and grows
exponentially.

In fact, IIRC, it can be proven that such a repalloc strategy has an
amortized cost of O(log log n) per tuple. If it repallocd the whole
tid array it would be O(log n), but since it handles only pointers to
segments of exponentially growing tuples it becomes O(log log n).

Furthermore, n there is still limited to MAX_INT, which means the cost
per tuple is bound by O(log log 2^32) = 5. And that's an absolute
worst case that's ignoring the 128MB constant factor which is indeed
relevant.
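
A quick standalone model of that (using the 128MB starting segment and 6-byte
ItemPointerData from the patch, ignoring the maintenance_work_mem and
1GB-per-palloc caps; illustrative only, not the patch code):

#include <limits.h>
#include <stdio.h>

int
main(void)
{
    const long long tid_size = 6;   /* sizeof(ItemPointerData) */
    long long   seg_tuples = (128LL << 20) / tid_size;  /* first segment */
    long long   covered = 0;
    int         nsegs = 0;

    while (covered < INT_MAX)
    {
        covered += seg_tuples;
        nsegs++;
        printf("segment %d holds %lld TIDs; cumulative capacity %lld\n",
               nsegs, seg_tuples, covered);
        seg_tuples *= 2;            /* exponential growth */
    }
    printf("segments needed to cover INT_MAX TIDs: %d\n", nsegs);
    return 0;
}

It takes only 7 doubling segments to cover INT_MAX TIDs, so the directory
that actually gets repalloc'd never holds more than a handful of entries.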

> What about using semi-fixed memory space without repalloc;
> Allocate the array of ItemPointerData array, and each ItemPointerData
> array stores the dead tuple locations. The size of ItemPointerData
> array starts with small size (e.g. 16MB or 32MB). After we used an
> array up, we then allocate next segment with twice size as previous
> segment.

That's what the patch does.

> But to prevent over allocating memory, it would be better to
> set a limit of allocating size of ItemPointerData array to 512MB or
> 1GB.

There already is a limit to 1GB (the maximum amount palloc can allocate)




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-21 Thread Masahiko Sawada
On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire  wrote:
> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas  wrote:
>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire  
>> wrote:
>>> Attached is patch 0002 with pgindent applied over it
>>>
>>> I don't think there's any other formatting issue, but feel free to
>>> point a finger to it if I missed any
>>
>> Hmm, I had imagined making all of the segments the same size rather
>> than having the size grow exponentially.  The whole point of this is
>> to save memory, and even in the worst case you don't end up with that
>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>
> Wastage is bound by a fraction of the total required RAM, that is,
> it's proportional to the amount of required RAM, not the amount
> allocated. So it should still be fine, and the exponential strategy
> should improve lookup performance considerably.

I'm concerned that it could use repalloc for large memory area when
vacrelstats->dead_tuples.dead_tuple is bloated. It would be overhead
and slow. What about using semi-fixed memory space without repalloc;
Allocate the array of ItemPointerData array, and each ItemPointerData
array stores the dead tuple locations. The size of ItemPointerData
array starts with small size (e.g. 16MB or 32MB). After we used an
array up, we then allocate next segment with twice size as previous
segment. But to prevent over allocating memory, it would be better to
set a limit of allocating size of ItemPointerData array to 512MB or
1GB. We could expand the array of arrays using repalloc if needed, but
that would not happen often. The rest of the design is similar to your
current patch; the offset into the array of arrays and the offset within
an ItemPointerData array track the current location, and both are cleared
after garbage has been reclaimed from the heap and indexes. The allocated
arrays are re-used by subsequent heap scans.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Claudio Freire
On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas  wrote:
> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire  
> wrote:
>> Attached is patch 0002 with pgindent applied over it
>>
>> I don't think there's any other formatting issue, but feel free to
>> point a finger to it if I missed any
>
> Hmm, I had imagined making all of the segments the same size rather
> than having the size grow exponentially.  The whole point of this is
> to save memory, and even in the worst case you don't end up with that
> many segments as long as you pick a reasonable base size (e.g. 64MB).

Wastage is bound by a fraction of the total required RAM, that is,
it's proportional to the amount of required RAM, not the amount
allocated. So it should still be fine, and the exponential strategy
should improve lookup performance considerably.
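
(For scale, assuming 6-byte TIDs: covering INT_MAX dead tuples is roughly
12GB of TID data, which would take about 192 fixed 64MB segments but only 7
doubling segments starting at 128MB, so the outer search over segments stays
very short.)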




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Robert Haas
On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire  wrote:
> Attached is patch 0002 with pgindent applied over it
>
> I don't think there's any other formatting issue, but feel free to
> point a finger to it if I missed any

Hmm, I had imagined making all of the segments the same size rather
than having the size grow exponentially.  The whole point of this is
to save memory, and even in the worst case you don't end up with that
many segments as long as you pick a reasonable base size (e.g. 64MB).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Claudio Freire
On Thu, Nov 17, 2016 at 2:51 PM, Claudio Freire  wrote:
> On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada  
> wrote:
>> I glanced at the patches but the both patches don't obey the coding
>> style of PostgreSQL.
>> Please refer to [1].
>>
>> [1] 
>> http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F.
>
> I thought I had. I'll go through that list to check what I missed.

Attached is patch 0002 with pgindent applied over it

I don't think there's any other formatting issue, but feel free to
point a finger to it if I missed any
From b9782cef0d971de341115bd3474d192419674219 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 2/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c | 295 +-
 1 file changed, 230 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 854bce3..dfb0612 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -91,6 +91,7 @@
  * tables.
  */
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
 
 /*
  * Before we consider skipping a page that's marked as clean in
@@ -98,6 +99,27 @@
  */
 #define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated) */
+	unsigned short padding;
+	ItemPointer dead_tuples;	/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+ * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dead_tuples;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -117,14 +139,14 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	/* List of TIDs of tuples we intend to delete */
 	/* NB: this list is ordered by TID address */
-	int			num_dead_tuples;	/* current # of entries */
-	int			max_dead_tuples;	/* # slots allocated in array */
-	ItemPointer dead_tuples;	/* array of ItemPointerData */
+	DeadTuplesMultiArray dead_tuples;
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 } LVRelStats;
 
+#define DeadTuplesCurrentSegment(lvrelstats) (&((lvrelstats)->dead_tuples.dead_tuples[ \
+	(lvrelstats)->dead_tuples.last_seg ]))
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -149,7 +171,7 @@ static void lazy_cleanup_index(Relation indrel,
    IndexBulkDeleteResult *stats,
    LVRelStats *vacrelstats);
 static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
- int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+ int tupindex, LVRelStats *vacrelstats, DeadTuplesSegment * seg, Buffer *vmbuffer);
 static bool should_attempt_truncation(LVRelStats *vacrelstats);
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
@@ -157,8 +179,9 @@ static BlockNumber count_nondeletable_pages(Relation onerel,
 static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 	   ItemPointer itemptr);
+static void lazy_clear_dead_tuples(LVRelStats *vacrelstats);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
-static int	vac_cmp_itemptr(const void *left, const void *right);
+static inline int vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(Relation rel, Buffer buf,
 	 TransactionId *visibility_cutoff_xid, bool *all_frozen);
 
@@ -498,7 +521,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	/* Report that we're scanning the heap, advertising total # of blocks */
 	initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
 	initprog_val[1] = nblocks;
-	initprog_val[2] = vacrelstats->max_dead_tuples;
+	initprog_val[2] = 
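
For readers following the patch excerpt above, here is a minimal, hypothetical
sketch (simplified standalone types, not the patch's actual code) of the lookup
the DeadTuplesMultiArray structure enables: a binary search over segments,
using each segment's last dead TID as an upper bound, followed by a binary
search within the chosen segment.

/* Hypothetical sketch only -- simplified types, not the patch's code. */
typedef struct { unsigned int block; unsigned short offset; } TidLite;

typedef struct
{
	int		 num_dead_tuples;	/* entries used in this segment */
	TidLite	 last_dead_tuple;	/* highest TID stored in the segment */
	TidLite *dead_tuples;		/* sorted array of dead TIDs */
} Segment;

typedef struct
{
	int		 last_seg;			/* index of the last segment holding data */
	Segment *segs;
} MultiArray;

static int
tid_cmp(const TidLite *a, const TidLite *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/* Returns 1 if the TID is recorded as dead, 0 otherwise. */
static int
tid_is_dead(const MultiArray *ma, const TidLite *t)
{
	int		lo = 0,
			hi = ma->last_seg;
	const Segment *seg;
	int		l,
			r;

	/* First level: find the first segment whose last TID is >= t. */
	while (lo < hi)
	{
		int		mid = (lo + hi) / 2;

		if (tid_cmp(&ma->segs[mid].last_dead_tuple, t) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* Second level: plain binary search inside that segment. */
	seg = &ma->segs[lo];
	l = 0;
	r = seg->num_dead_tuples - 1;
	while (l <= r)
	{
		int		m = (l + r) / 2;
		int		c = tid_cmp(&seg->dead_tuples[m], t);

		if (c == 0)
			return 1;
		if (c < 0)
			l = m + 1;
		else
			r = m - 1;
	}
	return 0;
}

Because segment sizes grow exponentially, the outer search only ever touches a
handful of segments, so the total cost stays close to a single binary search
over one big array.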

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Claudio Freire
On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada  wrote:
> I glanced at the patches, but neither patch follows the coding
> style of PostgreSQL.
> Please refer to [1].
>
> [1] 
> http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F.

I thought I had. I'll go through that list to check what I missed.

>> In the results archive, the .vacuum prefix is the patched version with
>> both patch 1 and 2, .git.ref is just patch 1 (without which the
>> truncate takes unholily long).
>
> Did you measure the performance benefit of 0001 patch by comparing
> HEAD with HEAD+0001 patch?

Not the whole test, but yes. Without the 0001 patch the backward scan
step during truncate goes between 3 and 5 times slower. I don't have
timings because the test never finished without the patch. It would
have finished, but it would have taken about a day.

>> Grepping the results a bit, picking an average run out of all runs on
>> each scale:
>>
>> Timings:
>>
>> Patched:
>>
>> s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
>> s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
>> s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.
>>
>> Unpatched:
>>
>> s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
>> s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
>> s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.
>>
>> Total I/O (in MB)
>>
>> Patched:
>>
>> s100: R:2.4 - W:5862
>> s400: R:1337.4 - W:29385.6
>> s4000: R:318631 - W:370154
>>
>> Unpatched:
>>
>> s100: R:1412.4 - W:7644.6
>> s400: R:3180.6 - W:36281.4
>> s4000: R:330683 - W:370391
>>
>>
>> So, in essence, CPU time didn't get adversely affected. If anything,
>> it got improved by about 20% on the biggest case (scale 4000).
>
> And this test case deletes all tuples in the relation and then measures
> the duration of vacuum.
> It would not have much effect in a practical use case.

Well, this patch set started because I had to do exactly that, and
realized just how inefficient vacuum was in that case.

But it doesn't mean it won't benefit more realistic use cases. Almost
any big database ends up hitting this 1GB limit because big tables
take very long to vacuum and accumulate a lot of bloat in-between
vacuums.

If you have a specific test case in mind I can try to run it.

>> While total runtime didn't change much, I believe this is only due to
>> the fact that the index is perfectly correlated (clustered?) since
>> it's a pristine index, so index scans either remove or skip full
>> pages, never leaving things half-way. A bloated index would probably
>> show a substantially different behavior, I'll try to get a script that
>> does it by running pgbench a while before the vacuum or something like
>> that.
>>
>> However, the total I/O metric already shows remarkable improvement.
>> This metric is measuring all the I/O including pgbench init, the
>> initial vacuum pgbench init does, the delete and the final vacuum. So
>> it's not just the I/O for the vacuum itself, but the whole run. We can
>> see the patched version reading a lot less (less scans over the
>> indexes will do that), and in some cases writing less too (again, less
>> index scans may be performing less redundant writes when cleaning
>> up/reclaiming index pages).
>
> What value of maintenance_work_mem did you use for this test?

4GB on both, patched and HEAD.

> Since the DeadTuplesSegment struct still stores an array of ItemPointerData
> (6 bytes) representing dead tuples, I supposed that the frequency of index
> vacuums would not change. But according to the test result, an index vacuum
> is invoked once and removes 4 rows at a time. It means that the vacuum used
> about 2289 MB of memory during the heap vacuum. On the other hand, the
> result of the test without the 0002 patch shows that an index vacuum removes
> 178956737 rows at a time, which means 1GB of memory was used.

1GB is a hardcoded limit on HEAD for vacuum work mem. This patch
removes that limit and lets vacuum use all the memory the user gave to
vacuum.

In the test, in both cases, 4GB was used as maintenance_work_mem
value, but HEAD cannot use all the 4GB, and it will limit itself to
just 1GB.
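
To make those numbers concrete: the cap on HEAD comes from the dead-tuple
array being a single palloc'd allocation, which cannot exceed MaxAllocSize
(just under 1GB). A small standalone back-of-the-envelope check (illustrative
only; the constants below are simply the 6-byte ItemPointerData size and the
~1GB MaxAllocSize value, not code from the tree):

/* Illustrative arithmetic only, not PostgreSQL code. */
#include <stdio.h>

int
main(void)
{
	const long long tid_size = 6;				/* sizeof(ItemPointerData) */
	const long long max_alloc = 0x3fffffffLL;	/* MaxAllocSize, ~1GB */
	const long long m_w_m = 4LL * 1024 * 1024 * 1024;	/* 4GB, as in the test */

	printf("TIDs 4GB could hold: %lld\n", m_w_m / tid_size);		/* ~715 million */
	printf("TIDs HEAD can hold:  %lld\n", max_alloc / tid_size);	/* ~179 million, i.e. ~1GB */
	return 0;
}

That ~179 million figure is what shows up in the unpatched run as the number
of dead tuples removed per index vacuum cycle.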




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Masahiko Sawada
On Thu, Oct 27, 2016 at 5:25 AM, Claudio Freire  wrote:
> On Thu, Sep 15, 2016 at 1:16 PM, Claudio Freire  
> wrote:
>> On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire  
>> wrote:
>>> On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas  wrote:

>>>> I am kind of doubtful about this whole line of investigation because
>>>> we're basically trying pretty hard to fix something that I'm not sure
>>>> is broken.  I do agree that, all other things being equal, the TID
>>>> lookups will probably be faster with a bitmap than with a binary
>>>> search, but maybe not if the table is large and the number of dead
>>>> TIDs is small, because cache efficiency is pretty important.  But even
>>>> if it's always faster, does TID lookup speed even really matter to
>>>> overall VACUUM performance? Claudio's early results suggest that it
>>>> might, but maybe that's just a question of some optimization that
>>>> hasn't been done yet.
>>>
>>> FYI, the reported impact was on CPU time, not runtime. There was no
>>> significant difference in runtime (real time), because my test is
>>> heavily I/O bound.
>>>
>>> I tested with a few small tables and there was no significant
>>> difference either, but small tables don't stress the array lookup
>>> anyway so that's expected.
>>>
>>> But on the assumption that some systems may be CPU bound during vacuum
>>> (particularly those able to do more than 300-400MB/s sequential I/O),
>>> in those cases the increased or decreased cost of lazy_tid_reaped will
>>> directly correlate to runtime. It's just none of my systems, which all
>>> run on Amazon and are heavily bandwidth constrained (the fastest I/O
>>> subsystem I can get my hands on does 200MB/s).
>>
>> Attached is the patch with the multiarray version.
>>
>> The tests are weird. Best case comparison over several runs, to remove
>> the impact of concurrent activity on this host (I couldn't remove all
>> background activity even when running the tests overnight, the distro
>> adds tons of crons and background cleanup tasks it would seem),
>> there's only very mild CPU impact. I'd say insignificant, as it's well
>> below the mean variance.
>
> I reran the tests on a really dedicated system, and with a script that
> captured a lot more details about the runs.
>
> The system isn't impressive, an i5 with a single consumer HD and 8GB
> RAM, but it did the job.
>
> These tests make more sense, so I bet it was the previous tests that
> were spoiled by concurrent activity on the host.
>
> Attached is the raw output of the test, the script used to create it,
> and just in case the patch set used. I believe it's the same as the
> last one I posted, just rebased.

I glanced at the patches, but neither patch follows the coding
style of PostgreSQL.
Please refer to [1].

[1] 
http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F.

>
> In the results archive, the .vacuum prefix is the patched version with
> both patch 1 and 2, .git.ref is just patch 1 (without which the
> truncate takes unholily long).

Did you measure the performance benefit of 0001 patch by comparing
HEAD with HEAD+0001 patch?

> Grepping the results a bit, picking an average run out of all runs on
> each scale:
>
> Timings:
>
> Patched:
>
> s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
> s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
> s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.
>
> Unpatched:
>
> s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
> s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
> s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.
>
> Total I/O (in MB)
>
> Patched:
>
> s100: R:2.4 - W:5862
> s400: R:1337.4 - W:29385.6
> s4000: R:318631 - W:370154
>
> Unpatched:
>
> s100: R:1412.4 - W:7644.6
> s400: R:3180.6 - W:36281.4
> s4000: R:330683 - W:370391
>
>
> So, in essence, CPU time didn't get adversely affected. If anything,
> it got improved by about 20% on the biggest case (scale 4000).

And this test case deletes all tuples in the relation and then measures
the duration of vacuum.
It would not have much effect in a practical use case.

> While total runtime didn't change much, I believe this is only due to
> the fact that the index is perfectly correlated (clustered?) since
> it's a pristine index, so index scans either remove or skip full
> pages, never leaving things half-way. A bloated index would probably
> show a substantially different behavior, I'll try to get a script that
> does it by running pgbench a while before the vacuum or something like
> that.
>
> However, the total I/O metric already shows remarkable improvement.
> This metric is measuring all the I/O including pgbench init, the
> initial vacuum pgbench init does, the delete and the final vacuum. So
> it's not just the I/O for the vacuum itself, but the whole run. We can
> see 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-16 Thread Robert Haas
On Fri, Sep 16, 2016 at 9:47 AM, Pavan Deolasee
 wrote:
> On Fri, Sep 16, 2016 at 7:03 PM, Robert Haas  wrote:
>> On Thu, Sep 15, 2016 at 11:39 PM, Pavan Deolasee
>>  wrote:
>> > But I actually wonder if we are over engineering things and
>> > overestimating
>> > cost of memmove etc. How about this simpler approach:
>>
>> Don't forget that you need to handle the case where
>> maintenance_work_mem is quite small.
>
> How small? The default IIRC these days is 64MB and minimum is 1MB. I think
> we can do some special casing for very small values and ensure that things
> at the very least work and hopefully don't regress for them.

Sounds like you need to handle values as small as 1MB, then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-16 Thread Pavan Deolasee
On Fri, Sep 16, 2016 at 7:03 PM, Robert Haas  wrote:

> On Thu, Sep 15, 2016 at 11:39 PM, Pavan Deolasee
>  wrote:
> > But I actually wonder if we are over engineering things and
> overestimating
> > cost of memmove etc. How about this simpler approach:
>
> Don't forget that you need to handle the case where
> maintenance_work_mem is quite small.
>
>
How small? The default IIRC these days is 64MB and minimum is 1MB. I think
we can do some special casing for very small values and ensure that things
at the very least work and hopefully don't regress for them.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-16 Thread Robert Haas
On Thu, Sep 15, 2016 at 11:39 PM, Pavan Deolasee
 wrote:
> But I actually wonder if we are over engineering things and overestimating
> cost of memmove etc. How about this simpler approach:

Don't forget that you need to handle the case where
maintenance_work_mem is quite small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Pavan Deolasee
On Fri, Sep 16, 2016 at 9:09 AM, Pavan Deolasee 
wrote:

>
> I also realised that we can compact the TID array in step (b) above
> because we only need to store 17 bits for block numbers (we already know
> which 1GB segment they belong to). Given that usable offsets are also just
> 13 bits, TID array needs only 4 bytes per TID instead of 6.
>
>
Actually, this seems like a clear savings of at least 30% for all use cases,
at the cost of allocating in smaller chunks and doing some transformations.
But given the problem we are trying to solve, it seems like a small price to
pay.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Pavan Deolasee
On Fri, Sep 16, 2016 at 12:24 AM, Claudio Freire 
wrote:

> On Thu, Sep 15, 2016 at 3:48 PM, Tomas Vondra
>  wrote:
> > For example, we always allocate the TID array as large as we can fit into
> > m_w_m, but maybe we don't need to wait with switching to the bitmap until
> > filling the whole array - we could wait as long as the bitmap fits into
> the
> > remaining part of the array, build it there and then copy it to the
> > beginning (and use the bitmap from that point).
>
> The bitmap can be created like that, but grow from the end of the
> segment backwards.
>
> So the scan can proceed until the bitmap fills the whole segment
> (filling backwards), no copy required.
>

I thought about those approaches when I suggested starting with half m_w_m.
So you fill in TIDs from one end and, upon reaching the halfway point, convert
that into a bitmap (assuming stats tell you that there are significant savings
with bitmaps) but fill it from the other end. Then reset the TID array and
start filling it up again. That guarantees that you can always work within the
available limit.

But I actually wonder if we are over engineering things and overestimating
cost of memmove etc. How about this simpler approach:

1. Divide the table into fixed-size chunks, like Robert suggested. Say 1GB.
2. Allocate a pointer array to store a pointer to each segment. For a 1TB
table, that's about 8192 bytes.
3. Allocate a bitmap which can hold MaxHeapTuplesPerPage * chunk size in
pages. For an 8K block size and a 1GB chunk, that's about 4.6MB. *Note*: I'm
suggesting to use a bitmap here because, provisioning for the worst case, a
fixed-size TID array will cost us 200MB+ whereas a bitmap is just 4.6MB.
4. Collect dead tuples in that 1GB chunk. Also collect stats so that we
know the most optimal representation.
5. At the end of the 1GB scan, if no dead tuples were found, set the chunk
pointer to NULL, move to the next chunk and restart step 4. If dead tuples
were found, then check:
   a. Whether the bitmap can be further compressed by using fewer bits per
page. If so, allocate a new bitmap and compress the bitmap.
   b. Whether a TID array would be a more compact representation. If so,
allocate a TID array of the right size and convert the bitmap into an array.
   c. Set the chunk pointer to whichever representation we choose (of course
add headers etc. to interpret the representation).
6. Continue until we consume all of m_w_m or end-of-table is reached. If we
consume all of m_w_m, then do a round of index cleanup and restart.

I also realised that we can compact the TID array in step (b) above because
we only need to store 17 bits for block numbers (we already know which 1GB
segment they belong to). Given that usable offsets are also just 13 bits,
TID array needs only 4 bytes per TID instead of 6.
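
As a rough illustration of that 4-bytes-per-TID packing (hypothetical helpers,
not part of any posted patch; assumes 8K blocks, so a 1GB chunk spans 2^17
blocks, and the 13 usable offset bits mentioned above):

/* Hypothetical packing of a TID relative to its 1GB chunk into 32 bits. */
#include <stdint.h>

typedef uint32_t PackedTid;

static inline PackedTid
pack_tid(uint32_t block_in_chunk, uint16_t offset)
{
	/* block_in_chunk < 2^17 (131072 blocks per 1GB chunk), offset < 2^13 */
	return (block_in_chunk << 13) | (offset & 0x1FFF);
}

static inline uint32_t
packed_block(PackedTid p)
{
	return p >> 13;
}

static inline uint16_t
packed_offset(PackedTid p)
{
	return (uint16_t) (p & 0x1FFF);
}

The chunk number itself is implicit in which per-chunk array the entry lives
in, which is where the savings over a full 6-byte ItemPointerData come from.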

Many people are working on implementing different ideas, and I can
volunteer to write a patch on these lines unless someone wants to do that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Claudio Freire
On Thu, Sep 15, 2016 at 3:48 PM, Tomas Vondra
 wrote:
> For example, we always allocate the TID array as large as we can fit into
> m_w_m, but maybe we don't need to wait with switching to the bitmap until
> filling the whole array - we could wait as long as the bitmap fits into the
> remaining part of the array, build it there and then copy it to the
> beginning (and use the bitmap from that point).

The bitmap can be created like that, but grow from the end of the
segment backwards.

So the scan can proceed until the bitmap fills the whole segment
(filling backwards), no copy required.

I'll try that later, but first I'd like to get multiarray approach
right since that's the basis of it.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Tomas Vondra



On 09/15/2016 06:40 PM, Robert Haas wrote:
> On Thu, Sep 15, 2016 at 12:22 PM, Tom Lane  wrote:
>> Tomas Vondra  writes:
>>> On 09/14/2016 07:57 PM, Tom Lane wrote:
>>>> People who are vacuuming because they are out of disk space will be very
>>>> very unhappy with that solution.
>>>
>>> The people are usually running out of space for data, while these files
>>> would be temporary files placed wherever temp_tablespaces points to. I'd
>>> argue if this is a source of problems, the people are already in deep
>>> trouble due to sorts, CREATE INDEX, ... as those commands may also
>>> generate a lot of temporary files.
>>
>> Except that if you are trying to recover disk space, VACUUM is what you
>> are doing, not CREATE INDEX.  Requiring extra disk space to perform a
>> vacuum successfully is exactly the wrong direction to be going in.
>> See for example this current commitfest entry:
>> https://commitfest.postgresql.org/10/649/
>> Regardless of what you think of the merits of that patch, it's trying
>> to solve a real-world problem.  And as Robert has already pointed out,
>> making this aspect of VACUUM more complicated is not solving any
>> pressing problem.  "But we made it faster" is going to be a poor answer
>> for the next person who finds themselves up against the wall with no
>> recourse.
>
> I very much agree.



How does VACUUM alone help with recovering disk space? AFAIK it only 
makes the space available for new data, it does not reclaim the disk 
space at all. Sure, we truncate empty pages at the end of the last 
segment, but how likely is that in practice? What I do see people doing 
is usually either VACUUM FULL (which is however doomed for obvious 
reasons) or VACUUM + reindexing to get rid of index bloat (which however 
leads to CREATE INDEX using temporary files).


I'm not sure I agree with your claim there's no pressing problem. We do 
see quite a few people having to do VACUUM with multiple index scans 
(because the TIDs don't fit into m_w_m), which certainly has significant 
impact on production systems - both in terms of performance and it also 
slows down reclaiming the space. Sure, being able to set m_w_m above 1GB 
is an improvement, but perhaps using a more efficient TID storage would 
improve the situation further. Writing the TIDs to a temporary file may 
not the right approach, but I don't see why that would make the original 
problem less severe?


For example, we always allocate the TID array as large as we can fit 
into m_w_m, but maybe we don't need to wait with switching to the bitmap 
until filling the whole array - we could wait as long as the bitmap fits 
into the remaining part of the array, build it there and then copy it to 
the beginning (and use the bitmap from that point).


regards
Tomas




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Robert Haas
On Thu, Sep 15, 2016 at 12:22 PM, Tom Lane  wrote:
> Tomas Vondra  writes:
>> On 09/14/2016 07:57 PM, Tom Lane wrote:
>>> People who are vacuuming because they are out of disk space will be very
>>> very unhappy with that solution.
>
>> The people are usually running out of space for data, while these files
>> would be temporary files placed wherever temp_tablespaces points to. I'd
>> argue if this is a source of problems, the people are already in deep
>> trouble due to sorts, CREATE INDEX, ... as those commands may also
>> generate a lot of temporary files.
>
> Except that if you are trying to recover disk space, VACUUM is what you
> are doing, not CREATE INDEX.  Requiring extra disk space to perform a
> vacuum successfully is exactly the wrong direction to be going in.
> See for example this current commitfest entry:
> https://commitfest.postgresql.org/10/649/
> Regardless of what you think of the merits of that patch, it's trying
> to solve a real-world problem.  And as Robert has already pointed out,
> making this aspect of VACUUM more complicated is not solving any
> pressing problem.  "But we made it faster" is going to be a poor answer
> for the next person who finds themselves up against the wall with no
> recourse.

I very much agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Tom Lane
Tomas Vondra  writes:
> On 09/14/2016 07:57 PM, Tom Lane wrote:
>> People who are vacuuming because they are out of disk space will be very
>> very unhappy with that solution.

> The people are usually running out of space for data, while these files 
> would be temporary files placed wherever temp_tablespaces points to. I'd 
> argue if this is a source of problems, the people are already in deep 
> trouble due to sorts, CREATE INDEX, ... as those commands may also 
> generate a lot of temporary files.

Except that if you are trying to recover disk space, VACUUM is what you
are doing, not CREATE INDEX.  Requiring extra disk space to perform a
vacuum successfully is exactly the wrong direction to be going in.
See for example this current commitfest entry:
https://commitfest.postgresql.org/10/649/
Regardless of what you think of the merits of that patch, it's trying
to solve a real-world problem.  And as Robert has already pointed out,
making this aspect of VACUUM more complicated is not solving any
pressing problem.  "But we made it faster" is going to be a poor answer
for the next person who finds themselves up against the wall with no
recourse.

regards, tom lane




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Claudio Freire
On Thu, Sep 15, 2016 at 12:50 PM, Tomas Vondra
 wrote:
> On 09/14/2016 07:57 PM, Tom Lane wrote:
>>
>> Pavan Deolasee  writes:
>>>
>>> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera
>>> 
>>> wrote:

>>>> One thing not quite clear to me is how do we create the bitmap
>>>> representation starting from the array representation in midflight
>>>> without using twice as much memory transiently.  Are we going to write
>>>> the array to a temp file, free the array memory, then fill the bitmap by
>>>> reading the array from disk?
>>
>>
>>> We could do that.
>>
>>
>> People who are vacuuming because they are out of disk space will be very
>> very unhappy with that solution.
>
>
> The people are usually running out of space for data, while these files
> would be temporary files placed wherever temp_tablespaces points to. I'd
> argue if this is a source of problems, the people are already in deep
> trouble due to sorts, CREATE INDEX, ... as those commands may also generate
> a lot of temporary files.

One would not expect "CREATE INDEX" to succeed when space is tight,
but VACUUM is quite the opposite.

Still, temporary storage could be used if available, and gracefully
fall back to some other technique (like not using bitmaps) when not.

Not sure it's worth the trouble, though.


On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire  wrote:
> On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas  wrote:
>>
>> I am kind of doubtful about this whole line of investigation because
>> we're basically trying pretty hard to fix something that I'm not sure
>> is broken.  I do agree that, all other things being equal, the TID
>> lookups will probably be faster with a bitmap than with a binary
>> search, but maybe not if the table is large and the number of dead
>> TIDs is small, because cache efficiency is pretty important.  But even
>> if it's always faster, does TID lookup speed even really matter to
>> overall VACUUM performance? Claudio's early results suggest that it
>> might, but maybe that's just a question of some optimization that
>> hasn't been done yet.
>
> FYI, the reported impact was on CPU time, not runtime. There was no
> significant difference in runtime (real time), because my test is
> heavily I/O bound.
>
> I tested with a few small tables and there was no significant
> difference either, but small tables don't stress the array lookup
> anyway so that's expected.
>
> But on the assumption that some systems may be CPU bound during vacuum
> (particularly those able to do more than 300-400MB/s sequential I/O),
> in those cases the increased or decreased cost of lazy_tid_reaped will
> directly correlate to runtime. It's just none of my systems, which all
> run on Amazon and are heavily bandwidth constrained (the fastest I/O
> subsystem I can get my hands on does 200MB/s).

Attached is the patch with the multiarray version.

The tests are weird. Best case comparison over several runs, to remove
the impact of concurrent activity on this host (I couldn't remove all
background activity even when running the tests overnight, the distro
adds tons of crons and background cleanup tasks it would seem),
there's only very mild CPU impact. I'd say insignificant, as it's well
below the mean variance.

Worst case:

DETAIL:  CPU 9.90s/80.94u sec elapsed 1232.42 sec.

Best case:

DETAIL:  CPU 12.10s/63.82u sec elapsed 832.79 sec.

There seems to be more variance with the multiarray approach than the
single array one, but I could not figure out why. Even I/O seems less
stable:

Worst case:

INFO:  "pgbench_accounts": removed 4 row versions in 6557382 pages
DETAIL:  CPU 64.31s/37.60u sec elapsed 2573.88 sec.

Best case:

INFO:  "pgbench_accounts": removed 4 row versions in 6557378 pages
DETAIL:  CPU 54.48s/31.78u sec elapsed 1552.18 sec.

Since this test takes several hours to complete, I could only do a
few runs of each version, so the statistical significance of the test
isn't very strong.

I'll try to compare with smaller pgbench scale numbers and more runs
over the weekend (gotta script that). It's certainly puzzling, I
cannot explain the increased variance, especially in I/O, since the
I/O should be exactly the same. I'm betting it's my system that's
unpredictable somehow. So I'm posting the patch in case someone gets
inspired and can spot the reason, and because there's been a lot of
talk about this very same approach, so I thought I'd better post the
code ;)

I'll also try to get a more predictable system to run the tests on.
From f85fd4213eb6c0b4da8dc9196424f6e8a5a2a9a7 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Tomas Vondra



On 09/14/2016 05:17 PM, Robert Haas wrote:
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.  I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.


Regarding the lookup performance, I don't think the bitmap alone can 
significantly improve that - it's more efficient memory-wise, no doubt 
about that, but it's still likely larger than CPU caches and accessed 
mostly randomly (when vacuuming the indexes).


IMHO the best way to speed-up lookups (if it's really an issue, haven't 
done any benchmarks) would be to build a small bloom filter in front of 
the TID array / bitmap. It shall be fairly small (depending on the 
number of TIDs, error rate etc.) and likely to fit into L2/L3, and 
eliminate a lot of probes into the much larger array/bitmap.
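
A minimal sketch of that kind of pre-filter (hypothetical names and sizing; a
real implementation would derive nbits from the collected TID count and a
target false-positive rate, and nbits is assumed to be a power of two >= 64):

/* Hypothetical Bloom-filter pre-check in front of the dead-TID lookup. */
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	uint64_t   *bits;
	uint64_t	nbits;		/* power of two >= 64, so masking replaces modulo */
} TidBloom;

/* Cheap mixing of (block, offset); any decent hash would do. */
static uint64_t
hash_tid(uint32_t block, uint16_t offset, uint64_t seed)
{
	uint64_t	x = ((uint64_t) block << 16) | offset;

	x ^= seed;
	x *= 0x9E3779B97F4A7C15ULL;
	x ^= x >> 29;
	return x;
}

static void
bloom_add(TidBloom *f, uint32_t block, uint16_t offset)
{
	uint64_t	h1 = hash_tid(block, offset, 0x1234);
	uint64_t	h2 = hash_tid(block, offset, 0xABCD);

	f->bits[(h1 & (f->nbits - 1)) >> 6] |= UINT64_C(1) << (h1 & 63);
	f->bits[(h2 & (f->nbits - 1)) >> 6] |= UINT64_C(1) << (h2 & 63);
}

/* false means "definitely not dead": skip the expensive lookup entirely. */
static bool
bloom_maybe_dead(const TidBloom *f, uint32_t block, uint16_t offset)
{
	uint64_t	h1 = hash_tid(block, offset, 0x1234);
	uint64_t	h2 = hash_tid(block, offset, 0xABCD);

	return (f->bits[(h1 & (f->nbits - 1)) >> 6] & (UINT64_C(1) << (h1 & 63))) != 0 &&
		   (f->bits[(h2 & (f->nbits - 1)) >> 6] & (UINT64_C(1) << (h2 & 63))) != 0;
}

The point is only that the filter is small enough to stay cache-resident, so
most "not dead" probes never touch the large sorted array or bitmap at all.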


Of course, it's another layer of complexity - the good thing is we don't 
need to build the filter until after we collect the TIDs, so we got 
pretty good inputs for the bloom filter parameters.


But all this is based on the assumption that the lookups are actually 
expensive, not sure about that.


regards
Tomas




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Tomas Vondra



On 09/14/2016 07:57 PM, Tom Lane wrote:
> Pavan Deolasee  writes:
>> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera 
>> wrote:
>>> One thing not quite clear to me is how do we create the bitmap
>>> representation starting from the array representation in midflight
>>> without using twice as much memory transiently.  Are we going to write
>>> the array to a temp file, free the array memory, then fill the bitmap by
>>> reading the array from disk?
>>
>> We could do that.
>
> People who are vacuuming because they are out of disk space will be very
> very unhappy with that solution.


The people are usually running out of space for data, while these files 
would be temporary files placed wherever temp_tablespaces points to. I'd 
argue if this is a source of problems, the people are already in deep 
trouble due to sorts, CREATE INDEX, ... as those commands may also 
generate a lot of temporary files.


regards
Tomas




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Masahiko Sawada
On Thu, Sep 15, 2016 at 2:40 AM, Simon Riggs  wrote:
> On 14 September 2016 at 11:19, Pavan Deolasee  
> wrote:
>
>>>  In
>>> theory we could even start with the list of TIDs and switch to the
>>> bitmap if the TID list becomes larger than the bitmap would have been,
>>> but I don't know if it's worth the effort.
>>>
>>
>> Yes, that works too. Or maybe even better because we already know the
>> bitmap size requirements, definitely for the tuples collected so far. We
>> might need to maintain some more stats to further optimise the
>> representation, but that seems like unnecessary detailing at this point.
>
> That sounds best to me... build the simple representation, but as we
> do maintain stats to show to what extent that set of tuples is
> compressible.
>
> When we hit the limit on memory we can then selectively compress
> chunks to stay within memory, starting with the most compressible
> chunks.
>
> I think we should use the chunking approach Robert suggests, though
> mainly because that allows us to consider how parallel VACUUM should
> work - writing the chunks to shmem. That would also allow us to apply
> a single global limit for vacuum memory rather than an allocation per
> VACUUM.
> We can then scan multiple indexes at once in parallel, all accessing
> the shmem data structure.
>

Yeah, the chunking approach Robert suggested seems like a good idea,
but considering implementing parallel vacuum, it would be more
complicated IMO.
I think it's better for each process to simply allocate memory space
for itself than for a single process to allocate a huge memory space
in a complicated way.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Robert Haas
On Wed, Sep 14, 2016 at 1:23 PM, Alvaro Herrera
 wrote:
> Robert Haas wrote:
>> Actually, I think that probably *is* worthwhile, specifically because
>> it might let us avoid multiple index scans in cases where we currently
>> require them.  Right now, our default maintenance_work_mem value is
>> 64MB, which is enough to hold a little over ten million tuples.  It's
>> also large enough to hold a bitmap for a 14GB table.  So right now if
>> you deleted, say, 100 tuples per page you would end up with an index
>> vacuum cycles for every ~100,000 pages = 800MB, whereas switching to
>> the bitmap representation for such cases would require only one index
>> vacuum cycle for every 14GB, more than an order of magnitude
>> improvement!
>
> Yeah, this sounds worthwhile.  If we switch to the more compact
> in-memory representation close to the point where we figure the TID
> array is not going to fit in m_w_m, then we're saving some number of
> additional index scans, and I'm pretty sure that the time to transform
> from array to bitmap is going to be more than paid back by the I/O
> savings.

Yes, that seems pretty clear.  The indexes can be arbitrarily large
and there can be arbitrarily many of them, so we could be save a LOT
of I/O.

> One thing not quite clear to me is how do we create the bitmap
> representation starting from the array representation in midflight
> without using twice as much memory transiently.  Are we going to write
> the array to a temp file, free the array memory, then fill the bitmap by
> reading the array from disk?

I was just thinking about this exact problem while I was out to
lunch.[1]  I wonder if we could do something like this:

1. Allocate an array large enough for one pointer per gigabyte of the
underlying relation.

2. Allocate 64MB, or the remaining amount of maintenance_work_mem if
it's less, to store TIDs.

3. At the beginning of each 1GB chunk, add a pointer to the first free
byte in the slab allocated in step 2 to the array allocated in step 1.
Write a header word that identifies this as a TID list (rather than a
bitmap) and leave space for a TID count; then, write the TIDs
afterwards.  Continue doing this until one of the following things
happens: (a) we reach the end of the 1GB chunk - if that happens,
restart step 3 for the next chunk; (b) we fill the chunk - see step 4,
or (c) we write more TIDs for the chunk than the space being used for
TIDs now exceeds the space needed for a bitmap - see step 5.

4. When we fill up one of the slabs allocated in step 2, allocate a
new one and move the tuples for the current 1GB chunk to the beginning
of the new slab using memmove().  This is wasteful of both CPU time
and memory, but I think it's not that bad.  The maximum possible waste
is less than 10%, and many allocators have more overhead than that.
We could reduce the waste by using, say, 256MB chunks rather than 1GB
chunks.   If no new slab can be allocated because maintenance_work_mem
is completely exhausted (or the remaining space isn't enough for the
TIDs that would need to be moved immediately), then stop and do an
index vacuum cycle.

5. When we write a large enough number of TIDs for 1GB chunk that the
bitmap would be smaller, check whether sufficient maintenance_work_mem
remains to allocate a bitmap for 1GB chunk (~4.5MB).  If not, never
mind; continue with step 3 as if the bitmap representation did not
exist.  If so, allocate space for a bitmap, move all of the TIDs for
the current chunk into it, and update the array allocated in step 1 to
point to it.  Then, finish scanning the current 1GB chunk, updating
that bitmap rather than inserting TIDs into the slab.  Rewind our
pointer into the slab to where it was at the beginning of the current
1GB chunk, so that the memory we consumed for TIDs can be reused now
that those TIDs have been transferred to a bitmap.  If, earlier in the
current 1GB chunk, we did a memmove-to-next-slab operation as
described in step 4, this "rewind" might move our pointer back into
the previous slab, in which case we can free the now-empty slab.  (The
next 1GB segment might have few enough TIDs that they will fit into
the leftover space in the previous slab.)

With this algorithm, we never exceed maintenance_work_mem, not even
transiently.  When memory is no longer sufficient to convert to the
bitmap representation without bursting above maintenance_work_mem, we
simply don't perform the conversion.  Also, we do very little memory
copying.  An alternative I considered was to do a separate allocation
for each 1GB chunk rather than carving the dead-tuple space out of
slabs.  But the problem with that is that you'll have to start those
out small (in case you don't find many dead tuples) and then grow
them, which means reallocating, which is bad both because it can burst
above maintenance_work_mem while the repalloc is in process and also
because you have to keep copying the data from the old chunk to the
new, bigger chunk.  
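
To make the shape of that proposal concrete, a bare-bones sketch of the
per-gigabyte chunk directory being described (hypothetical layout only; the
slab allocation and memmove mechanics from steps 2-5 are not shown):

/* Hypothetical per-1GB-chunk directory for dead-tuple storage. */
#include <stddef.h>
#include <stdint.h>

typedef enum
{
	CHUNK_EMPTY,		/* no dead tuples in this 1GB chunk */
	CHUNK_TID_LIST,		/* sorted TID array, 6 bytes per entry */
	CHUNK_BITMAP		/* one bit per (page, offset), ~4.5MB per chunk */
} ChunkKind;

typedef struct
{
	ChunkKind	kind;
	uint32_t	ntids;		/* used only for CHUNK_TID_LIST */
	void	   *data;		/* points into a slab, or at a bitmap allocation */
} DeadChunk;

typedef struct
{
	uint32_t	nchunks;	/* relation size, in 1GB chunks */
	DeadChunk  *chunks;		/* one entry per 1GB of the relation (step 1) */
} DeadTupleDir;

/*
 * For a chunk being scanned, the TID list stops being the cheaper
 * representation once it grows past the size of the fixed bitmap.
 */
static int
chunk_prefers_bitmap(uint32_t ntids, size_t bitmap_bytes)
{
	return (size_t) ntids * 6 /* sizeof(ItemPointerData) */ > bitmap_bytes;
}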

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Tom Lane
Pavan Deolasee  writes:
> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera 
> wrote:
>> One thing not quite clear to me is how do we create the bitmap
>> representation starting from the array representation in midflight
>> without using twice as much memory transiently.  Are we going to write
>> the array to a temp file, free the array memory, then fill the bitmap by
>> reading the array from disk?

> We could do that.

People who are vacuuming because they are out of disk space will be very
very unhappy with that solution.

regards, tom lane




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera 
wrote:

>
>
> One thing not quite clear to me is how do we create the bitmap
> representation starting from the array representation in midflight
> without using twice as much memory transiently.  Are we going to write
> the array to a temp file, free the array memory, then fill the bitmap by
> reading the array from disk?
>

We could do that. Or maybe compress the TID array once half of m_w_m is
consumed and do this repeatedly with the remaining memory. For example, if we
start with 1GB of memory, we decide to compress at 512MB. Say that results in
300MB for the bitmap. We then continue to accumulate TIDs and do another round
of folding up when another 350MB is consumed.

I think we should maintain a per-offset count of the number of dead tuples to
choose the most optimal bitmap size (that needs an overflow region). We can
also track how many blocks or block ranges have at least one dead tuple to
know if it's worthwhile to have some sort of indirection. Together that can
tell us how much compression can be achieved and allow us to choose the
most optimal representation.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Simon Riggs
On 14 September 2016 at 11:19, Pavan Deolasee  wrote:

>>  In
>> theory we could even start with the list of TIDs and switch to the
>> bitmap if the TID list becomes larger than the bitmap would have been,
>> but I don't know if it's worth the effort.
>>
>
> Yes, that works too. Or maybe even better because we already know the
> bitmap size requirements, definitely for the tuples collected so far. We
> might need to maintain some more stats to further optimise the
> representation, but that seems like unnecessary detailing at this point.

That sounds best to me... build the simple representation, but as we
do maintain stats to show to what extent that set of tuples is
compressible.

When we hit the limit on memory we can then selectively compress
chunks to stay within memory, starting with the most compressible
chunks.

I think we should use the chunking approach Robert suggests, though
mainly because that allows us to consider how parallel VACUUM should
work - writing the chunks to shmem. That would also allow us to apply
a single global limit for vacuum memory rather than an allocation per
VACUUM.
We can then scan multiple indexes at once in parallel, all accessing
the shmem data structure.

We should also find the compression is better when we consider chunks
rather than the whole data structure at once.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Alvaro Herrera
Robert Haas wrote:

> Actually, I think that probably *is* worthwhile, specifically because
> it might let us avoid multiple index scans in cases where we currently
> require them.  Right now, our default maintenance_work_mem value is
> 64MB, which is enough to hold a little over ten million tuples.  It's
> also large enough to hold a bitmap for a 14GB table.  So right now if
> you deleted, say, 100 tuples per page you would end up with an index
> vacuum cycles for every ~100,000 pages = 800MB, whereas switching to
> the bitmap representation for such cases would require only one index
> vacuum cycle for every 14GB, more than an order of magnitude
> improvement!

Yeah, this sounds worthwhile.  If we switch to the more compact
in-memory representation close to the point where we figure the TID
array is not going to fit in m_w_m, then we're saving some number of
additional index scans, and I'm pretty sure that the time to transform
from array to bitmap is going to be more than paid back by the I/O
savings.

One thing not quite clear to me is how do we create the bitmap
representation starting from the array representation in midflight
without using twice as much memory transiently.  Are we going to write
the array to a temp file, free the array memory, then fill the bitmap by
reading the array from disk?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Claudio Freire
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas  wrote:
> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.

If you compress the list into a bitmap a posteriori, you know the
number of tuples per page, so you could encode the bitmap even more
efficiently.

It's not a bad idea, one that can be slapped on top of the multiarray
patch - when closing a segment, it can be decided whether to turn it
into a bitmap or not.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 8:47 PM, Robert Haas  wrote:

>
>
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.  I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.


Yeah, I wouldn't worry only about lookup speedup, but if it does speed up,
that's a bonus. But the bitmaps seem to win even for memory consumption. As
theory and experiments both show, at a 10% dead tuple ratio, bitmaps will win
handsomely.

 In
> theory we could even start with the list of TIDs and switch to the
> bitmap if the TID list becomes larger than the bitmap would have been,
> but I don't know if it's worth the effort.
>
>
Yes, that works too. Or maybe even better because we already know the
bitmap size requirements, definitely for the tuples collected so far. We
might need to maintain some more stats to further optimise the
representation, but that seems like unnecessary detailing at this point.


>
> On the other hand, if we switch to the bitmap as the ONLY possible
> representation, we will lose badly when there are scattered updates -
> e.g. 1 deleted tuple every 10 pages.


Sure. I never suggested that. What I'd suggested is to switch back to array
representation once we realise bitmaps are not going to work. But I see
it's probably better the other way round.



> So it seems like we probably
> want to have both options.  One tricky part is figuring out how we
> switch between them when memory gets tight; we have to avoid bursting
> above our memory limit while making the switch.


Yes, I was thinking about this problem. Some modelling will be necessary to
ensure that we don't go (much) beyond the maintenance_work_mem while
switching representation, which probably means you need to do that earlier
than necessary.


> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.
>
>
Yes, seems like a good idea. Another idea that I had in mind is to use some
sort of indirection map where the bitmap for every block or set of blocks
will either be recorded or not, depending on whether a bit is set for the
range. If the bitmap exists, the indirection map will give the offset
into the larger bitmap area. Seems similar to what you described.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Arthur Silva
On Sep 14, 2016 5:18 PM, "Robert Haas"  wrote:
>
> On Wed, Sep 14, 2016 at 8:16 AM, Pavan Deolasee
>  wrote:
> > Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per page
> > bitmap size. That's about 36 bytes for an 8K page. IOW if on average there
> > are 6 or more dead tuples per page, bitmap will outperform the current
> > representation, assuming max allocation for bitmap. If we can use additional
> > estimates to restrict the size to a somewhat more conservative value and then
> > keep an overflow area, then probably the break-even happens even earlier than
> > that. I hope this gives us a good starting point, but let me know if you
> > think it's still a wrong approach to pursue.
>
> Well, it's certainly a bigger change.  I think the big concern is that
> the amount of memory now becomes fixed based on the table size.  So
> one problem is that you have to figure out what you're going to do if
> the bitmap doesn't fit in maintenance_work_mem.  A related problem is
> that it might fit but use more memory than before, which could cause
> problems for some people.  Now on the other hand it could also use
> less memory for some people, and that would be good.
>
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.  I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.
>
> I'm fairly sure that our number one priority should be to minimize the
> number of cases where we need to do multiple scans of the indexes to
> stay within maintenance_work_mem.  If we're satisfied we've met that
> goal, then within that we should try to make VACUUM as fast as
> possible with as little memory usage as possible.  I'm not 100% sure I
> know how to get there, or how much work it's worth expending.  In
> theory we could even start with the list of TIDs and switch to the
> bitmap if the TID list becomes larger than the bitmap would have been,
> but I don't know if it's worth the effort.
>
> /me thinks a bit.
>
> Actually, I think that probably *is* worthwhile, specifically because
> it might let us avoid multiple index scans in cases where we currently
> require them.  Right now, our default maintenance_work_mem value is
> 64MB, which is enough to hold a little over ten million tuples.  It's
> also large enough to hold a bitmap for a 14GB table.  So right now if
> you deleted, say, 100 tuples per page you would end up with an index
> vacuum cycles for every ~100,000 pages = 800MB, whereas switching to
> the bitmap representation for such cases would require only one index
> vacuum cycle for every 14GB, more than an order of magnitude
> improvement!
>
> On the other hand, if we switch to the bitmap as the ONLY possible
> representation, we will lose badly when there are scattered updates -
> e.g. 1 deleted tuple every 10 pages.  So it seems like we probably
> want to have both options.  One tricky part is figuring out how we
> switch between them when memory gets tight; we have to avoid bursting
> above our memory limit while making the switch.  And even if our
> memory limit is very high, we want to avoid using memory gratuitously;
> I think we should try to grow memory usage incrementally with either
> representation.
>
> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

I'd say it's an idea worth pursuing. It's the base idea behind roaring
bitmaps, arguably the best overall compressed bitmap implementation.


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Claudio Freire
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas  wrote:
>
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken.  I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.

FYI, the reported impact was on CPU time, not runtime. There was no
significant difference in runtime (real time), because my test is
heavily I/O bound.

I tested with a few small tables and there was no significant
difference either, but small tables don't stress the array lookup
anyway so that's expected.

But on the assumption that some systems may be CPU bound during vacuum
(particularly those able to do more than 300-400MB/s sequential I/O),
in those cases the increased or decreased cost of lazy_tid_reaped will
directly correlate to runtime. It's just none of my systems, which all
run on Amazon and are heavily bandwidth constrained (the fastest I/O
subsystem I can get my hands on does 200MB/s).




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Robert Haas
On Wed, Sep 14, 2016 at 8:16 AM, Pavan Deolasee
 wrote:
> Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per page
> bitmap size. That's about 36 bytes for an 8K page. IOW if on average there
> are 6 or more dead tuples per page, bitmap will outperform the current
> representation, assuming max allocation for bitmap. If we can use additional
> estimates to restrict the size to a somewhat more conservative value and then
> keep an overflow area, then probably the break-even happens even earlier than
> that. I hope this gives us a good starting point, but let me know if you
> think it's still a wrong approach to pursue.

Well, it's certainly a bigger change.  I think the big concern is that
the amount of memory now becomes fixed based on the table size.  So
one problem is that you have to figure out what you're going to do if
the bitmap doesn't fit in maintenance_work_mem.  A related problem is
that it might fit but use more memory than before, which could cause
problems for some people.  Now on the other hand it could also use
less memory for some people, and that would be good.

I am kind of doubtful about this whole line of investigation because
we're basically trying pretty hard to fix something that I'm not sure
is broken.  I do agree that, all other things being equal, the TID
lookups will probably be faster with a bitmap than with a binary
search, but maybe not if the table is large and the number of dead
TIDs is small, because cache efficiency is pretty important.  But even
if it's always faster, does TID lookup speed even really matter to
overall VACUUM performance? Claudio's early results suggest that it
might, but maybe that's just a question of some optimization that
hasn't been done yet.

I'm fairly sure that our number one priority should be to minimize the
number of cases where we need to do multiple scans of the indexes to
stay within maintenance_work_mem.  If we're satisfied we've met that
goal, then within that we should try to make VACUUM as fast as
possible with as little memory usage as possible.  I'm not 100% sure I
know how to get there, or how much work it's worth expending.  In
theory we could even start with the list of TIDs and switch to the
bitmap if the TID list becomes larger than the bitmap would have been,
but I don't know if it's worth the effort.

/me thinks a bit.

Actually, I think that probably *is* worthwhile, specifically because
it might let us avoid multiple index scans in cases where we currently
require them.  Right now, our default maintenance_work_mem value is
64MB, which is enough to hold a little over ten million tuples.  It's
also large enough to hold a bitmap for a 14GB table.  So right now if
you deleted, say, 100 tuples per page you would end up with an index
vacuum cycle for every ~100,000 pages = 800MB, whereas switching to
the bitmap representation for such cases would require only one index
vacuum cycle for every 14GB, more than an order of magnitude
improvement!
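
Spelling out the arithmetic behind those figures (a back-of-the-envelope
check, assuming 8kB pages, 6-byte TIDs, and the roughly 36-byte-per-page
bitmap discussed elsewhere in the thread):

  64MB / 6 bytes per TID            ~= 11.2 million dead TIDs
  64MB / ~36 bytes per page         ~= 1.8 million pages ~= 14GB of heap covered by one bitmap
  100 dead TIDs per page * 6 bytes   = 600 bytes of TID list per page
  64MB / 600 bytes per page         ~= 110,000 pages ~= roughly 800MB of heap per index vacuum cycle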

On the other hand, if we switch to the bitmap as the ONLY possible
representation, we will lose badly when there are scattered updates -
e.g. 1 deleted tuple every 10 pages.  So it seems like we probably
want to have both options.  One tricky part is figuring out how we
switch between them when memory gets tight; we have to avoid bursting
above our memory limit while making the switch.  And even if our
memory limit is very high, we want to avoid using memory gratuitously;
I think we should try to grow memory usage incrementally with either
representation.

For instance, one idea to grow memory usage incrementally would be to
store dead tuple information separately for each 1GB segment of the
relation.  So we have an array of dead-tuple-representation objects,
one for every 1GB of the relation.  If there are no dead tuples in a
given 1GB segment, then this pointer can just be NULL.  Otherwise, it
can point to either the bitmap representation (which will take ~4.5MB)
or it can point to an array of TIDs (which will take 6 bytes/TID).
That could handle an awfully wide variety of usage patterns
efficiently; it's basically never worse than what we're doing today,
and when the dead tuple density is high for any portion of the
relation it's a lot better.
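
To make the shape of that concrete, here is a rough standalone sketch of
such a store; every name and constant below is invented for illustration
and is not code from any posted patch:

/*
 * Illustrative sketch only (not the patch): one dead-tuple object per 1GB
 * (131072-page) range of the heap, each either a per-range bitmap or a
 * sorted TID array, with NULL for ranges containing no dead tuples.
 */
#include <stdbool.h>
#include <stdint.h>

#define PAGES_PER_RANGE     131072      /* 1GB of 8kB pages */
#define MAX_TUPLES_PER_PAGE 291         /* MaxHeapTuplesPerPage for 8kB pages */

typedef enum { RANGE_BITMAP, RANGE_TID_ARRAY } RangeKind;

typedef struct DeadRange
{
    RangeKind   kind;
    /* RANGE_BITMAP: one bit per possible (page, offset) in the range (~4.5MB) */
    uint8_t    *bitmap;
    /* RANGE_TID_ARRAY: sorted (page-in-range, offset) pairs, 6 bytes each */
    uint32_t   *tid_page;
    uint16_t   *tid_off;
    int         ntids;
} DeadRange;

typedef struct DeadTupleStore
{
    DeadRange **ranges;         /* NULL entry = no dead tuples in that 1GB */
    uint32_t    nranges;
} DeadTupleStore;

bool
dead_tuple_exists(const DeadTupleStore *store, uint32_t blkno, uint16_t offnum)
{
    uint32_t    rangeno = blkno / PAGES_PER_RANGE;
    uint32_t    page = blkno % PAGES_PER_RANGE;
    const DeadRange *r;

    if (rangeno >= store->nranges || store->ranges[rangeno] == NULL)
        return false;
    r = store->ranges[rangeno];

    if (r->kind == RANGE_BITMAP)
    {
        /* offsets are 1-based, so bit 0 of a page is offset 1 */
        uint64_t    bit = (uint64_t) page * MAX_TUPLES_PER_PAGE + (offnum - 1);

        return (r->bitmap[bit / 8] >> (bit % 8)) & 1;
    }

    /* RANGE_TID_ARRAY: binary search of the sorted (page, offset) pairs */
    for (int lo = 0, hi = r->ntids - 1; lo <= hi;)
    {
        int         mid = lo + (hi - lo) / 2;

        if (r->tid_page[mid] < page ||
            (r->tid_page[mid] == page && r->tid_off[mid] < offnum))
            lo = mid + 1;
        else if (r->tid_page[mid] == page && r->tid_off[mid] == offnum)
            return true;
        else
            hi = mid - 1;
    }
    return false;
}

The attraction is that the representation is chosen independently per 1GB
range, so a mostly-static table with one heavily updated region pays the
bitmap cost only where it actually helps.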

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 5:32 PM, Robert Haas  wrote:

> On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee
>  wrote:
> > Another interesting bit about these small tables is that the largest used
> > offset for these tables never went beyond 291, which is the value of
> > MaxHeapTuplesPerPage. I don't know if there is something that prevents
> > inserting more than MaxHeapTuplesPerPage offsets per heap page and I don't
> > know at this point if this gives us an upper limit for bits per page
> > (maybe it does).
>
> From PageAddItemExtended:
>
> /* Reject placing items beyond heap boundary, if heap */
> if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage)
> {
> elog(WARNING, "can't put more than MaxHeapTuplesPerPage items in a heap page");
> return InvalidOffsetNumber;
> }
>
> Also see the comment where MaxHeapTuplesPerPage is defined:
>
>  * Note: with HOT, there could theoretically be more line pointers (not
> actual
>  * tuples) than this on a heap page.  However we constrain the number of
> line
>  * pointers to this anyway, to avoid excessive line-pointer bloat and not
>  * require increases in the size of work arrays.
>
>
Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per-page
bitmap size. That's about 36 bytes for an 8K page. IOW, if on average there
are 6 or more dead tuples per page, the bitmap will outperform the current
representation, assuming max allocation for the bitmap. If we can use
additional estimates to restrict the size to a somewhat more conservative
value and then keep an overflow area, then probably the break-even happens
even earlier than that. I hope this gives us a good starting point, but let
me know if you think it's still a wrong approach to pursue.
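
To spell that break-even out (a back-of-the-envelope check using the figures
above):

  bitmap:    MaxHeapTuplesPerPage = 291 bits/page ~= 36 bytes/page, regardless of how many tuples are dead
  TID array: 6 bytes per dead tuple
  36 / 6 = 6  =>  at 6 or more dead tuples per page the bitmap is no larger than the TID array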

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Robert Haas
On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee
 wrote:
> Another interesting bit about these small tables is that the largest used
> offset for these tables never went beyond 291, which is the value of
> MaxHeapTuplesPerPage. I don't know if there is something that prevents
> inserting more than MaxHeapTuplesPerPage offsets per heap page and I don't
> know at this point if this gives us an upper limit for bits per page (maybe
> it does).

From PageAddItemExtended:

/* Reject placing items beyond heap boundary, if heap */
if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage)
{
elog(WARNING, "can't put more than MaxHeapTuplesPerPage items in a heap page");
return InvalidOffsetNumber;
}

Also see the comment where MaxHeapTuplesPerPage is defined:

 * Note: with HOT, there could theoretically be more line pointers (not actual
 * tuples) than this on a heap page.  However we constrain the number of line
 * pointers to this anyway, to avoid excessive line-pointer bloat and not
 * require increases in the size of work arrays.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 8:47 AM, Pavan Deolasee 
wrote:

>
>>
> Sawada-san offered to reimplement the patch based on what I proposed
> upthread. In the new scheme of things, we will allocate a fixed size bitmap
> of length 2D bits per page where D is average page density of live + dead
> tuples. (The rationale behind multiplying D by a factor of 2 is to consider
> the worst-case scenario where every tuple also has an LP_REDIRECT line
> pointer.) The value of D in most real world, large tables should not go
> much beyond, say 100, assuming 80 bytes wide tuple and 8K blocksize. That
> translates to about 25 bytes/page. So all TIDs with offset less than 2D can
> be represented by a single bit. We augment this with an overflow to track
> tuples which fall outside this limit. I believe this area will be small,
> say 10% of the total allocation.
>
>
So I cooked up the attached patch to track the number of live/dead tuples
found at each offset from 1 to MaxOffsetNumber. The idea was to see how many
tuples actually go beyond the threshold of 2D offsets per page. Note that I
am proposing to track the first 2D offsets via the bitmap and the rest via a
regular TID array.

I ran a pgbench test for 2 hours with scale factor 500. The autovacuum scale
factor was set to 0.1, i.e. 10%.

Some interesting bits:

postgres=# select relname, n_tup_ins, n_tup_upd, n_tup_hot_upd, n_live_tup,
n_dead_tup, pg_relation_size(relid)/8192 as relsize,
(n_live_tup+n_dead_tup)/(pg_relation_size(relid)/8192) as density from
pg_stat_user_tables ;
     relname      | n_tup_ins | n_tup_upd | n_tup_hot_upd | n_live_tup | n_dead_tup | relsize | density
------------------+-----------+-----------+---------------+------------+------------+---------+---------
 pgbench_tellers  |      5000 |  95860289 |      87701578 |       5000 |          0 |    3493 |       1
 pgbench_branches |       500 |  95860289 |      94158081 |        967 |          0 |    1544 |       0
 pgbench_accounts |      5000 |  95860289 |      93062567 |   51911657 |    3617465 |  865635 |      64
 pgbench_history  |  95860289 |         0 |             0 |   95258548 |          0 |  610598 |     156
(4 rows)

The smaller tellers and branches tables bloat so much that the density, as
computed from live + dead tuples, falls close to 1 tuple/page. So for such
tables, the idea of 2D bits/page will fail miserably. But I think these
tables are worst-case representatives, and I would be extremely surprised if
we ever found a very large table bloated that much. Even then, this probably
tells us that we can't rely solely on the density measure.

Another interesting bit about these small tables is that the largest used
offset for these tables never went beyond 291, which is the value of
MaxHeapTuplesPerPage. I don't know if there is something that prevents
inserting more than MaxHeapTuplesPerPage offsets per heap page, and I don't
know at this point if this gives us an upper limit for bits per page (maybe
it does).

For the pgbench_accounts table, the maximum offset used was 121, though the
bulk of the used offsets were at the start of the page (see attached graph).
The test did not create enough dead tuples to trigger autovacuum on the
pgbench_accounts table, so I ran a manual vacuum at the end. (There were
about 5% dead tuples in the table by the time the test finished.)

postgres=# VACUUM VERBOSE pgbench_accounts ;
INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 2797722 row versions
DETAIL:  CPU 0.00s/9.39u sec elapsed 9.39 sec.
INFO:  "pgbench_accounts": removed 2797722 row versions in 865399 pages
DETAIL:  CPU 0.10s/7.01u sec elapsed 7.11 sec.
INFO:  index "pgbench_accounts_pkey" now contains 5000 row versions in
137099 pages
DETAIL:  2797722 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pgbench_accounts": found 852487 removable, 5000 nonremovable
row versions in 865635 out of 865635 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 802256 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.73s/27.20u sec elapsed 27.98 sec.

[Attached graph: tuple count at each offset (offnum:all_tuples:dead_tuples)]

For 2797722 dead line pointers, the current representation would have used
2797722 x 6 = 16786332 bytes of memory. The most compact bitmap would have
used 121 bits/page x 865399 pages = 13089159 bytes, whereas if we had
provisioned 2D bits/page and assumed D = 64 based on the above calculation,
we would have used 13846384 bytes of memory. This is about 18% less than the
current representation. Of course, we would have allocated some space for
the overflow region, which will make the difference smaller/negligible. But
the bitmaps would be extremely cheap to look up during index scans.

Now maybe I got lucky, maybe I did not run the tests long enough (though I
believe that may have worked in favour of the bitmap), maybe mostly
HOT-updated tables are not good candidates for 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Pavan Deolasee
On Wed, Sep 14, 2016 at 12:21 AM, Robert Haas  wrote:

> On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada 
> wrote:
> > Attached PoC patch changes the representation of dead tuple locations
> > to the hashmap having tuple bitmap.
> > The one hashmap entry consists of the block number and the TID bitmap
> > of corresponding block, and the block number is the hash key of
> > hashmap.
> > Current implementation of this patch is not smart yet because each
> > hashmap entry allocates the tuple bitmap with fixed
> > size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
> > LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
> > In cases where one block can store only several tens of tuples, most of
> > the bits would be wasted.
> >
> > After improving this patch as you suggested, I will measure the performance
> > benefit.
>
>
>
> Now, if I'm reading it correctly, this patch allocates a 132-byte
> structure for every page with at least one dead tuple.  In the worst
> case where there's just one dead tuple per page, that's a 20x
> regression in memory usage.  Actually, it's more like 40x, because
> AllocSetAlloc rounds small allocation sizes up to the next-higher
> power of two, which really stings for a 132-byte allocation, and then
> adds a 16-byte header to each chunk.  But even 20x is clearly not
> good.  There are going to be lots of real-world cases where this uses
> way more memory to track the same number of dead tuples, and I'm
> guessing that benchmarking is going to reveal that it's not faster,
> either.
>
>
Sawada-san offered to reimplement the patch based on what I proposed
upthread. In the new scheme of things, we will allocate a fixed-size bitmap
of length 2D bits per page, where D is the average page density of live +
dead tuples. (The rationale behind multiplying D by a factor of 2 is to
consider the worst-case scenario where every tuple also has an LP_REDIRECT
line pointer.) The value of D in most real-world large tables should not go
much beyond, say, 100, assuming an 80-byte-wide tuple and an 8K block size.
That translates to about 25 bytes/page. So any TID with an offset less than
2D can be represented by a single bit. We augment this with an overflow area
to track tuples which fall outside this limit; I believe this area will be
small, say 10% of the total allocation.
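
As a rough standalone illustration of that scheme (all names here are
invented for the example and taken from no posted patch; D would in practice
be derived from the table's live+dead density as described above):

/*
 * Illustrative sketch only: a fixed bitmap of bits_per_page (2D) bits per
 * heap page plus a small sorted overflow array for offsets beyond that cap.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct OverflowTid
{
    uint32_t    page;           /* block number */
    uint16_t    offnum;         /* 1-based offset number */
} OverflowTid;

typedef struct DeadTupleBitmap
{
    uint32_t    bits_per_page;  /* 2D, fixed when vacuum starts */
    uint8_t    *bitmap;         /* nblocks * bits_per_page bits, zero-initialized */
    OverflowTid *overflow;      /* sorted by (page, offnum) */
    int         noverflow;
} DeadTupleBitmap;

bool
dead_tuple_lookup(const DeadTupleBitmap *dt, uint32_t page, uint16_t offnum)
{
    if (offnum <= dt->bits_per_page)
    {
        /* common case: a single O(1) bit probe */
        uint64_t    bit = (uint64_t) page * dt->bits_per_page + (offnum - 1);

        return (dt->bitmap[bit / 8] >> (bit % 8)) & 1;
    }

    /* rare case: binary search of the sorted overflow area */
    for (int lo = 0, hi = dt->noverflow - 1; lo <= hi;)
    {
        int         mid = lo + (hi - lo) / 2;
        const OverflowTid *t = &dt->overflow[mid];

        if (t->page < page || (t->page == page && t->offnum < offnum))
            lo = mid + 1;
        else if (t->page == page && t->offnum == offnum)
            return true;
        else
            hi = mid - 1;
    }
    return false;
}

The common case, offsets below the 2D cap, costs one bit probe; only the
presumably rare overflow TIDs pay for a binary search.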

This representation is at least as good as the current representation if
there are at least 4-5% dead tuples. I don't think very large tables will be
vacuumed with a scale factor less than that. And assuming 10% dead tuples,
this representation will actually be much closer to optimal.

The idea can fail when (a) there are very few dead tuples in the table, say
less than 5%, or (b) there are a large number of tuples falling outside the
2D limit. While I don't expect either of these to hold for real-world, very
large tables, (a) can be anticipated when the vacuum starts, in which case
we use the current representation, and (b) can be detected at run time, with
a one-time switch between representations. You may argue that managing two
representations is clumsy, which I agree with, but the code is completely
isolated and probably not more than a few hundred lines.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Claudio Freire
On Tue, Sep 13, 2016 at 4:06 PM, Robert Haas  wrote:
> On Tue, Sep 13, 2016 at 2:59 PM, Claudio Freire  
> wrote:
>> I've finished writing that patch, I'm in the process of testing its CPU 
>> impact.
>>
>> First test seemed to hint at a 40% increase in CPU usage, which seems
>> rather steep compared to what I expected, so I'm trying to rule out
>> some methodology error here.
>
> Hmm, wow.  That's pretty steep.  Maybe lazy_heap_reaptid() is hotter
> than I think it is, but even if it accounts for 10% of total CPU usage
> within a vacuum, which seems like an awful lot, you'd have to make it
> 4x as expensive, which also seems like an awful lot.

IIRC perf top reported a combined 45% between lazy_heap_reaptid +
vac_cmp_itemptr (after patching).

vac_cmp_itemptr was around 15% on its own

Debug build of course (I need the assertions and the debug symbols),
I'll retest with optimizations once debug tests make sense.

>>> For example, we could keep a bitmap with one bit per K
>>> pages.  If the bit is set, there is at least 1 dead tuple on that
>>> page; if clear, there are none.  When we see an index tuple, we
>>> consult the bitmap to determine whether we need to search the TID
>>> list.  We select K to be the smallest power of 2 such that the bitmap
>>> uses less memory than some threshold, perhaps 64kB.
>>
>> I've been pondering something like that, but that's an optimization
>> that's quite orthogonal to the multiarray stuff.
>
> Sure, but if this really does increase CPU time, it'd be reasonable to
> do something to decrease it again in order to get the other benefits
> of this patch - i.e. increasing the maintenance_work_mem limit while
> reducing the chances that overallocation will cause OOM.

I was hoping it wouldn't regress performance so much. I'd rather
micro-optimize the multiarray implementation until it doesn't regress, and
then think of orthogonal optimizations.

>>>  Assuming that
>>> updates and deletes to the table have some locality, we should be able
>>> to skip a large percentage of the TID searches with a probe into this
>>> very compact bitmap.
>>
>> I don't think you can assume locality
>
> Really?  If you have a 1TB table, how many 2MB ranges of that table do
> you think will contain dead tuples for a typical vacuum?  I think most
> tables of that size are going to be mostly static, and the all-visible
> and all-frozen bits are going to be mostly set.  You *could* have
> something like a pgbench-type workload that does scattered updates
> across the entire table, but that's going to perform pretty poorly
> because you'll constantly be updating blocks that have to be pulled in
> from disk.

I have a few dozen of those in my biggest database. They do updates
and deletes all over the place and, even if they were few, they're
scattered almost uniformly.

Thing is, I think we really need to not worsen that case, which seems
rather common (almost any OLTP with a big enough user base, or a K-V
type of table, or TOAST tables).




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Robert Haas
On Tue, Sep 13, 2016 at 2:59 PM, Claudio Freire  wrote:
> I've finished writing that patch, I'm in the process of testing its CPU 
> impact.
>
> First test seemed to hint at a 40% increase in CPU usage, which seems
> rather steep compared to what I expected, so I'm trying to rule out
> some methodology error here.

Hmm, wow.  That's pretty steep.  Maybe lazy_heap_reaptid() is hotter
than I think it is, but even if it accounts for 10% of total CPU usage
within a vacuum, which seems like an awful lot, you'd have to make it
4x as expensive, which also seems like an awful lot.

>> It would require
>> replacing the call to bsearch() in lazy_heap_reaptid() with an
>> open-coded implementation of bsearch, or with one bsearch to find the
>> chunk and another to find the TID within the chunk, but that shouldn't
>> be very expensive.
>
> I did a linear search to find the chunk, with exponentially growing
> chunks, and then a bsearch to find the item inside the chunk.
>
> With the typical number of segments and given the 12GB limit, the
> segment array size is well within the range that favors linear search.

Ah, OK.

>> For example, we could keep a bitmap with one bit per K
>> pages.  If the bit is set, there is at least 1 dead tuple on that
>> page; if clear, there are none.  When we see an index tuple, we
>> consult the bitmap to determine whether we need to search the TID
>> list.  We select K to be the smallest power of 2 such that the bitmap
>> uses less memory than some threshold, perhaps 64kB.
>
> I've been pondering something like that, but that's an optimization
> that's quite orthogonal to the multiarray stuff.

Sure, but if this really does increase CPU time, it'd be reasonable to
do something to decrease it again in order to get the other benefits
of this patch - i.e. increasing the maintenance_work_mem limit while
reducing the chances that overallocation will cause OOM.

>>  Assuming that
>> updates and deletes to the table have some locality, we should be able
>> to skip a large percentage of the TID searches with a probe into this
>> very compact bitmap.
>
> I don't think you can assume locality

Really?  If you have a 1TB table, how many 2MB ranges of that table do
you think will contain dead tuples for a typical vacuum?  I think most
tables of that size are going to be mostly static, and the all-visible
and all-frozen bits are going to be mostly set.  You *could* have
something like a pgbench-type workload that does scattered updates
across the entire table, but that's going to perform pretty poorly
because you'll constantly be updating blocks that have to be pulled in
from disk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Claudio Freire
On Tue, Sep 13, 2016 at 3:51 PM, Robert Haas  wrote:
> On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada  wrote:
>> Attached PoC patch changes the representation of dead tuple locations
>> to the hashmap having tuple bitmap.
>> The one hashmap entry consists of the block number and the TID bitmap
>> of corresponding block, and the block number is the hash key of
>> hashmap.
>> Current implementation of this patch is not smart yet because each
>> hashmap entry allocates the tuple bitmap with fixed
>> size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
>> LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
>> In cases where one block can store only several tens of tuples, most of
>> the bits would be wasted.
>>
>> After improving this patch as you suggested, I will measure the performance
>> benefit.
>
> We also need to consider the amount of memory that gets used.  What I
> proposed - replacing the array with an array of arrays - would not
> increase memory utilization significantly.  I don't think it would
> have much impact on CPU utilization either.

I've finished writing that patch, I'm in the process of testing its CPU impact.

First test seemed to hint at a 40% increase in CPU usage, which seems
rather steep compared to what I expected, so I'm trying to rule out
some methodology error here.

> It would require
> replacing the call to bsearch() in lazy_heap_reaptid() with an
> open-coded implementation of bsearch, or with one bsearch to find the
> chunk and another to find the TID within the chunk, but that shouldn't
> be very expensive.

I did a linear search to find the chunk, with exponentially growing
chunks, and then a bsearch to find the item inside the chunk.

With the typical number of segments and given the 12GB limit, the
segment array size is well within the range that favors linear search.
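
For readers following along, a simplified standalone sketch of that lookup
(linear scan over a handful of exponentially growing segments, then a
bsearch within the one that matches); the names here are illustrative only
and this is not the patch's actual code:

/*
 * Illustrative sketch only: lookup in a "multiarray" of dead-tuple segments.
 * Each segment holds sorted TIDs (encoded into a uint64 here for brevity),
 * each segment is larger than the previous one, and segments cover strictly
 * increasing TID ranges.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct TidSegment
{
    uint64_t    last_tid;       /* highest encoded TID stored in this segment */
    uint64_t   *tids;           /* sorted encoded TIDs */
    size_t      ntids;
} TidSegment;

typedef struct TidMultiArray
{
    TidSegment *segments;
    int         nsegments;      /* a handful even at 12GB, so linear scan is cheap */
} TidMultiArray;

static int
tid_cmp(const void *a, const void *b)
{
    uint64_t    la = *(const uint64_t *) a;
    uint64_t    lb = *(const uint64_t *) b;

    return (la > lb) - (la < lb);
}

bool
multiarray_lookup(const TidMultiArray *ma, uint64_t tid)
{
    for (int i = 0; i < ma->nsegments; i++)
    {
        const TidSegment *seg = &ma->segments[i];

        if (tid > seg->last_tid)
            continue;           /* past this segment's range, try the next one */

        /* the TID can only be in this segment; bsearch within it */
        return bsearch(&tid, seg->tids, seg->ntids,
                       sizeof(uint64_t), tid_cmp) != NULL;
    }
    return false;
}

Because the segments grow exponentially there are only a few of them even at
the maintenance_work_mem ceiling, so the per-lookup cost is dominated by the
bsearch inside the matching segment, much as with the single flat array.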

> For example, we could keep a bitmap with one bit per K
> pages.  If the bit is set, there is at least 1 dead tuple on that
> page; if clear, there are none.  When we see an index tuple, we
> consult the bitmap to determine whether we need to search the TID
> list.  We select K to be the smallest power of 2 such that the bitmap
> uses less memory than some threshold, perhaps 64kB.

I've been pondering something like that, but that's an optimization
that's quite orthogonal to the multiarray stuff.

>  Assuming that
> updates and deletes to the table have some locality, we should be able
> to skip a large percentage of the TID searches with a probe into this
> very compact bitmap.

I don't think you can assume locality




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Peter Geoghegan
On Tue, Sep 13, 2016 at 11:51 AM, Robert Haas  wrote:
> I think it's probably wrong to worry that an array-of-arrays is going
> to be meaningfully slower than a single array here.  It's basically
> costing you some small number of additional memory references per
> tuple, which I suspect isn't all that relevant for a bulk operation
> that does I/O, writes WAL, locks buffers, etc.

This analysis makes perfect sense to me.


-- 
Peter Geoghegan




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Robert Haas
On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada  wrote:
> Attached PoC patch changes the representation of dead tuple locations
> to the hashmap having tuple bitmap.
> The one hashmap entry consists of the block number and the TID bitmap
> of corresponding block, and the block number is the hash key of
> hashmap.
> Current implementation of this patch is not smart yet because each
> hashmap entry allocates the tuple bitmap with fixed
> size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
> LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
> In cases where one block can store only several tens of tuples, most of
> the bits would be wasted.
>
> After improving this patch as you suggested, I will measure the performance
> benefit.

We also need to consider the amount of memory that gets used.  What I
proposed - replacing the array with an array of arrays - would not
increase memory utilization significantly.  I don't think it would
have much impact on CPU utilization either.  It would require
replacing the call to bsearch() in lazy_heap_reaptid() with an
open-coded implementation of bsearch, or with one bsearch to find the
chunk and another to find the TID within the chunk, but that shouldn't
be very expensive.  For one thing, if the array chunks are around the
size I proposed (64MB), you've got more than ten million tuples per
chunk, so you can't have very many chunks unless your table is both
really large and possessed of quite a bit of dead stuff.

Now, if I'm reading it correctly, this patch allocates a 132-byte
structure for every page with at least one dead tuple.  In the worst
case where there's just one dead tuple per page, that's a 20x
regression in memory usage.  Actually, it's more like 40x, because
AllocSetAlloc rounds small allocation sizes up to the next-higher
power of two, which really stings for a 132-byte allocation, and then
adds a 16-byte header to each chunk.  But even 20x is clearly not
good.  There are going to be lots of real-world cases where this uses
way more memory to track the same number of dead tuples, and I'm
guessing that benchmarking is going to reveal that it's not faster,
either.

I think it's probably wrong to worry that an array-of-arrays is going
to be meaningfully slower than a single array here.  It's basically
costing you some small number of additional memory references per
tuple, which I suspect isn't all that relevant for a bulk operation
that does I/O, writes WAL, locks buffers, etc.  But if it is relevant,
then I think there are other ways to buy that performance back which
are likely to be more memory efficient than converting this to use a
hash table.  For example, we could keep a bitmap with one bit per K
pages.  If the bit is set, there is at least 1 dead tuple on that
page; if clear, there are none.  When we see an index tuple, we
consult the bitmap to determine whether we need to search the TID
list.  We select K to be the smallest power of 2 such that the bitmap
uses less memory than some threshold, perhaps 64kB.  Assuming that
updates and deletes to the table have some locality, we should be able
to skip a large percentage of the TID searches with a probe into this
very compact bitmap.  Note that we can set K = 1 for tables up to 4GB
in size, and even a 1TB table only needs K = 256.  Odds are very good
that a 1TB table being vacuumed has many 256-page ranges containing no
dead tuples at all ... and if this proves to be false and the dead
tuples are scattered uniformly throughout the table, then you should
probably be more worried about the fact that you're dumping a bare
minimum of 4GB of random I/O on your hapless disk controller than
about how efficient the TID search is.
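
A rough standalone sketch of that filter, under the stated assumptions (the
names are invented for illustration, the budget is 64kB, and K is the
smallest power of two that keeps the bitmap within it):

/*
 * Illustrative sketch only: a coarse "any dead tuples in this K-page range?"
 * filter consulted before the real (more expensive) dead-TID lookup.  The
 * bits array must start out zeroed and is populated while dead TIDs are
 * collected during the heap scan.
 */
#include <stdbool.h>
#include <stdint.h>

#define COARSE_BITMAP_BYTES (64 * 1024)     /* 64kB budget */

typedef struct CoarseDeadBitmap
{
    uint32_t    pages_per_bit;              /* K, a power of two */
    uint8_t     bits[COARSE_BITMAP_BYTES];
} CoarseDeadBitmap;

uint32_t
choose_pages_per_bit(uint32_t nblocks)
{
    uint32_t    k = 1;

    /* smallest power of two such that nblocks / k bits fit in the budget */
    while (nblocks / k > COARSE_BITMAP_BYTES * 8)
        k *= 2;
    return k;                               /* K = 1 up to 4GB, K = 256 at 1TB */
}

void
coarse_mark_dead(CoarseDeadBitmap *cb, uint32_t blkno)
{
    uint32_t    bit = blkno / cb->pages_per_bit;

    cb->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
}

bool
coarse_may_have_dead(const CoarseDeadBitmap *cb, uint32_t blkno)
{
    uint32_t    bit = blkno / cb->pages_per_bit;

    return (cb->bits[bit / 8] >> (bit % 8)) & 1;
}

An index scan would call coarse_may_have_dead() first and skip the TID
search entirely whenever it returns false, which is the whole point of the
locality assumption being debated here.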

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-09 Thread Masahiko Sawada
On Fri, Sep 9, 2016 at 12:33 PM, Pavan Deolasee
 wrote:
>
>
> On Thu, Sep 8, 2016 at 11:40 PM, Masahiko Sawada 
> wrote:
>>
>>
>>
>> Making it possible for vacuum to choose between two data representations
>> sounds good.
>> I implemented a patch that changes the dead tuple representation to a
>> bitmap before.
>> I will measure the performance of the bitmap representation again and post
>> the results.
>
>
> Sounds great! I haven't seen your patch, but what I would suggest is to
> compute page density (D) = relpages/(dead+live tuples) and experiment with
> bitmap of sizes of D to 2D bits per page. May I also suggest that instead of
> putting in efforts in implementing the overflow area,  just count how many
> dead TIDs would fall under overflow area for a given choice of bitmap size.
>

Isn't that formula "page density (D) = (dead+live tuples)/relpages"?

> It might be a good idea to experiment with different vacuum scale factor,
> varying between 2% to 20% (may be 2, 5, 10, 20). You can probably run a
> longish pgbench test on a large table and then save the data directory for
> repeated experiments, although I'm not sure if pgbench will be a good choice
> because HOT will prevent accumulation of dead pointers, in which case you
> may try adding another index on abalance column.

Thank you, I will experiment with this.

>
> It'll be worth measuring memory consumption of both representations as well
> as performance implications on index vacuum. I don't expect to see any major
> difference in either heap scans.
>

Yeah, it would be effective for index vacuum speed and for the number of
times index vacuum has to be executed.

The attached PoC patch changes the representation of dead tuple locations
to a hashmap holding per-block tuple bitmaps.
Each hashmap entry consists of the block number and the TID bitmap of the
corresponding block, and the block number is the hash key of the hashmap.
The current implementation of this patch is not smart yet because each
hashmap entry allocates the tuple bitmap with a fixed
size (LAZY_ALLOC_TUPLES), so each hash entry can store up to
LAZY_ALLOC_TUPLES (291 if the block size is 8kB) tuples.
In cases where one block can store only several tens of tuples, most of
the bits would be wasted.

After improving this patch as you suggested, I will measure the performance benefit.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From d1968d625ca1bb07711681a2d824737c53d27c8c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Thu, 8 Sep 2016 13:59:23 -0400
Subject: [PATCH] Collect dead tuple location as a bitmap.

---
 src/backend/commands/vacuumlazy.c | 232 +-
 1 file changed, 153 insertions(+), 79 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index b5fb325..ce7bd7e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -98,6 +98,28 @@
  */
 #define SKIP_PAGES_THRESHOLD   ((BlockNumber) 32)
 
+#define BITS_PER_BITMAP_CHUNK 32
+#define BITMAP_CHUNKS_PER_PAGE \
+   ((int) ((LAZY_ALLOC_TUPLES / BITS_PER_BITMAP_CHUNK) + 1))
+#define BitoffsetToOffsetNumber(chunk, offset) \
+   (((chunk) * BITS_PER_BITMAP_CHUNK) + (offset) + 1)
+#define OffsetNumberToChunk(offset) \
+   ((offset) / BITS_PER_BITMAP_CHUNK)
+#define OffsetNumberToBitoffset(offset) \
+   ((offset) % BITS_PER_BITMAP_CHUNK)
+
+typedef struct PageEntry
+{
+   BlockNumber blockno;
+   uint32  bm[BITMAP_CHUNKS_PER_PAGE]; /* tuple bitmap of blockno */
+} PageEntry;
+
+typedef struct DeadTupleMap
+{
+   HTAB*pagetable;
+   int npages; /* # of blocks hashmap has */
+} DeadTupleMap;
+
 typedef struct LVRelStats
 {
/* hasindex = true means two-pass strategy; false means one-pass */
@@ -120,6 +142,9 @@ typedef struct LVRelStats
 int num_dead_tuples;/* current # of entries */
 int max_dead_tuples;/* # slots allocated in array */
 ItemPointer dead_tuples;/* array of ItemPointerData */
+
+   DeadTupleMap *dead_tuple_map; /* hash map of dead ItemPointerData */
+
int num_index_scans;
TransactionId latestRemovedXid;
boollock_waiter_detected;
@@ -148,20 +173,19 @@ static void lazy_vacuum_index(Relation indrel,
 static void lazy_cleanup_index(Relation indrel,
   IndexBulkDeleteResult *stats,
   LVRelStats *vacrelstats);
-static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+static void lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
+PageEntry *entry, LVRelStats *vacrelstats, Buffer *vmbuffer);
 static bool 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Pavan Deolasee
On Thu, Sep 8, 2016 at 11:40 PM, Masahiko Sawada 
wrote:

>
>
> Making it possible for vacuum to choose between two data representations
> sounds good.
> I implemented a patch that changes the dead tuple representation to a
> bitmap before.
> I will measure the performance of the bitmap representation again and post
> the results.


Sounds great! I haven't seen your patch, but what I would suggest is to
compute page density (D) = relpages/(dead+live tuples) and experiment with
bitmaps of sizes from D to 2D bits per page. May I also suggest that,
instead of putting effort into implementing the overflow area, you just
count how many dead TIDs would fall into the overflow area for a given
choice of bitmap size.

It might be a good idea to experiment with different vacuum scale factors,
varying between 2% and 20% (maybe 2, 5, 10, 20). You can probably run a
longish pgbench test on a large table and then save the data directory for
repeated experiments, although I'm not sure if pgbench will be a good
choice because HOT will prevent accumulation of dead pointers, in which
case you may try adding another index on the abalance column.

It'll be worth measuring memory consumption of both representations as well
as performance implications on index vacuum. I don't expect to see any
major difference in either heap scans.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Masahiko Sawada
On Thu, Sep 8, 2016 at 11:54 PM, Pavan Deolasee
 wrote:
>
>
> On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire 
> wrote:
>>
>> On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark  wrote:
>> > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs 
>> > wrote:
>> >> On 6 September 2016 at 19:59, Tom Lane  wrote:
>> >>
>> >>> The idea of looking to the stats to *guess* about how many tuples are
>> >>> removable doesn't seem bad at all.  But imagining that that's going to
>> >>> be
>> >>> exact is folly of the first magnitude.
>> >>
>> >> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
>> >> so I think we agree that a reasonably accurate, yet imprecise
>> >> calculation is possible in most cases.
>> >
>> > That would all be well and good if it weren't trivial to do what
>> > Robert suggested. This is just a large unsorted list that we need to
>> > iterate throught. Just allocate chunks of a few megabytes and when
>> > it's full allocate a new chunk and keep going. There's no need to get
>> > tricky with estimates and resizing and whatever.
>>
>> I agree. While the idea of estimating the right size sounds promising
>> a priori, considering the estimate can go wrong and over or
>> underallocate quite severely, the risks outweigh the benefits when you
>> consider the alternative of a dynamic allocation strategy.
>>
>> Unless the dynamic strategy has a bigger CPU impact than expected, I
>> believe it's a superior approach.
>>
>
> How about a completely different representation for the TID array? Now this
> is probably not something new, but I couldn't find if the exact same idea
> was discussed before. I also think it's somewhat orthogonal to what we are
> trying to do here, and will probably be a bigger change. But I thought I'll
> mention since we are at the topic.
>
> What I propose is to use a simple bitmap to represent the tuples. If a tuple
> at (blkno, offset) is dead then the corresponding bit in the bitmap is set.
> So clearly the searches through dead tuples are O(1) operations, important
> for very large tables and large arrays.
>
> The challenge really is that a heap page can theoretically have
> MaxOffsetNumber line pointers (or, to be precise, the maximum possible
> offset number). For an 8K block, that comes to about 2048. Having so many
> bits per page is neither practical nor optimal. But in practice the largest
> offset on a heap page should not be significantly greater than
> MaxHeapTuplesPerPage, which is a more reasonable value of 291 on my
> machine. Again, that's with zero-sized tuples; for real-life large tables
> with much wider tuples, the number may go down even further.
>
> So we cap the offsets represented in the bitmap to some realistic value,
> computed by looking at page density and then multiplying it by a small
> factor (not more than two) to take into account LP_DEAD and LP_REDIRECT line
> pointers. That should practically represent majority of the dead tuples in
> the table, but we then keep an overflow area to record tuples beyond the
> limit set for per page. The search routine will do a direct lookup for
> offsets less than the limit and search in the sorted overflow area for
> offsets beyond the limit.
>
> For example, for a table with 60 bytes wide tuple (including 24 byte
> header), each page can approximately have 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> bitmap. First 272 offsets in every page are represented in the bitmap and
> anything greater than are in overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> tuples. So the bitmap will take more space for small tuples or when vacuum
> is run very aggressively, both of which seem unlikely for very large
> tables. Of course the calculation does not take into account the space
> needed by the overflow area, but I expect that to be small.
>
> I guess we can make a choice between two representations at the start
> looking at various table stats. We can also be smart and change from bitmap
> to traditional representation as we scan the table and see many more tuples
> in the overflow region than we provisioned for. There will be some
> challenges in converting representation mid-way, especially in terms of
> memory allocation, but I think those can be sorted out if we think that the
> idea has merit.
>

Making it possible for vacuum to choose between two data representations
sounds good.
I implemented a patch that changes the dead tuple representation to a bitmap before.
I will measure the performance of the bitmap representation again and post the results.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Pavan Deolasee
On Thu, Sep 8, 2016 at 8:42 PM, Claudio Freire 
wrote:

> On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee
>  wrote:
> > For example, for a table with 60 bytes wide tuple (including 24 byte
> > header), each page can approximately have 8192/60 = 136 tuples. Say we
> > provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> > bitmap. First 272 offsets in every page are represented in the bitmap and
> > anything greater than are in overflow region. On the other hand, the
> > current representation will need about 16 bytes per page assuming 2% dead
> > tuples, 40 bytes per page assuming 5% dead tuples and 80 bytes assuming
> > 10% dead tuples. So the bitmap will take more space for small tuples or
> > when vacuum is run very aggressively, both of which seem unlikely for
> > very large tables. Of course the calculation does not take into account
> > the space needed by the overflow area, but I expect that to be small.
>
> I thought about something like this, but it could be extremely
> inefficient for mostly frozen tables, since the bitmap cannot account
> for frozen pages without losing the O(1) lookup characteristic
>

Well, that's correct. But I thought the whole point is the case where there
is a large number of dead tuples, which requires a lot of memory. If my math
as explained above was correct, then even at 5% dead tuples the bitmap
representation will consume approximately the same memory but provide O(1)
search time.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

