Re: [HACKERS] Analyzing foreign tables memory problems

2012-05-15 Thread Noah Misch
On Mon, May 14, 2012 at 09:21:20AM +0200, Albe Laurenz wrote:
 Noah Misch wrote:
 Just thinking out loud, we could provide an extern Datum AnalyzeWideValue;
 and direct FDW authors to use that particular datum.  It could look like a
 toasted datum of external size WIDTH_THRESHOLD+1 but bear va_toastrelid ==
 InvalidOid.  Then, if future code movement leads us to actually examine one
 of these values, we'll get an early, hard error.
 
 That would be very convenient indeed.
 
 Even better would be a function
 extern Datum createAnalyzeWideValue(int width)
 so that row width calculations could be more accurate.

Yes; good call.



Re: [HACKERS] Analyzing foreign tables memory problems

2012-05-14 Thread Albe Laurenz
Noah Misch wrote:
 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
so that the authors of foreign data wrappers are aware of the
problem and can avoid it on their side.
This would be quite simple.

 Seems reasonable.  How would the FDW return an indication that a value was
 non-NULL but removed due to excess width?

 The FDW would return a value of length WIDTH_THRESHOLD+1 that is
 long enough to be recognized as too long, but not long enough to
 cause a problem.

 Here is a simple patch for that.
 
 It feels to me like an undue hack to ask FDW authors to synthesize such
 values.  It's easy enough for data types such as text/bytea.  In general,
 though, simple truncation may not produce a valid value of the type.  That
 shouldn't matter, since the next action taken on the value should be to
 discard it, but it's fragile.  Can we do better?
 
 Just thinking out loud, we could provide an extern Datum AnalyzeWideValue;
 and direct FDW authors to use that particular datum.  It could look like a
 toasted datum of external size WIDTH_THRESHOLD+1 but bear va_toastrelid ==
 InvalidOid.  Then, if future code movement leads us to actually examine one
 of these values, we'll get an early, hard error.

That would be very convenient indeed.

Even better would be a function
extern Datum createAnalyzeWideValue(int width)
so that row width calculations could be more accurate.
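
A minimal sketch of what such a function could look like, assuming it reuses
the external TOAST pointer machinery of the 9.2-era sources (struct
varatt_external, TOAST_POINTER_SIZE, SET_VARSIZE_EXTERNAL and
VARDATA_EXTERNAL from postgres.h and access/tuptoaster.h); nothing like this
exists in core yet, so treat it as an illustration rather than a tested
patch:

    #include "postgres.h"
    #include "access/tuptoaster.h"

    /*
     * Sketch only: build a dummy varlena datum that ANALYZE will classify
     * as "too wide" without ever detoasting it.  It is shaped like an
     * external TOAST pointer, but va_toastrelid is InvalidOid, so any later
     * attempt to actually fetch the value errors out instead of silently
     * reading garbage.
     */
    Datum
    createAnalyzeWideValue(int width)
    {
        struct varatt_external toast_pointer;
        struct varlena *result;

        toast_pointer.va_rawsize = width;            /* what toast_raw_datum_size() reports */
        toast_pointer.va_extsize = width - VARHDRSZ; /* keeps it from looking compressed */
        toast_pointer.va_valueid = InvalidOid;       /* no real TOAST value behind it */
        toast_pointer.va_toastrelid = InvalidOid;    /* no TOAST table: detoasting fails hard */

        result = (struct varlena *) palloc(TOAST_POINTER_SIZE);
        SET_VARSIZE_EXTERNAL(result, TOAST_POINTER_SIZE);
        memcpy(VARDATA_EXTERNAL(result), &toast_pointer, sizeof(toast_pointer));

        return PointerGetDatum(result);
    }

The important property is that va_toastrelid is InvalidOid, so anything that
tries to detoast the dummy value fails immediately rather than reading from a
nonexistent TOAST table.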

Yours,
Laurenz Albe



Re: [HACKERS] Analyzing foreign tables memory problems

2012-05-13 Thread Noah Misch
On Wed, May 02, 2012 at 12:20:39PM +0200, Albe Laurenz wrote:
  1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
 so that the authors of foreign data wrappers are aware of the
 problem and can avoid it on their side.
 This would be quite simple.
 
  Seems reasonable.  How would the FDW return an indication that a value was
  non-NULL but removed due to excess width?
  
  The FDW would return a value of length WIDTH_THRESHOLD+1 that is
  long enough to be recognized as too long, but not long enough to
  cause a problem.
 
 Here is a simple patch for that.

It feels to me like an undue hack to ask FDW authors to synthesize such values.
It's easy enough for data types such as text/bytea.  In general, though,
simple truncation may not produce a valid value of the type.  That shouldn't
matter, since the next action taken on the value should be to discard it, but
it's fragile.  Can we do better?

Just thinking out loud, we could provide an extern Datum AnalyzeWideValue;
and direct FDW authors to use that particular datum.  It could look like a
toasted datum of external size WIDTH_THRESHOLD+1 but bear va_toastrelid ==
InvalidOid.  Then, if future code movement leads us to actually examine one of
these values, we'll get an early, hard error.
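
For reference, the analyze.c code such a datum has to satisfy looks roughly
like this (condensed from compute_scalar_stats; compute_minimal_stats has the
same pattern):

    if (is_varlena)
    {
        total_width += VARSIZE_ANY(DatumGetPointer(value));

        /*
         * Check whether the value is excessively wide, and if so don't
         * detoast at all --- just count it and move on.  A dummy external
         * pointer whose reported raw size exceeds WIDTH_THRESHOLD takes
         * this branch and is never dereferenced.
         */
        if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)
        {
            toowide_cnt++;
            continue;
        }
        value = PointerGetDatum(PG_DETOAST_DATUM(value));
    }

toast_raw_datum_size() only inspects the pointer itself, so a dummy external
datum whose va_rawsize exceeds WIDTH_THRESHOLD is counted and skipped without
any TOAST access.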

Thanks,
nm



Re: [HACKERS] Analyzing foreign tables memory problems

2012-05-02 Thread Albe Laurenz
I wrote:
 Noah Misch wrote:
 During ANALYZE, in analyze.c, functions compute_minimal_stats
 and compute_scalar_stats, values whose length exceeds
 WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
 other than that they are counted as too wide rows and assumed
 to be all different.

 This works fine with regular tables;

 With foreign tables the situation is different.  Even though
 values exceeding WIDTH_THRESHOLD won't get used, the complete
 rows will be fetched from the foreign table.  This can easily
 exhaust maintenance_work_mem.

 I can think of two remedies:
 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
so that the authors of foreign data wrappers are aware of the
problem and can avoid it on their side.
This would be quite simple.

 Seems reasonable.  How would the FDW return an indication that a value was
 non-NULL but removed due to excess width?
 
 The FDW would return a value of length WIDTH_THRESHOLD+1 that is
 long enough to be recognized as too long, but not long enough to
 cause a problem.

Here is a simple patch for that.
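
To illustrate the wrapper side (this is not part of the attached patch; the
helper name and its raw-bytes interface are made up), a capped value for a
text-like column could be built roughly like this once WIDTH_THRESHOLD is
visible outside analyze.c:

    #include "postgres.h"
    #include "commands/vacuum.h"    /* would provide WIDTH_THRESHOLD after the patch */

    /*
     * Sketch only: cap a fetched value at WIDTH_THRESHOLD + 1 bytes so that
     * ANALYZE counts it as "too wide" without the wrapper having to keep
     * the full contents.  buf/len stand in for whatever the wrapper
     * received from the remote side.
     */
    static Datum
    make_capped_text_datum(const char *buf, int len)
    {
        int     keep = Min(len, WIDTH_THRESHOLD + 1);
        text   *t = (text *) palloc(VARHDRSZ + keep);

        SET_VARSIZE(t, VARHDRSZ + keep);
        memcpy(VARDATA(t), buf, keep);
        return PointerGetDatum(t);
    }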

Yours,
Laurenz Albe


analyze.patch
Description: analyze.patch



Re: [HACKERS] Analyzing foreign tables memory problems

2012-04-30 Thread Tom Lane
Albe Laurenz laurenz.a...@wien.gv.at writes:
 During ANALYZE, in analyze.c, functions compute_minimal_stats
 and compute_scalar_stats, values whose length exceeds
 WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
 other than that they are counted as too wide rows and assumed
 to be all different.

 This works fine with regular tables; values exceeding that threshold
 don't get detoasted and won't consume excessive memory.

 With foreign tables the situation is different.  Even though
 values exceeding WIDTH_THRESHOLD won't get used, the complete
 rows will be fetched from the foreign table.  This can easily
 exhaust maintenance_work_mem.

I'm fairly skeptical that this is a real problem, and would prefer not
to complicate wrappers until we see some evidence from the field that
it's worth worrying about.  The WIDTH_THRESHOLD logic was designed a
dozen years ago when common settings for work_mem were a lot smaller
than today.  Moreover, to my mind it's always been about avoiding
detoasting operations as much as saving memory, and we don't have
anything equivalent to that consideration in foreign data wrappers.

regards, tom lane



Re: [HACKERS] Analyzing foreign tables memory problems

2012-04-30 Thread Albe Laurenz
Tom Lane wrote:
 During ANALYZE, in analyze.c, functions compute_minimal_stats
 and compute_scalar_stats, values whose length exceeds
 WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
 other than that they are counted as too wide rows and assumed
 to be all different.
 
 This works fine with regular tables; values exceeding that threshold
 don't get detoasted and won't consume excessive memory.
 
 With foreign tables the situation is different.  Even though
 values exceeding WIDTH_THRESHOLD won't get used, the complete
 rows will be fetched from the foreign table.  This can easily
 exhaust maintenance_work_mem.

 I'm fairly skeptical that this is a real problem, and would prefer not
 to complicate wrappers until we see some evidence from the field that
 it's worth worrying about.  The WIDTH_THRESHOLD logic was designed a
 dozen years ago when common settings for work_mem were a lot smaller
 than today.  Moreover, to my mind it's always been about avoiding
 detoasting operations as much as saving memory, and we don't have
 anything equivalent to that consideration in foreign data wrappers.

If I have a table with 100000 rows and default_statistics_target
at 100, then a sample of 30000 rows will be taken.

If each row contains binary data of 1MB (an image), then the
data structure returned will use about 30 GB of memory, which
will probably exceed maintenance_work_mem.
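
For reference, the arithmetic (ANALYZE's target sample size is 300 times the
statistics target):

    300 * 100 = 30000 sample rows
    30000 rows * 1 MB/row = about 30 GB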

Or is there a flaw in my reasoning?

Yours,
Laurenz Albe



Re: [HACKERS] Analyzing foreign tables memory problems

2012-04-30 Thread Tom Lane
Albe Laurenz laurenz.a...@wien.gv.at writes:
 Tom Lane wrote:
 I'm fairly skeptical that this is a real problem, and would prefer not
 to complicate wrappers until we see some evidence from the field that
 it's worth worrying about.

 If I have a table with 100000 rows and default_statistics_target
 at 100, then a sample of 30000 rows will be taken.

 If each row contains binary data of 1MB (an image), then the
 data structure returned will use about 30 GB of memory, which
 will probably exceed maintenance_work_mem.

 Or is there a flaw in my reasoning?

Only that I don't believe this is a real-world scenario for a foreign
table.  If you have a foreign table in which all, or even many, of the
rows are that wide, its performance is going to suck so badly that
you'll soon look for a different schema design anyway.

I don't want to complicate FDWs for this until it's an actual bottleneck
in real applications, which it may never be, and certainly won't be
until we've gone through a few rounds of performance refinement for
basic operations.

regards, tom lane



Re: [HACKERS] Analyzing foreign tables memory problems

2012-04-30 Thread Simon Riggs
On Mon, Apr 30, 2012 at 3:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Albe Laurenz laurenz.a...@wien.gv.at writes:
 During ANALYZE, in analyze.c, functions compute_minimal_stats
 and compute_scalar_stats, values whose length exceeds
 WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
 other than that they are counted as too wide rows and assumed
 to be all different.

 This works fine with regular tables; values exceeding that threshold
 don't get detoasted and won't consume excessive memory.

 With foreign tables the situation is different.  Even though
 values exceeding WIDTH_THRESHOLD won't get used, the complete
 rows will be fetched from the foreign table.  This can easily
 exhaust maintenance_work_mem.

 I'm fairly skeptical that this is a real problem

AFAIK it's not possible to select all columns from an Oracle database.
If you use an unqualified LONG column as part of the query then you
get an error.

So there are issues with simply requesting data for analysis.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Analyzing foreign tables memory problems

2012-04-30 Thread Noah Misch
On Mon, Apr 30, 2012 at 12:27:45PM +0200, Albe Laurenz wrote:
 During ANALYZE, in analyze.c, functions compute_minimal_stats
 and compute_scalar_stats, values whose length exceeds
 WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
 other than that they are counted as too wide rows and assumed
 to be all different.
 
 This works fine with regular tables; values exceeding that threshold
 don't get detoasted and won't consume excessive memory.
 
 With foreign tables the situation is different.  Even though
 values exceeding WIDTH_THRESHOLD won't get used, the complete
 rows will be fetched from the foreign table.  This can easily
 exhaust maintenance_work_mem.
 
 A foreign data wrapper has no good way to counter the problem.
 It can return truncated values in its AcquireSampleRowsFunc,
 but WIDTH_THRESHOLD is private to analyze.c and it's a bad idea
 to hard code a cutoff limit of 1025.
 
 I can think of two remedies:
 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
so that the authors of foreign data wrappers are aware of the
problem and can avoid it on their side.
This would be quite simple.

Seems reasonable.  How would the FDW return an indication that a value was
non-NULL but removed due to excess width?

Not all databases can cheaply filter out wide column values; by the time the
remote side has an exact width, the remote I/O damage may already be done.  To
dodge that problem, when a column has SET STATISTICS 0, the FDW should be
able to completely omit reading it.  (I haven't studied the API needs, if any,
to make that possible.)
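
A rough sketch of what that could look like for a wrapper that builds its
remote sampling query by hand (nothing here is an existing API; the helper
name is made up, and it simply skips columns with attstattarget = 0):

    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "utils/rel.h"

    /*
     * Sketch only: build the column list for the remote sampling query,
     * skipping columns whose statistics target is 0 so their potentially
     * huge values are never fetched.  Identifier quoting and other
     * real-world details are omitted.
     */
    static char *
    build_sample_column_list(Relation rel)
    {
        TupleDesc   tupdesc = RelationGetDescr(rel);
        StringInfoData buf;
        bool        first = true;
        int         i;

        initStringInfo(&buf);
        for (i = 0; i < tupdesc->natts; i++)
        {
            Form_pg_attribute att = tupdesc->attrs[i];

            if (att->attisdropped)
                continue;                   /* dropped column */
            if (att->attstattarget == 0)
                continue;                   /* SET STATISTICS 0: don't ship it */
            appendStringInfo(&buf, "%s%s", first ? "" : ", ",
                             NameStr(att->attname));
            first = false;
        }
        if (first)
            appendStringInfoString(&buf, "NULL");   /* nothing left to select */
        return buf.data;
    }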

 2) Instead of one callback that returns all sample rows, have
a callback that just returns the next table row (or the next
table row from a subset of the pages of the table in the
internal case).  This function could be called in a short-lived
memory context.  Vitter's algorithm for selecting a sample
and the truncation of excessively long values would then be
handled in analyze.c.
This would avoid the problem completely and make it easier
to write a foreign data wrapper.
I haven't thought this out completely, and it would require
bigger changes to analyze.c and the API than are probably
welcome this close to beta.

This solves the 30 GiB of memory consumption in your downthread example, but
you'll still read 30 GiB on the remote side and ship it all over the network.
To call this fixed, we'll need something like (1) that lets the FDW limit
volume at the remote side.

Thanks,
nm



Re: [HACKERS] Analyzing foreign tables memory problems

2012-04-30 Thread Albe Laurenz
Simon Riggs wrote:
 During ANALYZE, in analyze.c, functions compute_minimal_stats
 and compute_scalar_stats, values whose length exceeds
 WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
 other than that they are counted as too wide rows and assumed
 to be all different.

 This works fine with regular tables; values exceeding that threshold
 don't get detoasted and won't consume excessive memory.

 With foreign tables the situation is different.  Even though
 values exceeding WIDTH_THRESHOLD won't get used, the complete
 rows will be fetched from the foreign table.  This can easily
 exhaust maintenance_work_mem.

 I'm fairly skeptical that this is a real problem

 AFAIK it's not possible to select all columns from an Oracle database.
 If you use an unqualified LONG column as part of the query then you
 get an error.

 So there are issues with simply requesting data for analysis.

To elaborate on the specific case of Oracle, I have given up on LONG
since a) it has been deprecated for a long time and
b) it is not possible to retrieve a LONG column unless you know
in advance how long it is.

But you can have several BLOB and CLOB columns in a table, each
of which can be arbitrarily large and can lead to the problem
I described.

Yours,
Laurenz Albe



Re: [HACKERS] Analyzing foreign tables memory problems

2012-04-30 Thread Albe Laurenz
Tom Lane wrote:
 I'm fairly skeptical that this is a real problem, and would prefer not
 to complicate wrappers until we see some evidence from the field that
 it's worth worrying about.

 If I have a table with 100000 rows and default_statistics_target
 at 100, then a sample of 30000 rows will be taken.

 If each row contains binary data of 1MB (an image), then the
 data structure returned will use about 30 GB of memory, which
 will probably exceed maintenance_work_mem.

 Or is there a flaw in my reasoning?

 Only that I don't believe this is a real-world scenario for a foreign
 table.  If you have a foreign table in which all, or even many, of the
 rows are that wide, its performance is going to suck so badly that
 you'll soon look for a different schema design anyway.

Of course it wouldn't work well to SELECT * from such a foreign table,
but it would work well enough to get one or a few rows at a time,
which is probably such a table's purpose in life anyway.

 I don't want to complicate FDWs for this until it's an actual bottleneck
 in real applications, which it may never be, and certainly won't be
 until we've gone through a few rounds of performance refinement for
 basic operations.

I agree that it may not be the right thing to do something invasive
to solve an anticipated problem that may never be one.

So scrap my second idea.  But I think that exposing WIDTH_THRESHOLD
wouldn't be unreasonable, would it?

Yours,
Laurenz Albe



Re: [HACKERS] Analyzing foreign tables memory problems

2012-04-30 Thread Albe Laurenz
Noah Misch wrote:
 During ANALYZE, in analyze.c, functions compute_minimal_stats
 and compute_scalar_stats, values whose length exceeds
 WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
 other than that they are counted as too wide rows and assumed
 to be all different.
 
 This works fine with regular tables;

 With foreign tables the situation is different.  Even though
 values exceeding WIDTH_THRESHOLD won't get used, the complete
 rows will be fetched from the foreign table.  This can easily
 exhaust maintenance_work_mem.

 I can think of two remedies:
 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
so that the authors of foreign data wrappers are aware of the
problem and can avoid it on their side.
This would be quite simple.

 Seems reasonable.  How would the FDW return an indication that a value was
 non-NULL but removed due to excess width?

The FDW would return a value of length WIDTH_THRESHOLD+1 that is
long enough to be recognized as too long, but not long enough to
cause a problem.

 Not all databases can cheaply filter out wide column values; by the time the
 remote side has an exact width, the remote I/O damage may already be done.  To
 dodge that problem, when a column has SET STATISTICS 0, the FDW should be
 able to completely omit reading it.  (I haven't studied the API needs, if any,
 to make that possible.)

Depending on the capabilities of the remote side, an FDW can
do more or less intelligent things to avoid the problem.
But it must know WIDTH_THRESHOLD.
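
Concretely, that first remedy amounts to little more than moving the existing
constant into the public header, for example:

    /* in commands/vacuum.h --- the same value analyze.c hard-codes today */
    #define WIDTH_THRESHOLD     1024

together with a note in the FDW documentation that over-wide values may be
returned truncated to WIDTH_THRESHOLD + 1 bytes.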

Disabling statistics for a column as a workaround is an
interesting idea, but would be more work for the FDW writer
and the user.

 2) Instead of one callback that returns all sample rows, have
a callback that just returns the next table row (or the next
table row from a subset of the pages of the table in the
internal case).  This function could be called in a short-lived
memory context.  Vitter's algorithm for selecting a sample
and the truncation of excessively long values would then be
handled in analyze.c.
This would avoid the problem completely and make it easier
to write a foreign data wrapper.

 This solves the 30 GiB of memory consumption in your downthread example, but
 you'll still read 30 GiB on the remote side and ship it all over the network.
 To call this fixed, we'll need something like (1) that lets the FDW limit
 volume at the remote side.

You are right.  I guess the first idea is the more promising one.

Yours,
Laurenz Albe
