Re: [HACKERS] [NOVICE] systable_getnext_ordered

2011-02-01 Thread YAMAMOTO Takashi
hi,

 I wrote:
 y...@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
 after systable_getnext_ordered returned NULL, is it ok to call it again?
 
 I wouldn't rely on it working.
 
 i'm wondering because inv_truncate seems to do it while expecting NULL.
 
 Hmm, that may well be a bug.  Have you tested it?
 
 I looked at this a bit more closely, and basically the answer is that
 inv_truncate is accidentally failing to fail.  What will actually happen
 if systable_getnext_ordered is called another time, at least when using
 a btree index, is that it will start the same requested scan over again,
 ie deliver all the tuples it did the first time (which is none, in this
 case).  That's an implementation artifact, but since the behavior is
 undefined in the first place, it's not wrong.
 
 Now, if inv_truncate's initial call on systable_getnext_ordered returns
 NULL (ie, the truncation point is past the current EOF page), it will
 fall through to the "Write a brand new page" code, which will create and
 insert a partial page at the truncation point.  It then goes to the
 delete-all-remaining-pages loop.  Because that starts a fresh scan with
 the very same scan key conditions, you might expect that it would find
 and delete the page it just inserted --- causing the apparent EOF of the
 blob to be wrong afterwards.  It accidentally fails to do that because
 the new tuple postdates the snapshot it's scanning with.  So the loop
 terminates having found no matching tuples, and all is well.
 
 So this code is confusing, inefficient (performing a useless search of
 the index), only works because of an obscure consideration not explained
 in the comments, and sets a bad precedent for people to follow.  I'm
 going to go change it to explicitly not do the final loop if the initial
 search failed.  It's not a bug, exactly, but it's sure lousy coding.
 Thanks for pointing it out.

thanks for the quick investigation and fix.

the attached patch avoids unnecessary detoasting and EOF marker pages
when possible.  does it make sense?

YAMAMOTO Takashi

 
   regards, tom lane
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers
diff --git a/src/backend/storage/large_object/inv_api.c b/src/backend/storage/large_object/inv_api.c
index 01e3492..60a9c69 100644
--- a/src/backend/storage/large_object/inv_api.c
+++ b/src/backend/storage/large_object/inv_api.c
@@ -178,10 +178,14 @@ myLargeObjectExists(Oid loid, Snapshot snapshot)
 static int32
 getbytealen(bytea *data)
 {
-	Assert(!VARATT_IS_EXTENDED(data));
-	if (VARSIZE(data) < VARHDRSZ)
-		elog(ERROR, "invalid VARSIZE(data)");
-	return (VARSIZE(data) - VARHDRSZ);
+	Size		size;
+
+	size = toast_raw_datum_size(PointerGetDatum(data));
+	if (size < VARHDRSZ)
+		elog(ERROR, "invalid size");
+	if (size > VARHDRSZ + LOBLKSIZE)
+		elog(ERROR, "too large page");
+	return (size - VARHDRSZ);
 }
 
 
@@ -359,22 +363,12 @@ inv_getsize(LargeObjectDesc *obj_desc)
{
Form_pg_largeobject data;
bytea  *datafield;
-	bool		pfreeit;
 
 	if (HeapTupleHasNulls(tuple))	/* paranoia */
 		elog(ERROR, "null field found in pg_largeobject");
 	data = (Form_pg_largeobject) GETSTRUCT(tuple);
 	datafield = &(data->data);		/* see note at top of file */
-	pfreeit = false;
-	if (VARATT_IS_EXTENDED(datafield))
-	{
-		datafield = (bytea *)
-			heap_tuple_untoast_attr((struct varlena *) datafield);
-		pfreeit = true;
-	}
 	lastbyte = data->pageno * LOBLKSIZE + getbytealen(datafield);
-	if (pfreeit)
-		pfree(datafield);
}
 
systable_endscan_ordered(sd);
@@ -724,8 +718,9 @@ inv_write(LargeObjectDesc *obj_desc, const char *buf, int nbytes)
 void
 inv_truncate(LargeObjectDesc *obj_desc, int len)
 {
-	int32		pageno = (int32) (len / LOBLKSIZE);
-	int			off;
+   const int32 pageno = (int32) (len / LOBLKSIZE);
+   int32   reqpageno;
+   const int   off = len % LOBLKSIZE;  /* offset in the page */
ScanKeyData skey[2];
SysScanDesc sd;
HeapTuple   oldtuple;
@@ -741,6 +736,7 @@ inv_truncate(LargeObjectDesc *obj_desc, int len)
Datum   values[Natts_pg_largeobject];
boolnulls[Natts_pg_largeobject];
boolreplace[Natts_pg_largeobject];
+	bool		prevpagefull;
CatalogIndexState indstate;
 
Assert(PointerIsValid(obj_desc));
@@ -770,10 +766,20 @@ inv_truncate(LargeObjectDesc *obj_desc, int len)

Re: [HACKERS] [NOVICE] systable_getnext_ordered

2011-02-01 Thread YAMAMOTO Takashi
hi,

thanks for taking a look.

 y...@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
 the attached patch avoids unnecessary detoasting and EOF marker pages
 when possible.  does it make sense?
 
 The blob page size is already chosen not to allow for out-of-line
 storage, not to mention that pg_largeobject doesn't have a TOAST table.
 So I think avoiding detoasting is largely a waste of time.

doesn't detoasting involve decompression?

 I'm
 unexcited about the other consideration too --- it looks to me like it
 just makes truncation slower, more complicated, and hence more
 bug-prone, in return for a possible speedup that probably nobody will
 ever notice.

slower?  it depends, i guess.

my primary motivation for that part of the patch was to save some space for
certain workloads.  (besides that, leaving unnecessary rows isn't neat,
but that might be a matter of taste.)

YAMAMOTO Takashi

 
   regards, tom lane
 
 -- 
 Sent via pgsql-novice mailing list (pgsql-nov...@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-novice



Re: [HACKERS] [NOVICE] systable_getnext_ordered

2011-01-31 Thread Tom Lane
y...@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
 the attached patch avoids unnecessary detoasting and EOF marker pages
 when possible.  does it make sense?

The blob page size is already chosen not to allow for out-of-line
storage, not to mention that pg_largeobject doesn't have a TOAST table.
So I think avoiding detoasting is largely a waste of time.  I'm
unexcited about the other consideration too --- it looks to me like it
just makes truncation slower, more complicated, and hence more
bug-prone, in return for a possible speedup that probably nobody will
ever notice.

regards, tom lane



Re: [HACKERS] [NOVICE] systable_getnext_ordered

2011-01-26 Thread Tom Lane
I wrote:
 y...@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes:
 after systable_getnext_ordered returned NULL, is it ok to call it again?

 I wouldn't rely on it working.

 i'm wondering because inv_truncate seems to do it while expecting NULL.

 Hmm, that may well be a bug.  Have you tested it?

I looked at this a bit more closely, and basically the answer is that
inv_truncate is accidentally failing to fail.  What will actually happen
if systable_getnext_ordered is called another time, at least when using
a btree index, is that it will start the same requested scan over again,
ie deliver all the tuples it did the first time (which is none, in this
case).  That's an implementation artifact, but since the behavior is
undefined in the first place, it's not wrong.

Now, if inv_truncate's initial call on systable_getnext_ordered returns
NULL (ie, the truncation point is past the current EOF page), it will
fall through to the "Write a brand new page" code, which will create and
insert a partial page at the truncation point.  It then goes to the
delete-all-remaining-pages loop.  Because that starts a fresh scan with
the very same scan key conditions, you might expect that it would find
and delete the page it just inserted --- causing the apparent EOF of the
blob to be wrong afterwards.  It accidentally fails to do that because
the new tuple postdates the snapshot it's scanning with.  So the loop
terminates having found no matching tuples, and all is well.

So this code is confusing, inefficient (performing a useless search of
the index), only works because of an obscure consideration not explained
in the comments, and sets a bad precedent for people to follow.  I'm
going to go change it to explicitly not do the final loop if the initial
search failed.  It's not a bug, exactly, but it's sure lousy coding.
Thanks for pointing it out.

regards, tom lane
