> > What would be wrong with letting it degrade to lossy?  I suppose the
> > reason it's trying to avoid that is to avoid having to recheck all the
> > rows on that page when it comes time to do the index insertion; but
> > surely having to do that is better than having arbitrary, unpredictable
> > failure conditions.
>
> No, I don't think that's it.  See here, beginning with "the problem
> with lossy tbm has two aspects":
>
> http://archives.postgresql.org/message-id/4974b002.3040...@sigaev.ru
Right. Some comments on those points:

 - The amgettuple interface has no way to work with a page-wide result
   instead of an exact ItemPointer: amgettuple cannot return just a block
   number the way amgetbitmap can.

It's not so difficult to teach GIN to return ItemPointers one by one from the pending list and so eliminate this point: GIN would not collect matched ItemPointers in a tidbitmap, but return them immediately. But:

 - Because of the concurrent vacuum process: while we scan the pending list,
   its contents could be transferred into the regular structure of the index,
   and then we would find the same tuple twice. Again, amgettuple has no
   protection from that; only amgetbitmap has it. So we need to filter the
   results from the regular GIN structure by the results from the pending
   list, and for that filtering we cannot use a lossy tbm.

Again, we are talking about the amgettuple interface. We need to filter the results from the regular GIN structure by the results from the pending list. Right now GIN does that by looking up each matched ItemPointer in a tidbitmap built from the pending list. We could use an ordered array of ItemPointers instead of a tidbitmap, but I don't believe it would take less memory. Rescanning the pending list instead is impossible for two reasons: a) it is too slow, and b) a concurrent cleanup process (vacuum, or now insert) could move tuples into the regular structure.

Would it be acceptable to add an option to tidbitmap that forbids it from becoming lossy? That would remove the need to disable the index scan in gincostestimate and to check for a non-lossy tidbitmap.

In the current version, cleanup of the pending list starts much earlier than the non-lossy limit is reached in typical use-cases. An insertion process will start cleanup when at least one of these triggers fires:
 - the number of heap tuples in the pending list could produce a lossy tidbitmap;
 - the total size of the pending list is greater than work_mem. This trigger was added to speed up bulk insertion (which is used in cleanup), because it collects the whole pending list in memory at once. And this trigger is stricter than the first one, because in GIN's typical use-case a heap tuple is rather big.

I believe a user could hit GIN's error about work_mem only intentionally:
- turn off autovacuum
- set a big work_mem
- populate a table with a GIN index (with the needed number of insertions)
- prepare a query that returns a lot of results (possibly with seqscan=off, because the cost of scanning the pending list grows quickly)
- decrease work_mem by at least ten times
- execute the query


--
Teodor Sigaev                                   E-mail: teo...@sigaev.ru
                                                   WWW: http://www.sigaev.ru/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers