On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
>> On 01/21/2014 04:02 AM, Tomas Vondra wrote:
>>> On 20.1.2014 19:30, Heikki Linnakangas wrote:
>>>>
>>>> Attached is yet another version, with more bugs fixed and more
>>>> comments added and updated. I would appreciate some heavy testing
>>>> of this patch now. If you could re-run the tests you've been
>>>> using, that would be great. I've tested the WAL replay by
>>>> replicating GIN operations over streaming replication. That
>>>> doesn't guarantee it's correct, but it's a good smoke test.
>>>
>>> I gave it a try - the OOM error seems to be gone, but now I get
>>> this:
>>>
>>> PANIC: cannot insert duplicate items to GIN index page
>>>
>>> This only happens when building the index incrementally (i.e. using
>>> a sequence of INSERT statements into a table with a GIN index).
>>> When I create a new index on a table (already containing the same
>>> dataset), it works just fine.
>>>
>>> Also, I tried to reproduce the issue by running a simple plpgsql
>>> loop (instead of a complex python script):
>>>
>>> DO LANGUAGE plpgsql $$
>>> DECLARE
>>>   r tsvector;
>>> BEGIN
>>>   FOR r IN SELECT body_tsvector FROM data_table LOOP
>>>     INSERT INTO idx_table (body_tsvector) VALUES (r);
>>>   END LOOP;
>>> END$$;
>>>
>>> where data_table is the table with the imported data (the same data
>>> I mentioned in the post about the OOM errors), and idx_table is an
>>> empty table with a GIN index. And indeed it fails, but only if I
>>> run the block in multiple sessions in parallel.
>>
>> Oh, I see what's going on. I had assumed that there cannot be
>> duplicate insertions into the posting tree, but that's dead wrong.
>> The fast insertion mechanism depends on a duplicate insertion doing
>> nothing.
>
> Ok, this turned out to be a much bigger change than I thought...
>
> In principle, it's not difficult to eliminate duplicates. However, to
> detect a duplicate, you have to check if the item is present in the
> uncompressed items array, or in the compressed lists. That requires
> decoding the segment where it should be.
>
> But if we decode the segment, what's the purpose of the uncompressed
> items array? Its purpose was to speed up insertions, by buffering
> them so that we don't need to decode and re-encode the page/segment
> on every inserted item. But if we're paying the price of decoding it
> anyway, we might as well include the new item and re-encode the
> segment. The uncompressed area saves some effort in WAL-logging, as
> the record of inserting an entry into the uncompressed area is much
> smaller than that of re-encoding part of the page, but if that really
> is a concern, we could track more carefully which parts of the page
> are modified, and only WAL-log the required parts. And hopefully, the
> fast-update lists buffer inserts so that you insert many items at a
> time into the posting tree, and the price of re-encoding is only paid
> once.
>
> So, now that I think about this once more, I don't think the
> uncompressed area in every leaf page is a good idea.

I don't see why we couldn't insert duplicate TIDs into the uncompressed
area and eliminate them during re-encoding. That would mean somewhat
more work during updates, more work during scans, and more WAL records.
But is that really significant for real-life workloads?
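To illustrate the cost I have in mind: the duplicate check only needs
to decode one segment, and only up to the target TID. A minimal sketch
in C, assuming a plain varbyte delta encoding of sorted TIDs - the
48-bit packed item pointer and all the names here are my invention for
illustration, not the patch's actual code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ItemId48;      /* (block, offset) packed into 48 bits */

/* Decode one varbyte-encoded unsigned integer, advancing *p. */
static uint64_t
varbyte_decode(const unsigned char **p)
{
    uint64_t    val = 0;
    int         shift = 0;
    unsigned char b;

    do
    {
        b = *(*p)++;
        val |= (uint64_t) (b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);

    return val;
}

/*
 * Return true if 'target' is already present in the segment. Items
 * are stored as deltas from the previous item, so the prefix up to
 * 'target' must be decoded - but nothing beyond it, since the list
 * is sorted.
 */
static bool
segment_contains(const unsigned char *seg, size_t len, ItemId48 target)
{
    const unsigned char *p = seg;
    const unsigned char *end = seg + len;
    ItemId48    cur = 0;

    while (p < end)
    {
        cur += varbyte_decode(&p);  /* add delta from previous TID */
        if (cur >= target)
            return cur == target;   /* sorted: safe to stop here */
    }
    return false;
}

On average that's half a segment of decoding per insert, and the
re-encode path has to decode the whole segment anyway, so the duplicate
check adds little on top of it.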
> I refactored the way the recompression routine works again. It is
> now more clearly a multi-step process. First, the existing page is
> "disassembled" into an in-memory struct, which is a linked list of
> all the segments. In memory, each segment can be represented as an
> array of item pointers, or in the compressed format. In the next
> phase, all the new items are added to the segments where they belong.
> This naturally can lead to overly large segments; in the third phase,
> the items are redistributed among the segments, and compressed back
> to the compressed format. Finally, the finished segments are written
> back to the page, or pages if it had to be split.
>
> The same subroutines to disassemble and recompress are also used in
> vacuum.
>
> Attached is what I've got now. This is again quite heavily changed
> from the previous version - sorry for repeatedly rewriting this. I'll
> continue testing and re-reviewing this tomorrow.

Let's clarify where we are. We had a fairly well-debugged and tested
version with hard-wired varbyte encoding. Now we're reimplementing and
debugging a segmented version of varbyte encoding, in the hope that one
day we can easily replace it with something better that meets our
restrictions (though we haven't found that something yet). Is that
right?

------
With best regards,
Alexander Korotkov.
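P.S. To check my understanding of the redistribute step: after the new
items have been merged in, the combined sorted items are simply
re-sliced into bounded-size segments before re-encoding. A
self-contained sketch - the struct, the size cap and the names are
invented for illustration, not the patch's actual types or values:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ITEMS_PER_SEGMENT 128   /* assumed cap, not the patch's value */

typedef struct Segment
{
    struct Segment *next;   /* linked list, as in the disassembled page */
    uint64_t    items[MAX_ITEMS_PER_SEGMENT];
    int         nitems;
} Segment;

/*
 * Rebuild the segment list from one flat, sorted, de-duplicated item
 * array, so that no segment exceeds the cap. Error handling omitted.
 */
static Segment *
redistribute_items(const uint64_t *items, int nitems)
{
    Segment    *head = NULL;
    Segment   **tail = &head;

    while (nitems > 0)
    {
        int         chunk = (nitems < MAX_ITEMS_PER_SEGMENT)
                            ? nitems : MAX_ITEMS_PER_SEGMENT;
        Segment    *seg = calloc(1, sizeof(Segment));

        memcpy(seg->items, items, chunk * sizeof(uint64_t));
        seg->nitems = chunk;
        *tail = seg;
        tail = &seg->next;
        items += chunk;
        nitems -= chunk;
    }
    return head;
}

If that's the right picture, then the encoding-specific work is
confined to the compress/decompress subroutines, which fits the goal of
being able to swap the encoding out later.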