On 24.08.2011 16:57, Alexander Korotkov wrote:
I've added some testing results to the wiki page:
http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011
These are not yet all the results I planned for the first chunk, because testing takes more time than I expected.
Some notes about it.

Now I see two causes that accelerate regular build of GiST indexes:
1) As noted before, regular index build on a fairly well-ordered dataset is fast.
2) I found that a worse index is faster to build; by "worse" I mean an index with higher overlap. The function gistchoose selects the first index tuple with zero penalty, if there is one. Thus, with high overlap in the root page, only a few of its index tuples will ever be chosen for inserts. And, recursively, only a small part of the tree will be used for actual inserts, and that part of the tree fits in the cache more easily. So high overlap makes inserts cheaper to the same degree that it makes searches more expensive.

As an extreme case, a trivial penalty function that just always returns 0 will make index build fast - but the index will be useless for querying.
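To make that mechanism concrete, here is a minimal stand-alone sketch of the "first zero-penalty entry wins" rule. The 1-D interval type and penalty here are hypothetical stand-ins for illustration, not the actual PostgreSQL gistchoose code or a real opclass penalty callback:

```c
/* Hypothetical 1-D stand-in for a GiST entry's bounding predicate. */
typedef struct
{
    double      lo;
    double      hi;
} Interval;

/* Simplified penalty: how much the interval must grow to absorb v.
 * Zero means v is already covered. */
static double
penalty(const Interval *e, double v)
{
    double      grow = 0.0;

    if (v < e->lo)
        grow += e->lo - v;
    if (v > e->hi)
        grow += v - e->hi;
    return grow;
}

/* Mimics gistchoose's short-circuit: scan entries in page order and
 * return the FIRST one with zero penalty; otherwise fall back to the
 * minimum-penalty entry.  With heavily overlapping entries, most
 * values hit a zero-penalty entry early in the scan, so inserts pile
 * into the same few subtrees -- the cache-friendliness effect
 * described above.  A penalty that always returned zero would be the
 * degenerate extreme: every insert descends into entry 0. */
static int
choose_subtree(const Interval *entries, int n, double v)
{
    int         best = 0;
    double      best_pen = penalty(&entries[0], v);

    for (int i = 0; i < n; i++)
    {
        double      p = penalty(&entries[i], v);

        if (p == 0.0)
            return i;           /* first zero-penalty entry wins */
        if (p < best_pen)
        {
            best_pen = p;
            best = i;
        }
    }
    return best;
}
```
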

In the tests on the first version of the patch, I found the index quality of regular build to be much better than that of buffering build (without neighbor relocation). Now they are similar, though that's because the index quality of regular build has become worse. As a result, in the current tests regular index build is faster than in the previous ones. I see the following possible causes:
1) I didn't save the source random data, so these tests use a new random dataset.
2) Some environment parameters of my test setup may have changed, though I doubt it.
Despite these possible explanations, it still seems quite strange to me.

That's pretty surprising. Assuming the data is truly random, I wouldn't expect a big difference in the index quality of one random data set over another. If the index quality depends so much on, say, the distribution of the first few tuples inserted into it, that's quite an interesting find on its own, and merits some further research.

In order to compare index build methods on higher-quality indexes, I've
tried building indexes with my double sorting split method (see:
http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). With
it, search on a uniform dataset is about 10 times faster! And, as
expected, regular index build becomes much slower: it has been running for
more than 60 hours and only about 50% of the index is complete (estimated from file sizes).

Also, automatic switching to buffering build shows better index quality
in all the tests, though it's hard for me to explain why.

Hmm, makes me a bit uneasy that we're testing with a modified page splitting algorithm. But if the new algorithm is that good, could you post that as a separate patch, please?

That said, I don't see any new evidence that the buffering build algorithm would be significantly worse. There's the case of ordered data that we already knew about, and will have to just accept for now.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
