On 15 June 2016 at 15:03, k...@rice.edu <k...@rice.edu> wrote: > On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote: > > Hi, > > > > I have an application which stores a large amounts of hex-encoded hash > > strings (nearly 100 GB of them), which means: > > > > - The number of distinct characters (alphabet) is limited to 16 > > - Each string is of the same length, 64 characters > > - The strings are essentially random > > > > Creating a B-Tree index on this results in the index size being larger > than > > the table itself, and there are disk space constraints. > > > > I've found the SP-GIST radix tree index, and thought it could be a good > > match for the data because of the above constraints. An attempt to create > > it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes > more > > than 12 hours (while a similar B-tree index takes a few hours at most), > so > > I've interrupted it because "it probably is not going to finish in a > > reasonable time". Some slides I found on the spgist index allude that > both > > build time and size are not really suitable for this purpose. > > > > My question is: what would be the most size-efficient index for this > > situation? > > Hi Ivan, > > If the strings are really random, then maybe a function index on the first > 4, 8, or 16 characters could be used to narrow the search space and not > need > to index all 64. If they are not "good" random numbers, you could use a > hash > index on the strings. It will be much smaller since it currently uses a > 32-bit > hash. It has a number of caveats and is not currently crash-safe, but it > seems > like it might work in your environment. You can also use a functional > index on > a hash-function applied to your values with a btree to give you crash > safety. > > Hi,
I figured the hash index might be helpful and I've tried it in the meantime: on one of the smaller tables (which is 51 GB in size), a btree index is 32 GB, while the hash index is 22 GB (so btree is around 45% larger). I don't suppose there's an effort in progress to make hash indexes use WAL? :D