Hi,

I've spent a few hours stress-testing this a bit - loading a mail
archive with ~1M of messages (with headers stored in a jsonb column) and
then doing queries on that. Good news - no crashes or any such issues so
far. The queries that I ran manually seem to return sane results.

The only problem I ran into is with limited index row size with GIN
indexes. I understand it's not a bug, but I admit I haven't realized I
might run into it in this case ...

The data I used for testing is just a bunch of e-mail messages, with
headers stored as jsonb, so each row has something like this in
"headers" column:

{
 "from" : "John Doe <j...@example.com>",
 "to" : ["Jane Doe <j...@example.com>", "Jack Doe <j...@example.com>"],
 "cc" : ...,
 "bcc" : ...,
 ... various other headers ...
}

The snag is that some of the header values may be very long, exceeding
the limit of 1352 bytes and causing errors like this:

  ERROR:  index row size 1416 exceeds maximum 1352 for index "gin_idx"

A good example of such header is "dkim-signature" which basically
contains the whole message digitally signed with DKIM. The signature
tends to be long and non-compressible, thanks to the signature.

I'm wondering what's the best way around this, because I suspect many
new users (especially those attracted by jsonb and GIN improvements)
will run into this. Maybe not immediately, but eventully they'll try to
insert a jsonb with long value, and it will fail ...

With btree indexes on text I would probably create an index on
substr(column,0,1000) or something like that, but doing that with JSON
seems a bit strange.

I assume we need to store the actual values in the GIN index (so a hash
is not sufficient), right?

GIST indexes work, but with that I have to give up the significant
performance gains that we got thanks to Alexander's GIN patches.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to