So I ran this test:

unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;' | grep ".." > text.txt
ls -hal ./text.*
#-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
#-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
mv /tmp/text.* /var/lib/postgresql/9.0/main/
cd ~/; psql -d postgres
#psql (9.0.4)
CREATE DATABASE test;
\q
cd ~/; psql -d test
CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
INSERT INTO test VALUES (
    'text.docx',
    LO_IMPORT('text.docx'),
    TO_TSVECTOR(pg_read_file('text.txt', 0, 100000000))
);
and I got this:

#ERROR: string is too long for tsvector (30990860 bytes, max 1048575 bytes)

Doing the math:

echo "scale=3;29/(30990860/1048575)" | bc
#==0.981

That indicates a (worst-case) limit of searching only about the first 1 MB of a text file before you need to start building your own search, perhaps on top of tsvector. The year is 2011; I don't think searching a 2 MB text file is too much to expect. So how am I to use the PostgreSQL FTS as a "full text search" when the above example can only handle a "small or partial text search"? If I'm not missing anything, maybe the documentation should be adjusted accordingly. Any thoughts or alternatives are most welcome.
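The only workaround I can think of so far is to chunk the text across multiple rows so that each row's tsvector stays under the limit. Below is a minimal sketch under some assumptions: the test_chunks table, the ~500 kB chunk size, and the search term 'example' are all hypothetical, and the chunk count is sized for the 29 MB file above.

CREATE TABLE test_chunks(title VARCHAR(256), chunk INTEGER, words TSVECTOR);

-- 61 chunks of ~500 kB cover the ~29 MB text file; each chunk's
-- tsvector should stay well under the 1048575-byte ceiling.
INSERT INTO test_chunks
SELECT 'text.docx',
       n,
       TO_TSVECTOR(pg_read_file('text.txt', n * 500000, 500000))
FROM generate_series(0, 60) AS n;

-- A document matches if any of its chunks matches.
SELECT DISTINCT title
FROM test_chunks
WHERE words @@ TO_TSQUERY('example');

One obvious flaw: a word straddling a 500 kB boundary gets split in two, so in practice the chunks would need to overlap or be cut on whitespace. That still amounts to building my own search on top of tsvector, which is exactly what I was hoping to avoid.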