Hi Kevin,

My test was indeed atypical vocabulary; it was a dictionary file. I was intentionally trying to hit the limit to find out where it was, because the documentation does not directly address it. I am mainly trying to find out whether this will actually be a limitation for me.

Thank you for contributing the test data on the PostgreSQL docs (text=11MB, tsvector=0.356MB) and the anonymous technical book (text=0.2MB, tsvector=0.1MB). It seems that as long as a document uses a small percentage of the language's potential vocabulary, it would be hard to hit the limit. On the other hand, if someone tries to upload a dictionary, or a file that uses more than about 4% of one vocabulary, it would need special handling. (I guess this would be more likely with larger vocabularies, such as the Japanese writing systems.) The 4% figure is probably low due to misspellings, slang, etc.
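For what it's worth, a rough way to check the ratio for a given file would be something like this (untested; the path and length arguments are just examples, and pg_column_size is only an approximation of the internal limit the error message refers to):

SELECT pg_column_size(txt)              AS text_bytes,
       pg_column_size(to_tsvector(txt)) AS tsvector_bytes
FROM (SELECT pg_read_file('text.txt', 0, 100000000) AS txt) AS f;

(This obviously errors out once to_tsvector itself hits the limit, which is exactly the case I need to catch.)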
Anyway, I now have an approximate answer to the original question of where the limit is, and for my goals it's probably safe to just check and warn if a file can't be indexed (a rough sketch of what I mean is below the quoted message).

On Tue, Jun 14, 2011 at 6:19 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:

> Tim <elatl...@gmail.com> wrote:
>
> > So I ran this test:
> >
> > unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;' | grep ".." > text.txt
> > ls -hal ./text.*
> > #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> > #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> > mv /tmp/text.* /var/lib/postgresql/9.0/main/
> > cd ~/; psql -d postgres
> > #psql (9.0.4)
> > CREATE DATABASE test;
> > \q
> > cd ~/; psql -d test
> > CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> > INSERT INTO test VALUES ( 'text.docx', LO_IMPORT('text.docx'),
> >     TO_TSVECTOR(pg_read_file('text.txt', 0, 100000000)) );
> >
> > and I got this:
> >
> > #ERROR: string is too long for tsvector (30990860 bytes, max 1048575 bytes)
>
> Your test (whatever data it is that you used) doesn't seem typical of
> English text. The entire PostgreSQL documentation in HTML form, when
> all the html files are concatenated, is 11424165 bytes (11MB), and the
> tsvector of that is 364410 bytes (356KB). I don't suppose you know of
> some publicly available file on the web that I could use to reproduce
> your problem?
>
> > The year is 2011; I don't think searching a 2MB text file is too
> > much to expect.
>
> Based on the ratio for the PostgreSQL docs, it seems possible to index
> documents considerably larger than that. Without the markup (as in the
> case of a PDF), I bet it would take a lot less than what I saw for the
> docs. A printed or typewritten page usually has about 2KB of text. I
> used pdftotext to get as text the contents of a 119-page technical book
> about database technology, and it came to 235KB of text. I made a
> tsvector for that, and it was 99KB. So, at *that* rate you'd need about
> 100 books that size, totaling 11,900 pages of text, in a single
> document to hit the limit you showed. Well, probably more than that,
> because some of the words might be repeated from one book to another.
>
> So, I'm back to wondering what problem you're trying to solve where
> this is actually a limitation for you.
>
> -Kevin
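P.S. For "check and warn", something along these lines is what I have in mind -- an untested sketch, with a made-up function name, and assuming the overflow is reported as the program_limit_exceeded condition (if it isn't, WHEN OTHERS would do):

CREATE OR REPLACE FUNCTION try_tsvector(doc text) RETURNS tsvector AS $$
BEGIN
    -- Build the tsvector normally when the document fits.
    RETURN to_tsvector(doc);
EXCEPTION
    WHEN program_limit_exceeded THEN
        -- Hit the ~1MB tsvector limit: warn and store the row unindexed.
        RAISE WARNING 'document too large for tsvector; skipping index data';
        RETURN NULL;
END;
$$ LANGUAGE plpgsql;

INSERT INTO test VALUES ('text.docx', LO_IMPORT('text.docx'),
    try_tsvector(pg_read_file('text.txt', 0, 100000000)));

That way oversized files still go into the table, just without a words value, and the warning tells me which ones need special handling.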