Hi Kevin,

My test was indeed atypical vocabulary; it was a dictionary file. I was intentionally trying to hit the limit to find out where it was, because the documentation does not directly address it. I am mainly trying to find out whether this will actually be a limitation for me.

Thank you for contributing the test data on the PostgreSQL docs (text=11MB, tsvector=0.356MB) and the anonymous technical book (text=0.2MB, tsvector=0.1MB). It seems that as long as a document uses a small percentage of the language's potential vocabulary, it would be hard to hit the limit. On the other hand, if someone tries to upload a dictionary, or a file that uses more than about 4% of one vocabulary, it would need special handling. (I guess this would be more likely with larger vocabularies, such as the Japanese writing systems.) The 4% figure is probably low due to misspellings, slang, etc.
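For what it's worth, a rough way to check the ratio for a given file would be something like this (untested; the path and length arguments are just examples, and pg_column_size is only an approximation of the internal limit the error message refers to):

SELECT pg_column_size(txt)              AS text_bytes,
       pg_column_size(to_tsvector(txt)) AS tsvector_bytes
FROM (SELECT pg_read_file('text.txt', 0, 100000000) AS txt) AS f;

(This obviously errors out once to_tsvector itself hits the limit, which is exactly the case I need to catch.)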
Anyway, I now have an approximate answer to the original question of where the limit is, and for my goals it's probably safe to just check and warn if a file can't be indexed (a rough sketch of what I mean is below the quoted message).

On Tue, Jun 14, 2011 at 6:19 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:

> Tim <elatl...@gmail.com> wrote:
>
> > So I ran this test:
> >
> > unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;' | grep ".." > text.txt
> > ls -hal ./text.*
> > #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> > #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> > mv /tmp/text.* /var/lib/postgresql/9.0/main/
> > cd ~/; psql -d postgres
> > #psql (9.0.4)
> > CREATE DATABASE test;
> > \q
> > cd ~/; psql -d test
> > CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> > INSERT INTO test VALUES ( 'text.docx', LO_IMPORT('text.docx'),
> >     TO_TSVECTOR(pg_read_file('text.txt', 0, 100000000)) );
> >
> > and I got this:
> >
> > #ERROR: string is too long for tsvector (30990860 bytes, max 1048575 bytes)
>
> Your test (whatever data it is that you used) doesn't seem typical of
> English text. The entire PostgreSQL documentation in HTML form, when
> all the html files are concatenated, is 11424165 bytes (11MB), and the
> tsvector of that is 364410 bytes (356KB). I don't suppose you know of
> some publicly available file on the web that I could use to reproduce
> your problem?
>
> > The year is 2011; I don't think searching a 2MB text file is too
> > much to expect.
>
> Based on the ratio for the PostgreSQL docs, it seems possible to index
> documents considerably larger than that. Without the markup (as in the
> case of a PDF), I bet it would take a lot less than what I saw for the
> docs. A printed or typewritten page usually has about 2KB of text. I
> used pdftotext to get as text the contents of a 119-page technical book
> about database technology, and it came to 235KB of text. I made a
> tsvector for that, and it was 99KB. So, at *that* rate you'd need about
> 100 books that size, totaling 11,900 pages of text, in a single
> document to hit the limit you showed. Well, probably more than that,
> because some of the words might be repeated from one book to another.
>
> So, I'm back to wondering what problem you're trying to solve where
> this is actually a limitation for you.
>
> -Kevin
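P.S. For "check and warn", something along these lines is what I have in mind -- an untested sketch, with a made-up function name, and assuming the overflow is reported as the program_limit_exceeded condition (if it isn't, WHEN OTHERS would do):

CREATE OR REPLACE FUNCTION try_tsvector(doc text) RETURNS tsvector AS $$
BEGIN
    -- Build the tsvector normally when the document fits.
    RETURN to_tsvector(doc);
EXCEPTION
    WHEN program_limit_exceeded THEN
        -- Hit the ~1MB tsvector limit: warn and store the row unindexed.
        RAISE WARNING 'document too large for tsvector; skipping index data';
        RETURN NULL;
END;
$$ LANGUAGE plpgsql;

INSERT INTO test VALUES ('text.docx', LO_IMPORT('text.docx'),
    try_tsvector(pg_read_file('text.txt', 0, 100000000)));

That way oversized files still go into the table, just without a words value, and the warning tells me which ones need special handling.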