Hi Craig,

Thanks for writing.
If one were to try to work around the tsvector size limit (I'm not sure I need to yet; this thread is mainly to determine that), then instead of a solution involving a "vocabulary" file, one would probably be better off discarding tsvectors entirely: build a vocabulary table and link it to documents with a (dict_id, hit_count, word_id) table. That would be faster, smaller, and more accurate, because it would drop the now-useless position information while keeping the word-count information that is otherwise lost. (A rough schema sketch is below, before the quoted message.)

I wonder if anyone has any insight into the inner workings of Oracle/MSSQL/etc. FTS. Maybe there is a common design pattern I/we could use to handle text files with a non-trivial vocabulary while still preserving position information.

I'm not sure a (dict_id, position, word_id) table would be queryable in a useful way, because as far as I can think at the moment there is no good way to compare the positions in different rows with SQL. (The closest I can come up with is the self-join sketched below.)

I will collect a few more sample files to see how much I really need this... multi-language files are probably the worst offenders.
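To make the table idea concrete, here is roughly the shape of schema I have in mind. It is a rough, untested sketch; every table and column name beyond the ones mentioned above is made up for illustration:

-- rough, untested sketch; names are hypothetical
CREATE TABLE vocabulary (
    word_id  SERIAL PRIMARY KEY,
    word     TEXT UNIQUE NOT NULL         -- one row per distinct word
);

CREATE TABLE document (
    dict_id  SERIAL PRIMARY KEY,          -- one row per indexed file
    title    VARCHAR(256),
    data     OID
);

-- one row per (document, word); the hit count replaces the tsvector
CREATE TABLE doc_word (
    dict_id    INT REFERENCES document,
    word_id    INT REFERENCES vocabulary,
    hit_count  INT,
    PRIMARY KEY (dict_id, word_id)
);

-- find documents containing a word, most hits first
SELECT d.title, dw.hit_count
FROM doc_word dw
JOIN vocabulary v USING (word_id)
JOIN document d USING (dict_id)
WHERE v.word = 'postgresql'
ORDER BY dw.hit_count DESC;

The vocabulary table only grows with the number of distinct words, so it stays small even for very large files, and hit_count keeps the word-count information I mentioned above.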
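For completeness, the position-preserving variant I mentioned would need something like the self-join below for a phrase query, which is the only approach I can think of; it already looks clumsy for a two-word phrase, never mind longer ones. Again untested, names hypothetical:

-- one row per word occurrence; position is the word offset in the document
CREATE TABLE doc_word_pos (
    dict_id   INT REFERENCES document,
    word_id   INT REFERENCES vocabulary,
    position  INT
);

-- phrase search for 'full' immediately followed by 'text':
-- one self-join per adjacent word pair
SELECT DISTINCT a.dict_id
FROM doc_word_pos a
JOIN doc_word_pos b
  ON b.dict_id = a.dict_id
 AND b.position = a.position + 1
WHERE a.word_id = (SELECT word_id FROM vocabulary WHERE word = 'full')
  AND b.word_id = (SELECT word_id FROM vocabulary WHERE word = 'text');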
On Tue, Jun 14, 2011 at 5:12 PM, Craig James <craig_ja...@emolecules.com> wrote:

> On 6/14/11 1:42 PM, Tim wrote:
>
> So I ran this test:
>
> unzip -p text.docx word/document.xml | perl -p -e
> 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;' | grep ".." > text.txt
> ls -hal ./text.*
> #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> mv /tmp/text.* /var/lib/postgresql/9.0/main/
> cd ~/; psql -d postgres
> #psql (9.0.4)
> CREATE DATABASE test;
> \q
> cd ~/; psql -d test
> CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> INSERT INTO test VALUES ( 'text.docx', LO_IMPORT('text.docx'),
> TO_TSVECTOR(pg_read_file('text.txt', 0, 100000000)) );
>
> and I got this:
>
> #ERROR: string is too long for tsvector (30990860 bytes, max 1048575 bytes)
>
> Doing the math:
>
> echo "scale=3;29/(30990860/1048575)"|bc #==0.981
>
> That indicates a (worst-case) limit of searching only the first ~1MB of a
> text file before you need to start building your own search, maybe on top
> of tsvector. The year is 2011; I don't think searching a 2MB text file is
> too much to expect.
>
> The novel "Hawaii" at 960 pages is roughly 1MB. tsvector was intended for
> documents (web pages, news articles, corporate memos, ...), not for books.
> What you're asking for is interesting, but you can't complain that an
> open-source project that was designed for a different purpose doesn't meet
> your needs.
>
> So how am I to use PGSQL FTS as a "full text search" when the above
> example can only handle a "small or partial text search"? If I'm not
> missing anything, maybe the documentation should be adjusted accordingly.
>
> Maybe a better question is, "How am I to use PGSQL FTS as a 'massively
> huge text search' when it was designed for nothing bigger than a 'huge
> text search'?"
>
> Any thoughts or alternatives are most welcome.
>
> I'm curious how tsvector could be useful on a 29 MB document. That's
> roughly one whole encyclopedia set. A document that size should have a
> huge vocabulary, and tsvector's index would be saturated.
>
> However, if the vocabulary in this 29 MB document isn't that big, then you
> might consider creating a smaller "document." You could write a Perl
> script that scans the document and creates a dictionary, which it writes
> out as a secondary "vocabulary" file that's a list of the unique words in
> your document. Create an auxiliary column in your database to hold this
> vocabulary for each document, and use tsvector to index that. The Perl
> program would be trivial, and tsvector would be happy.
>
> Craig
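If I follow your suggestion, the vocabulary file could come straight out of the one-liner from my test plus a uniq pass, and the auxiliary column would just be a second tsvector built from that much smaller file. Something like this, I suppose (untested, and it assumes vocab.txt gets moved into the data directory the same way text.txt was):

# extract words as before, then lowercase and de-duplicate
unzip -p text.docx word/document.xml \
  | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;' \
  | tr 'A-Z' 'a-z' | grep ".." | sort -u > vocab.txt

and then in psql:

ALTER TABLE test ADD COLUMN vocab TSVECTOR;
UPDATE test
   SET vocab = TO_TSVECTOR(pg_read_file('vocab.txt', 0, 100000000))
 WHERE title = 'text.docx';

That keeps what tsvector has to swallow down to the size of the vocabulary rather than the size of the document, at the cost of the positions and counts that got me thinking about the tables above.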