Hi Craig,

Thanks for writing.

If one were to try to get past the tsvector size limit (I'm not sure I need
to yet; this thread is mainly to determine that), then instead of a solution
involving a "vocabulary" file, one would probably be better off discarding
tsvectors altogether: build a vocabulary table and link it to documents with
a (dict_id, hit_count, word_id) table.
That would be faster, smaller, and more accurate, because it would drop the
now-useless position information while keeping the word-count information
that is otherwise lost.
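
For concreteness, here is a minimal, untested sketch of the schema I have in
mind (everything beyond dict_id/hit_count/word_id is just a placeholder name
of mine, nothing that exists yet):

-- one row per document; dict_id is the document's key
CREATE TABLE documents (
    dict_id  SERIAL PRIMARY KEY,
    title    VARCHAR(256),
    data     OID
);

-- one row per distinct word across all documents
CREATE TABLE vocabulary (
    word_id  SERIAL PRIMARY KEY,
    word     TEXT UNIQUE NOT NULL
);

-- one row per (document, word) pair, keeping the word count
CREATE TABLE doc_words (
    dict_id   INTEGER NOT NULL REFERENCES documents(dict_id),
    word_id   INTEGER NOT NULL REFERENCES vocabulary(word_id),
    hit_count INTEGER NOT NULL,
    PRIMARY KEY (dict_id, word_id)
);

CREATE INDEX doc_words_word_idx ON doc_words (word_id);

-- e.g. rank documents containing 'foo' by how often it appears
SELECT d.title, dw.hit_count
FROM vocabulary v
JOIN doc_words dw ON dw.word_id = v.word_id
JOIN documents d  ON d.dict_id  = dw.dict_id
WHERE v.word = 'foo'
ORDER BY dw.hit_count DESC;

The composite (dict_id, word_id) key keeps the link table compact, and the
index on word_id is what the word-to-documents lookup would use.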

I wonder if anyone has any insight into the inner workings of
Oracle/MSSQL/etc. FTS.
Maybe there is a common design pattern I/we could use to handle text files
with a non-trivial vocabulary that still preserves position information.
I'm not sure a (dict_id, position, word_id) table would be queryable in a
useful way, because as far as I can think at the moment there is no good way
to compare the positions in different rows with SQL.
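Just to make that concrete, the only shape I can picture is a self-join on
position (reusing the vocabulary table from the sketch above), e.g. a
two-word phrase search over a hypothetical positional table; whether that is
actually usable on large documents is exactly what I'm doubtful about:

-- hypothetical table: one row per word occurrence in a document
CREATE TABLE doc_positions (
    dict_id  INTEGER NOT NULL,
    position INTEGER NOT NULL,
    word_id  INTEGER NOT NULL,
    PRIMARY KEY (dict_id, position)
);

-- find documents where 'full' is immediately followed by 'text'
SELECT DISTINCT p1.dict_id
FROM doc_positions p1
JOIN doc_positions p2
  ON  p2.dict_id  = p1.dict_id
  AND p2.position = p1.position + 1
WHERE p1.word_id = (SELECT word_id FROM vocabulary WHERE word = 'full')
  AND p2.word_id = (SELECT word_id FROM vocabulary WHERE word = 'text');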

I will collect a few more sample files to see how much I really need this
... multi-language files are probably the worst offenders.



On Tue, Jun 14, 2011 at 5:12 PM, Craig James <craig_ja...@emolecules.com> wrote:

>  On 6/14/11 1:42 PM, Tim wrote:
>
> So I ran this test:
> unzip -p text.docx word/document.xml | perl -p -e
> 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;'|grep ".." > text.txt
> ls -hal ./text.*
> #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
> mv /tmp/text.* /var/lib/postgresql/9.0/main/
> cd ~/;psql -d postgres
> #psql (9.0.4)
> CREATE DATABASE test;
> \q
> cd ~/;psql -d test
> CREATE TABLE test(title VARCHAR(256), data OID, words TSVECTOR);
> INSERT INTO test VALUES (  'text.docx',  LO_IMPORT('text.docx'),
> TO_TSVECTOR(pg_read_file('text.txt' ,0, 100000000))  );
>
>  and I got this:
>  #ERROR:  string is too long for tsvector (30990860 bytes, max 1048575
> bytes)
>
>  doing the math
> echo "scale=3;29/(30990860/1048575)"|bc #==0.981
>
> That indicates a (worst-case) limit of searching only the first 1MB of a
> text file before you need to start building your own search, maybe on top
> of tsvector.
> The year is 2011; I don't think searching a 2MB text file is too much to
> expect.
>
> The novel "Hawaii" at 960 pages is roughly 1MB.  tsvector was intended for
> documents (web pages, news articles, corporate memos, ...), not for books.
> What you're asking for is interesting, but you can't complain that an
> open-source project that was designed for a different purpose doesn't meet
> your needs.
>
> So how am I to use PGSQL FTS as a "full text search" when the above
> example can only handle a "small or partial text search"?
> If I'm not missing anything, maybe the documentation should be adjusted
> accordingly.
>
> Maybe a better question is, "So how am I to use PGSQL FTS as a 'massively
> huge text search' when it was designed for nothing bigger than 'huge text
> search'?"
>
> Any thoughts or alternatives are most welcome.
>
> I'm curious how tsvector could be useful on a 29 MB document.  That's
> roughly one whole encyclopedia set.  A document that size should have a huge
> vocabulary, and tsvector's index would be saturated.
>
> However, if the vocabulary in this 29 MB document isn't that big, then you
> might consider creating a smaller "document."  You could write a Perl script
> that scans the document and creates a dictionary which it writes out as a
> secondary "vocabulary" file that's a list of the unique words in your
> document.  Create an auxiliary column in your database to hold this
> vocabulary for each document, and use tsvector to index that.  The Perl
> program would be trivial, and tsvector would be happy.
>
> Craig
>
>