Re: [ADMIN] tsvector limitations

2011-06-14 Thread Greg Williamson
Kevin Grittner wrote:
> Tim wrote:
> <...>
> Your test (whatever data it is that you used) doesn't seem typical of English text. The entire PostgreSQL documentation in HTML form, when all the html files are concatenated, is 11424165 bytes (11MB), and the tsvector of that is 364410 bytes (356KB).

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Tim
Hi Cliff, Apache SOLR et al., especially the regex search abilities, look interesting. They seem to handle files in databases as well as those in filesystems. It is likely a bit detached, overkill, and heavy for my needs, but I'll keep it in mind if PostgreSQL can't fill them. On Tue, Jun 14, 2011

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Tim
Hi Kevin, My test was indeed atypical vocabulary; it was a dictionary file. I was intentionally trying to hit the limit to find out where it was, because the documentation did not directly address it. I am mainly trying to find out if this actually will be a limitation for me. Thank you for contri

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Tim
Hi Craig, Thanks for writing. If one were to try to increase the limitation of tsvectors (I'm not sure I need to yet; this thread is mainly to determine that), then instead of using a solution involving a "vocabulary" file, one would probably be better off discarding tsvectors, making a vocabulary tabl

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Kevin Grittner
Tim wrote:
> So I ran this test:
> unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;' | grep ".." > text.txt
> ls -hal ./text.*
> #-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
> #-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Craig James
On 6/14/11 1:42 PM, Tim wrote:
So I ran this test:
unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;' | grep ".." > text.txt
ls -hal ./text.*
#-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
#-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./t

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Tim
So I ran this test:
unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;' | grep ".." > text.txt
ls -hal ./text.*
#-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
#-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
mv /tmp/text.* /var/lib/p
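The extraction step in the test above can be sketched as a pair of small shell functions. This is only an approximation of the quoted perl one-liner, using sed and tr instead; the function names extract_words and count_unique_words are made up for illustration. The resulting count is a rough proxy for the number of distinct lexemes, since to_tsvector also applies stemming and stop-word removal that this sketch does not.

```shell
#!/bin/sh
# Read XML on stdin, drop the tags, and emit one word per line.
# Words shorter than two characters are discarded, as grep ".." does
# in the original pipeline. (Hypothetical helper, not from the thread.)
extract_words() {
  sed 's/<[^>]*>/ /g' | tr -c 'A-Za-z0-9' '\n' | grep '..'
}

# Count distinct words case-insensitively. (Hypothetical helper.)
count_unique_words() {
  tr 'A-Z' 'a-z' | sort -u | wc -l | tr -d ' '
}

# Small demonstration on inline sample data: prints 3
printf '<w:t>Hello big world</w:t><w:t>a hello</w:t>\n' | extract_words | count_unique_words

# With a real document, the input would come from unzip as in the test:
#   unzip -p text.docx word/document.xml | extract_words | count_unique_words
```

A low unique-word count relative to the file size suggests the tsvector will be far smaller than the source text, since each distinct lexeme is stored only once.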

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Tim
Hi Kevin, Thanks again for the reply. I suspect casting and using octet_length() is not accurate. Using "extract[ed] text" keywords or summaries would indeed be quick but is not what I'm looking for. I am inquiring about real-world numbers for full text search of large documents; I'm not sure what

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Kevin Grittner
"Kevin Grittner" wrote: > You could cast to text and use octet_length(). Or perhaps you're looking for pg_column_size(). http://www.postgresql.org/docs/9.0/interactive/functions-admin.html#FUNCTIONS-ADMIN-DBSIZE -Kevin -- Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org) To

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Kevin Grittner
Tim wrote:
> I would be surprised if there is no general "how big is this object" method in PostgreSQL.
You could cast to text and use octet_length().
> If it's "bad design" to store large text documents (pdf,docx,etc) as BLOBs or on a filesystem and make them searchable with tsvecto
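The size checks suggested in this thread (casting to text and using octet_length(), or using pg_column_size()) could be compared side by side with a query along these lines. This is a minimal sketch, not something posted in the thread: the helper name tsvector_size_sql, the table name documents, the column name body, and the 'english' configuration are all assumptions. The function only prints the SQL, so it can be inspected before being piped to psql.

```shell
#!/bin/sh
# Print a query comparing the size measurements discussed in the thread.
# $1 = table name, $2 = text column name (both hypothetical placeholders).
tsvector_size_sql() {
  table=$1
  column=$2
  cat <<EOF
SELECT pg_column_size(to_tsvector('english', ${column})) AS tsvector_bytes,
       octet_length(to_tsvector('english', ${column})::text) AS text_cast_bytes,
       octet_length(${column}) AS source_bytes
FROM ${table};
EOF
}

# Show the generated query for an assumed table "documents" with column "body".
tsvector_size_sql documents body

# To run it (assumes a reachable database named "mydb"):
#   tsvector_size_sql documents body | psql mydb
```

The two figures are expected to differ, because the text cast measures the printed representation while pg_column_size() measures the stored datum.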

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Tim
Hi Kevin, Thanks for the reply. I suspect there must have been some testing when the tsvector was created, and I would be surprised if there is no general "how big is this object" method in PostgreSQL. That said, perhaps this is the wrong mailing list for this question. If it's "bad design" to sto

Re: [ADMIN] tsvector limitations

2011-06-14 Thread Kevin Grittner
Tim wrote:
> How many bytes of a tsvector would a 32MB ascii english unique word list make?
> How many bytes of a tsvector would something like "The Lord of the Rings.txt" make?
It would appear that nobody has run into this as a limit, nor done those specific tests. Storing a series of no