Kevin Grittner wrote:
> Tim wrote:
>
<...>
> Your test (whatever data it is that you used) doesn't seem typical of
> English text. The entire PostgreSQL documentation in HTML form,
> when all the HTML files are concatenated, is 11424165 bytes (11MB),
> and the tsvector of that is 364410 bytes (356KB).
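For reference, a measurement of that kind can be sketched in SQL. The table name, file name, and text search configuration below are illustrative assumptions rather than anything from this thread, and pg_read_file() requires superuser rights and a path relative to the data directory.

-- Hypothetical sketch: load one large document, build its tsvector,
-- and compare the raw text size with the tsvector size.
CREATE TABLE doc_test (id serial PRIMARY KEY, body text, body_tsv tsvector);

INSERT INTO doc_test (body)
VALUES (pg_read_file('pgsql_docs_concatenated.html'));

UPDATE doc_test SET body_tsv = to_tsvector('english', body);

SELECT octet_length(body)       AS raw_text_bytes,
       pg_column_size(body_tsv) AS tsvector_bytes,
       length(body_tsv)         AS distinct_lexemes
FROM doc_test;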
Hi Cliff
Apache Solr et al., especially the regex search abilities, look interesting.
They seem to handle files in databases as well as those in filesystems.
It is likely a bit detached, overkill, and heavy for my needs, but I'll keep
it in mind if PostgreSQL can't fill them.
On Tue, Jun 14, 2011
Hi Kevin,
My test data was indeed atypical vocabulary; it was a dictionary file.
I was intentionally trying to hit the limit to find out where it was,
because the documentation did not directly address it.
I am mainly trying to find out if this actually will be a limitation for me.
Thank you for contri
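One way to probe that limit without a dictionary file is to generate unique words directly in SQL. The following is an assumed sketch rather than the actual test described above: it builds a string of distinct "words" and measures the resulting tsvector.

-- Hypothetical probe: 50000 unique words fit comfortably; pushing the series
-- into the hundreds of thousands eventually exceeds the 1MB tsvector limit,
-- and to_tsvector() then raises an error instead of returning a value.
SELECT pg_column_size(
         to_tsvector('simple',
                     (SELECT string_agg('word' || g, ' ')
                      FROM generate_series(1, 50000) AS g)));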
Hi Craig,
Thanks for writing.
If one were to try to work around the size limitation of tsvectors (I'm not
sure I need to yet; this thread is mainly to determine that), then instead of
using a solution involving a "vocabulary" file, one would probably be better
off discarding tsvectors and making a vocabulary table
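To make the vocabulary-table idea concrete, here is a rough, entirely hypothetical sketch (table and column names are invented for illustration): one row per (document, lexeme) pair, populated from ts_stat() over a per-document to_tsvector() call.

-- Hypothetical vocabulary table: one row per (document, lexeme) pair.
CREATE TABLE document (id serial PRIMARY KEY, title text, body text);

CREATE TABLE document_lexeme (
    doc_id integer REFERENCES document(id),
    lexeme text,
    PRIMARY KEY (doc_id, lexeme)
);

-- Populate the lexemes for document 1 from its tsvector.
INSERT INTO document_lexeme (doc_id, lexeme)
SELECT 1, word
FROM ts_stat($$SELECT to_tsvector('english', body)
               FROM document WHERE id = 1$$);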
So I ran this test:
# Extract the document body, turning XML tags and non-alphanumeric characters
# into newlines, and keep only lines of at least two characters:
unzip -p text.docx word/document.xml | perl -p -e 's/<.+?>/\n/g;s/[^a-z0-9\n]/\n/ig;' | grep ".." > text.txt
ls -hal ./text.*
#-rwxrwxrwx 1 postgres postgres 15M 2011-06-14 15:12 ./text.docx
#-rwxrwxrwx 1 postgres postgres 29M 2011-06-14 15:17 ./text.txt
mv /tmp/text.* /var/lib/p
Hi Kevin,
Thanks again for the reply.
I suspect casting and using octet_length() is not accurate.
Using "extract[ed] text" keyword or summaries would indeed be quick but is
not what I'm looking for.
I am inquiring about real-world numbers for full text search of large
documents, I'm not sure what
"Kevin Grittner" wrote:
> You could cast to text and use octet_length().
Or perhaps you're looking for pg_column_size().
http://www.postgresql.org/docs/9.0/interactive/functions-admin.html#FUNCTIONS-ADMIN-DBSIZE
-Kevin
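To make both suggestions concrete, here is a small comparison (with an arbitrary sample string) of the length of a tsvector's text representation against the stored size reported by pg_column_size():

-- Text-representation length versus stored size of the same tsvector.
SELECT octet_length(to_tsvector('english', 'the quick brown fox jumps over the lazy dog')::text) AS text_repr_bytes,
       pg_column_size(to_tsvector('english', 'the quick brown fox jumps over the lazy dog')) AS stored_bytes;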
Tim wrote:
> I would be surprised if there is no general "how big is this
> object" method in PostgreSQL.
You could cast to text and use octet_length().
> If it's "bad design" to store large text documents (pdf,docx,etc)
> as a BLOBs or on a filesystem and make them searchable with
> tsvecto
Hi Kevin,
Thanks for the reply.
I suspect there must have been some testing when the tsvector type was created,
and I would be surprised if there is no general "how big is this object"
method in PostgreSQL.
That said, perhaps this is the wrong mailing list for this question.
If it's "bad design" to store large text documents (pdf, docx, etc.) as BLOBs
or on a filesystem and make them searchable with tsvectors
Tim wrote:
> How many bytes of a tsvector would a 32MB ascii english unique
> word list make?
> How many bytes of a tsvector would something like "The Lord of the
> Rings.txt" make?
It would appear that nobody has run into this as a limit, nor done
those specific tests. Storing a series of no