Re: [rt-users] RT 4.0.2 postgresql fulltext - error doing initial indexing

2011-09-19 Thread fab junkmail
Hi Alex,

Thank you very much for your responses. Notes inline.

On Tue, Sep 20, 2011 at 4:37 AM, Alex Vandiver  wrote:
> On Mon, 2011-09-19 at 13:24 +1000, fab junkmail wrote:
>> 2011-09-19 02:08:28 UTC ERROR:  string is too long for tsvector
>> (3831236 bytes, max 1048575 bytes)
>> 2011-09-19 02:08:28 UTC STATEMENT:  UPDATE Attachments SET
>> ContentIndex = to_tsvector($1) WHERE id = $2
>>
>>
>> I think it is getting to a ticket that has too many unique words so it
>> can't index it and it critically fails and stops indexing any further.
>
> You are correct that this is because the content of one of the
> attachments contains too many unique words (after removing stopwords and
> doing stemming).  This is symptomatic of a pathological case -- for
> example, the entirety of "A Tale of Two Cities" (775K) creates a 121K
> tsvector and the entire corpus of the King James Bible (4.3M) creates a
> 160K tsvector.  In contrast, the contents of my /usr/share/dict/words
> (916K) produces a 524K tsvector, because there is so little word
> repetition.
>
> Knowing what text/plain or text/html corpus you have in your database
> which is blowing so significantly past this limit (generating a 3.8M
> tsvector is impressive) would be interesting.  I suspect the data in
> question is not actually textual data.  If you re-run
> rt-fulltext-indexer with --debug, the last attachment number it prints
> will tell you which attachment is the problematic one.

I used the --debug option and was able to find the ticket that caused
the problem. The ticket has a 12 MB, 125,000-line .txt attachment that
is a file-system listing with full paths, e.g.:

...
.//Accounts/Backup/New Folder
.//Accounts/Backup/PRM0124.ZIP
.//Accounts/Backup/PRM0206.ZIP
.//Accounts/Backup/PRM0207.ZIP
.//Accounts/Backup/PRM0214.1.ZIP
...

So in many cases it would have counted the whole line as a unique
word. That explains how the tsvector for this attachment got to over
3MB.
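To sanity-check that explanation, here is a rough sketch (in Python; the tokenizer is deliberately simplified, treating each whole line as one lexeme, which roughly matches how PostgreSQL's default parser emits whole paths as "file" tokens — the real lexemes may differ):

```python
# Rough estimate of unique-lexeme growth for a file listing.
# Assumes each full path becomes a single unique lexeme, per the
# "whole line counted as a unique word" observation above.

def estimate_tsvector_bytes(lines):
    """Approximate tsvector size as the total length of unique lexemes."""
    unique = set(line.strip() for line in lines if line.strip())
    return sum(len(w) + 1 for w in unique), len(unique)

# Prose repeats words heavily, so the unique-lexeme set stays small:
prose = ["the quick brown fox"] * 1000
# A path listing has almost no repetition, so every line is unique:
listing = [".//Accounts/Backup/PRM%05d.ZIP" % i for i in range(1000)]

prose_bytes, prose_unique = estimate_tsvector_bytes(prose)
listing_bytes, listing_unique = estimate_tsvector_bytes(listing)
print(prose_unique, listing_unique)  # 1 vs 1000
```

Scaled up, 125,000 unique paths of roughly 30 bytes each comes to well over 3 MB of lexemes, which lines up with the 3,831,236-byte tsvector in the error.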

>
>> I would appreciate some advice on how I can proceed with getting the
>> rest of my data indexed. I think any of the following would be
>> suitable but I don't know how to implement them (I am not a coder or a
>> dba) and could use some help. Options:
>>
>> - modify the rt-fulltext-indexer script to truncate strings that are
>> "too long for tsvector". or
>
> As pointed out above, long corpuses can generate perfectly reasonably
> sized tsvectors.  Truncating your input strings before indexing will
> yield false negatives in perfectly reasonable text; as such, the change
> from the wiki will not be taken into core.
>
>> - modify the rt-fulltext-indexer script to skip tickets that have that
>> issue and continue indexing other tickets. or
>
> rt-fulltext-indexer currently iterates every attachment content and
> updates the tsvector one at a time; as such, modifying it to trap the
> update with an eval {} block and continue for particular error cases
> should be completely feasible.
>
>> - find out which ticket is causing the problem (hopefully only one)
>> and maybe I can delete it before running the rt-fulltext-indexer
>> script. or
>
> As I noted above, I suspect the row in question is not actually textual
> data, despite being marked text/plain.  As I noted above, running with
> --debug may shed some light on the contents which are at issue.
>  - Alex

I have deleted (using shredder) the problem ticket and I am re-running
"/opt/rt4/sbin/rt-fulltext-indexer-mod --all". It has been running for
a few minutes now, so it looks like it has got past the problem and is
working. If it hits another problem, I now know how to work around it.

Thanks very much for your help Alex.

Regards,
Anthony

RT Training Sessions (http://bestpractical.com/services/training.html)
*  Chicago, IL, USA  September 26 & 27, 2011
*  San Francisco, CA, USA  October 18 & 19, 2011
*  Washington DC, USA  October 31 & November 1, 2011
*  Melbourne VIC, Australia  November 28 & 29, 2011
*  Barcelona, Spain  November 28 & 29, 2011


Re: [rt-users] RT 4.0.2 postgresql fulltext - error doing initial indexing

2011-09-19 Thread Alex Vandiver
On Mon, 2011-09-19 at 13:24 +1000, fab junkmail wrote:
> 2011-09-19 02:08:28 UTC ERROR:  string is too long for tsvector
> (3831236 bytes, max 1048575 bytes)
> 2011-09-19 02:08:28 UTC STATEMENT:  UPDATE Attachments SET
> ContentIndex = to_tsvector($1) WHERE id = $2
> 
> 
> I think it is getting to a ticket that has too many unique words so it
> can't index it and it critically fails and stops indexing any further.

You are correct that this is because the content of one of the
attachments contains too many unique words (after removing stopwords and
doing stemming).  This is symptomatic of a pathological case -- for
example, the entirety of "A Tale of Two Cities" (775K) creates a 121K
tsvector and the entire corpus of the King James Bible (4.3M) creates a
160K tsvector.  In contrast, the contents of my /usr/share/dict/words
(916K) produces a 524K tsvector, because there is so little word
repetition.

Knowing what text/plain or text/html corpus you have in your database
which is blowing so significantly past this limit (generating a 3.8M
tsvector is impressive) would be interesting.  I suspect the data in
question is not actually textual data.  If you re-run
rt-fulltext-indexer with --debug, the last attachment number it prints
will tell you which attachment is the problematic one.

> I would appreciate some advice on how I can proceed with getting the
> rest of my data indexed. I think any of the following would be
> suitable but I don't know how to implement them (I am not a coder or a
> dba) and could use some help. Options:
> 
> - modify the rt-fulltext-indexer script to truncate strings that are
> "too long for tsvector". or

As pointed out above, long corpuses can generate perfectly reasonably
sized tsvectors.  Truncating your input strings before indexing will
yield false negatives in perfectly reasonable text; as such, the change
from the wiki will not be taken into core.

> - modify the rt-fulltext-indexer script to skip tickets that have that
> issue and continue indexing other tickets. or

rt-fulltext-indexer currently iterates every attachment content and
updates the tsvector one at a time; as such, modifying it to trap the
update with an eval {} block and continue for particular error cases
should be completely feasible.
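That trap-and-skip pattern could look roughly like the following (sketched in Python rather than the Perl of the actual script; the row structure, the update callback, and the error handling here are stand-ins for illustration, not RT's real API):

```python
# Sketch of the "trap the update and continue" pattern suggested above.
# fake_update stands in for the per-attachment UPDATE that
# rt-fulltext-indexer issues; in the real script this would be a DBI
# call wrapped in eval {}.

MAX_TSVECTOR = 1048575  # PostgreSQL's tsvector size limit, in bytes

def index_all(rows, update):
    """Try to index every row; record ids that fail instead of aborting."""
    skipped = []
    for row_id, content in rows:
        try:
            update(row_id, content)
        except ValueError as err:  # the DB driver would raise on the limit
            skipped.append(row_id)
            print("skipping attachment %d: %s" % (row_id, err))
    return skipped

def fake_update(row_id, content):
    # Illustrative stand-in: reject content past the tsvector limit.
    if len(content) > MAX_TSVECTOR:
        raise ValueError("string is too long for tsvector")

rows = [(1, "normal text"), (2, "x" * (MAX_TSVECTOR + 1)), (3, "more text")]
print(index_all(rows, fake_update))  # [2]
```

The key design point is that the failure is per-row: one oversized attachment is logged and skipped while the remaining attachments still get indexed.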

> - find out which ticket is causing the problem (hopefully only one)
> and maybe I can delete it before running the rt-fulltext-indexer
> script. or

As I noted above, I suspect the row in question is not actually textual
data, despite being marked text/plain.  As I noted above, running with
--debug may shed some light on the contents which are at issue.
 - Alex




[rt-users] RT 4.0.2 postgresql fulltext - error doing initial indexing

2011-09-18 Thread fab junkmail
Hi,

I am testing upgrading from RT 3.6.7 on mysql to RT 4.0.2 on
postgresql 8.4.8 (debian) and enabling full text search. The database
has more than 3 tickets.

I have been able to convert the database to postgresql and update RT
to 4.0.2. I have done the set up for the full text search using
defaults except for selecting GIN index instead of the default.

When running "/opt/rt4/sbin/rt-fulltext-indexer --all" the indexer
runs for hours, emitting occasional minor warnings such as:

[Thu Sep 15 01:56:27 2011] [warning]: NOTICE:  word is too long to be indexed
DETAIL:  Words longer than 2047 characters are ignored.
(/opt/rt4/sbin/rt-fulltext-indexer:370)

I think that is ok.

However, the indexer eventually fails and stops with this error:

[Mon Sep 19 02:08:28 2011] [warning]: DBD::Pg::db do failed: ERROR:
string is too long for tsvector (3831236 bytes, max 1048575 bytes) at
/opt/rt4/sbin/rt-fulltext-indexer line 370.
(/opt/rt4/sbin/rt-fulltext-indexer:370)
[Mon Sep 19 02:08:28 2011] [crit]: DBD::Pg::db do failed: ERROR:
string is too long for tsvector (3831236 bytes, max 1048575 bytes) at
/opt/rt4/sbin/rt-fulltext-indexer line 370.
(/opt/rt4/sbin/../lib/RT.pm:340)
DBD::Pg::db do failed: ERROR:  string is too long for tsvector
(3831236 bytes, max 1048575 bytes) at
/opt/rt4/sbin/rt-fulltext-indexer line 370.

related postgresql log:

2011-09-19 02:08:28 UTC ERROR:  string is too long for tsvector
(3831236 bytes, max 1048575 bytes)
2011-09-19 02:08:28 UTC STATEMENT:  UPDATE Attachments SET
ContentIndex = to_tsvector($1) WHERE id = $2


I think it is getting to a ticket that has too many unique words so it
can't index it and it critically fails and stops indexing any further.

If I run "/opt/rt4/sbin/rt-fulltext-indexer --all" again, it stops
with that error again quite quickly (after about a minute).

I tested doing some content searches for a particular word (using full
text search) and not all tickets that contain that word are found.
With full text turned off (content search still enabled) all tickets
with that word are found.

So I think the indexer has not been able to finish indexing all
tickets, because a problem with one ticket halts the indexing.

I notice that an older wiki page, PostgreSQLFullText (from before RT
had full-text indexing built in,
http://requesttracker.wikia.com/wiki/PostgreSQLFullText), mentions
limiting the size of tsvector it will attempt to index: "So I am
adding the tsvectors only to those entries with a size < 500KB"

I think maybe this has not been implemented in RT 4.0.2?

I would appreciate some advice on how I can proceed with getting the
rest of my data indexed. I think any of the following would be
suitable but I don't know how to implement them (I am not a coder or a
dba) and could use some help. Options:

- modify the rt-fulltext-indexer script to truncate strings that are
"too long for tsvector". or
- modify the rt-fulltext-indexer script to skip tickets that have that
issue and continue indexing other tickets. or
- find out which ticket is causing the problem (hopefully only one)
and maybe I can delete it before running the rt-fulltext-indexer
script. or
- maybe someone can suggest a better alternative

Please help.

Regards,
Anthony
