Now I'm seeing this error against the latest svn trunk:

Invalid UTF-8 sequence in
'/opt/pij/search/sources.index.ks/seg_1/lextemp-12464-to-1353267' at
byte 12466, kino_TextTermStepper_read_delta at
../core/KinoSearch/FieldType/TextType.c line 145

The frustrating thing is that I just spent 2 weeks making sure my files
are all valid UTF-8 (same old story -- legacy db with mix of latin1,
cp1252, and UTF-8, sometimes all in the same string!), and they all pass
my Search::Tools::UTF8 checks.
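
As an independent cross-check on the source files, a strict decode can report the byte offset of the first malformed sequence. The following is only a minimal sketch in Python; it is not part of KinoSearch or Search::Tools, and the function name first_invalid_utf8 is made up for illustration.

```python
def first_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration; relies on Python's strict
    UTF-8 decoder, which rejects overlong forms and stray lead bytes.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte.
        return err.start
    return None


# Well-formed: e-acute encoded as the two bytes 0xC3 0xA9.
print(first_invalid_utf8("caf\u00e9".encode("utf-8")))

# Malformed: a bare Latin-1 e-acute (0xE9) followed by ASCII.
print(first_invalid_utf8(b"caf\xe9 au lait"))
```

Note that a check like this only proves the bytes are well-formed. A string that was double-encoded somewhere along the way is still valid UTF-8 and will pass.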

What's odd is that the 'Invalid UTF-8 sequence' error is thrown during
commit() rather than during add_doc(), which makes me think this may not
be an encoding problem with my docs at all. I see that all text strings
are forced to UTF-8 in add_doc() via invert_doc() and the SvPVutf8 call,
so presumably they should all be valid UTF-8 by the time they reach
commit()?
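
The reasoning above can be illustrated with a small sketch. It assumes that an upgrade of the SvPVutf8 kind treats bytes not already flagged as UTF-8 as Latin-1 and re-encodes them, which is my understanding of Perl's behavior rather than something confirmed in this thread. The Python function upgrade_latin1_to_utf8 is a hypothetical stand-in, not a KinoSearch or Perl API.

```python
def upgrade_latin1_to_utf8(raw: bytes) -> bytes:
    """Hypothetical stand-in for a Latin-1 to UTF-8 upgrade.

    Every byte 0x00-0xFF is a valid Latin-1 code point, so the decode
    step cannot fail and the result is always well-formed UTF-8.
    """
    return raw.decode("latin-1").encode("utf-8")


# A string mixing a bare Latin-1 byte (0xE9) with real UTF-8 (0xC3 0xA9).
mixed = b"caf\xe9 \xc3\xa9"
upgraded = upgrade_latin1_to_utf8(mixed)

# Strict decoding succeeds: the output is well-formed, although the
# bytes that were already UTF-8 have now been double-encoded.
upgraded.decode("utf-8")
```

Under that assumption, unflagged input cannot come out of the upgrade as malformed UTF-8, which fits the suspicion that the bad bytes arise later. The sketch does not cover scalars that already carry the UTF-8 flag, since those would presumably pass through without being re-checked.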

-- 
Peter Karman  .  http://peknet.com/  .  [email protected]
