Now I'm seeing this error against the latest svn trunk:

Invalid UTF-8 sequence in '/opt/pij/search/sources.index.ks/seg_1/lextemp-12464-to-1353267' at byte 12466, kino_TextTermStepper_read_delta at ../core/KinoSearch/FieldType/TextType.c line 145

The frustrating thing is that I just spent two weeks making sure my files are all valid UTF-8 (same old story -- a legacy db with a mix of latin1, cp1252, and UTF-8, sometimes all in the same string!), and they all pass my Search::Tools::UTF8 checks.

What's odd is that the 'Invalid UTF-8 sequence' error is thrown during commit() rather than during add_doc(), which makes me think this isn't necessarily an encoding problem with my docs. I see that all text strings are forced to UTF-8 in add_doc() via invert_doc() and the SvPVutf8 call, so presumably they should all be UTF-8 by the time they reach commit()?

-- 
Peter Karman . http://peknet.com/ . [email protected]
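
For anyone who wants to double-check raw document bytes outside of Perl, here is a minimal sketch of a strict UTF-8 scan in C. It is not KinoSearch's own check and says nothing about how the lextemp file is laid out; the function name first_invalid_utf8 is made up for illustration. It only assumes the usual strict rules: no stray continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF. Like the error message above, it reports a byte offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of KinoSearch.
 * Returns the offset of the first byte that starts an invalid UTF-8
 * sequence, or len if the whole buffer is valid. */
size_t first_invalid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        unsigned char b = buf[i];
        size_t need;        /* number of continuation bytes expected */
        unsigned long min;  /* smallest code point legal at this length */
        unsigned long cp;   /* code point being decoded */
        size_t k;

        if (b < 0x80) {                 /* plain ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            need = 1; min = 0x80;    cp = b & 0x1F;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; min = 0x800;   cp = b & 0x0F;
        } else if ((b & 0xF8) == 0xF0) {
            need = 3; min = 0x10000; cp = b & 0x07;
        } else {
            return i;                   /* stray continuation or 0xF8..0xFF */
        }

        if (len - i <= need)
            return i;                   /* sequence truncated at end of buffer */

        for (k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return i;               /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return i;

        i += need + 1;
    }
    return len;
}
```

With this sketch, the five bytes of "caf\xC3\xA9" come back as fully valid (the return value equals the length), while the Latin-1 bytes "caf\xE9" are flagged at offset 3 and a bare cp1252 quote byte 0x93 is flagged wherever it occurs.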
