Hi,
This is probably not a Lucy issue, but something I first noticed while
using Lucy on machines with different Perl versions (using CentOS 5.x
and CentOS 6).
On the machines with Perl 5.8.8 the indexer runs without complaint.
I have no idea what it does when it encounters UTF-8 text, which is
fine in my case since we don't really have to deal with UTF-8.
However, on machines where Perl 5.10.1 is installed (CentOS 6),
indexing fails when bad UTF-8 (in this case some nice Japanese fare)
is encountered:
...Malformed UTF-8 character... these are ignored OK.
but then:
...Invalid UTF-8, aborting:
lucy_ViewCB_assign_str at
.../projects/lucy/perl/../core/Lucy/Object/CharBuf.c line 848
at /usr/local/.../myscript line 2201
eval {...} called at ...
followed by
...Expected doc id 4 but got 5
lucy_DocWriter_add_inverted_doc at
.../projects/lucy/perl/../core/Lucy/Index/DocWriter.c line 97
...
and it never recovers.
Any ideas what I should be looking for? Ideally, it would be great if
I could get Perl 5.10 to behave like 5.8. I'm tempted to just strip
out the invalid bytes with "iconv -c --from UTF-8 --to UTF-8", unless
I can find a nice non-regex (for performance) CPAN module that either
strips out bad UTF-8 or filters out all non-ASCII unconditionally.
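For what it's worth, here is a rough sketch of the two options in
Perl, using only the core Encode module and tr/// (no regex engine).
The helper names strip_invalid_utf8 and strip_non_ascii are made up
for illustration, and both assume the input is a raw byte string, not
an already-decoded one. I have not tried this against Lucy itself.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode bytes as strict UTF-8 and drop anything
# that is not well-formed. Returns a decoded character string.
sub strip_invalid_utf8 {
    my ($bytes) = @_;
    # With FB_DEFAULT, each malformed sequence becomes U+FFFD.
    my $text = Encode::decode('UTF-8', $bytes, Encode::FB_DEFAULT);
    # Delete the substitution characters. Caveat: this also removes
    # any genuine U+FFFD that was already present in the input.
    $text =~ tr/\x{FFFD}//d;
    return $text;
}

# Hypothetical helper: drop every byte outside the 7-bit ASCII range,
# i.e. remove all multi-byte UTF-8 unconditionally.
sub strip_non_ascii {
    my ($bytes) = @_;
    (my $ascii = $bytes) =~ tr/\x00-\x7F//cd;
    return $ascii;
}
```

The idea would be to run each field value through one of these before
handing the document to the indexer, so that nothing malformed ever
reaches it.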
sigh