On Wed, Oct 12, 2011 at 05:57:33PM +0200, goran kent wrote:
> > It sounds like, without seeing a reproduce-able test case, that Lucy is
> > choking appropriately on malformed UTF-8.
>
> Absolutely.  What's interesting is that the same Lucy code does not
> choke on the other machines with the older Perl.
Lucy trusts that incoming data it has received from Perl is well-formed.
(Technically, it assumes that string data obtained via the XS routine
SvPVutf8() is well-formed UTF-8, notwithstanding the difference between
Perl's loose internal representation and the Unicode standard for UTF-8.)
We could add an index-time validity check, but that would slow down
indexing.

At search time, though, Lucy is reading from the file system rather than
receiving data from Perl -- and data from the file system cannot be
trusted.  Therefore, Lucy always performs validity checks when reading
what is ostensibly UTF-8 data out of an existing index.

I don't know of a mechanism whereby Lucy's behavior would change between
different versions of Perl.  In any case, having invalid UTF-8 in your
Perl scalars is bad news -- it can do things like crash the regex engine.
It will also lead to corrupt Lucy indexes that fail the search-time UTF-8
validity check.

How are you getting raw data into Perl?

> Anyway, I like the idea of rolling my own perl to be absolutely sure
> of coherence across my machines.

+1

Marvin Humphrey
