Greets,

Problem #1: C-style NUL-terminated strings are hateful. They present
innumerable opportunities for buffer-overrun security holes and
segfaults. You can't know their encoding. You can't know their
allocated capacity.
The solution is to pass around string data using a struct-based string
class that tightly associates character data with the object's length
and allocated capacity. This presents some inconveniences in terms of
initializing and argument passing, but in a library as large as Lucy,
it is justified on the security rationale alone.
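
To make the idea concrete, here is a minimal sketch of a string struct
that carries its own length and capacity, with an append routine that
checks bounds before writing. The names (DemoStr, DemoStr_init,
DemoStr_cat, DemoStr_destroy) are hypothetical and chosen only for
illustration; this is not the KinoSearch or Lucy implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only; not the actual Lucy/KinoSearch API. */
typedef struct {
    char   *ptr;   /* character data, not necessarily NUL-terminated */
    size_t  size;  /* bytes currently in use */
    size_t  cap;   /* bytes allocated */
} DemoStr;

/* Initialize an empty string with `cap` bytes reserved.
 * Returns 0 on success, -1 if allocation fails. */
static int
DemoStr_init(DemoStr *self, size_t cap)
{
    self->ptr  = cap ? malloc(cap) : NULL;
    self->size = 0;
    self->cap  = self->ptr ? cap : 0;
    return (cap && !self->ptr) ? -1 : 0;
}

/* Append `len` bytes from `src`, growing the buffer when needed.
 * Returns 0 on success, -1 on overflow or allocation failure. */
static int
DemoStr_cat(DemoStr *self, const char *src, size_t len)
{
    if (len == 0) { return 0; }
    if (len > SIZE_MAX - self->size) { return -1; }  /* size overflow */

    size_t needed = self->size + len;
    if (needed > self->cap) {
        /* Double the capacity until the new data fits. */
        size_t new_cap = self->cap ? self->cap : 16;
        while (new_cap < needed) {
            if (new_cap > SIZE_MAX / 2) { new_cap = needed; break; }
            new_cap *= 2;
        }
        char *new_ptr = realloc(self->ptr, new_cap);
        if (!new_ptr) { return -1; }
        self->ptr = new_ptr;
        self->cap = new_cap;
    }
    memcpy(self->ptr + self->size, src, len);
    self->size = needed;
    return 0;
}

/* Release the buffer and reset all fields. */
static void
DemoStr_destroy(DemoStr *self)
{
    free(self->ptr);
    self->ptr  = NULL;
    self->size = 0;
    self->cap  = 0;
}
```

Because every write goes through a routine that knows both size and
cap, the caller never has to guess whether the destination is large
enough.
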
Problem #2: The various host languages that Lucy will be bound to have
different ways of representing string data. With Java, strings use
UTF-16; with Perl, strings are either UTF-8 or Latin-1, but we can
force them to be UTF-8; etc.
I believe the solution to this is to create an abstract CharBuf base
class that represents a Unicode string, plus CharBuf8 and CharBuf16
subclasses. As much as possible, we have Lucy internals deal with the
opaque string parent class.
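
As a rough sketch of how that dispatch might look in C, the example
below gives a base struct a pointer to a table of function pointers,
and two subclasses supply their own implementation of a single
"length in code points" method. All names here (DemoCharBuf,
DemoVTable, DemoCB_length, and so on) are assumptions made up for the
example, and the routines assume well-formed UTF-8 and UTF-16 input;
the real class layout may well differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration only; not the actual Lucy class layout. */
typedef struct DemoCharBuf DemoCharBuf;

typedef struct {
    /* Length of the string in Unicode code points. */
    size_t (*length)(const DemoCharBuf *self);
} DemoVTable;

/* Opaque parent class: callers only see the vtable pointer. */
struct DemoCharBuf {
    const DemoVTable *vtable;
};

typedef struct {
    DemoCharBuf    base;  /* must be first so casts to the base work */
    const uint8_t *ptr;   /* UTF-8 code units */
    size_t         size;  /* number of bytes */
} DemoCharBuf8;

typedef struct {
    DemoCharBuf     base;
    const uint16_t *ptr;   /* UTF-16 code units */
    size_t          size;  /* number of 16-bit units */
} DemoCharBuf16;

/* UTF-8: count every byte that is not a continuation byte (10xxxxxx). */
static size_t
DemoCB8_length(const DemoCharBuf *self)
{
    const DemoCharBuf8 *cb = (const DemoCharBuf8*)self;
    size_t count = 0;
    for (size_t i = 0; i < cb->size; i++) {
        if ((cb->ptr[i] & 0xC0) != 0x80) { count++; }
    }
    return count;
}

/* UTF-16: count every unit that is not a low surrogate (DC00..DFFF). */
static size_t
DemoCB16_length(const DemoCharBuf *self)
{
    const DemoCharBuf16 *cb = (const DemoCharBuf16*)self;
    size_t count = 0;
    for (size_t i = 0; i < cb->size; i++) {
        if ((cb->ptr[i] & 0xFC00) != 0xDC00) { count++; }
    }
    return count;
}

static const DemoVTable DEMO_CB8_VTABLE  = { DemoCB8_length };
static const DemoVTable DEMO_CB16_VTABLE = { DemoCB16_length };

/* Encoding-agnostic entry point used by the rest of the library. */
static size_t
DemoCB_length(const DemoCharBuf *self)
{
    return self->vtable->length(self);
}
```

Code that only calls DemoCB_length never needs to know which encoding
sits behind the pointer, which is the property we want for most of the
internals.
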
The KinoSearch code base currently contains a class named "CharBuf"
which is basically CharBuf8. I've been going through and modifying
sections which accessed the internal char* string directly, changing
them to use higher-level abstractions; I haven't run into any
showstoppers yet.
There are 5 main ways that strings will be used in Lucy:

  * Analysis/Tokenizing/Indexing.
  * File path manipulation.
  * Query parsing and construction.
  * Field name specification.
  * Error messages.

Of those list items, the only one that requires intimate interaction
with the low-level encoding of the string for efficiency reasons is
analysis.
The others often involve a lot of interaction with the host. We want
to avoid malloc'ing new temp copies of strings whenever possible in
the binding code; in the KS binding code, I've avoided that by
allocating a "ViewCharBuf" on the stack and pointing it at the Perl
scalar's string data rather than copying that data. Here's the routine
used to initialize a ViewCharBuf:

    /* Wrap an existing string in a stack-allocated ViewCharBuf without
     * copying the character data. */
    static CHY_INLINE lucy_ViewCharBuf
    lucy_VCB_make_str(char *ptr, size_t size)
    {
        lucy_ViewCharBuf retval;
        retval.ref.count = 1;
        retval._         = (lucy_VirtualTable*)&LUCY_VIEWCHARBUF;
        retval.cap       = 0;     /* the view owns no allocation */
        retval.size      = size;
        retval.ptr       = ptr;   /* borrowed from the caller */
        return retval;
    }

It gets used like so:
    lucy_Obj*
    lucy_XSBind_hash_fetch(lucy_Hash *hash, SV *key_sv)
    {
        size_t size;
        char *ptr = SvPVutf8(key_sv, size);
        lucy_ViewCharBuf key = lucy_VCB_make_str(ptr, size);
        return Lucy_Hash_Fetch(hash, (lucy_CharBuf*)&key);
    }

I think this approach should work for other host languages as well.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/