Greets,

Problem #1: C-style NUL-terminated strings are hateful. They present innumerable opportunities for buffer-overrun security holes and segfaults. Nothing in a bare char* tells you the encoding, and nothing tells you the allocated capacity.

The solution is to pass around string data using a struct-based string class that tightly associates character data with the object's length and allocated capacity. This presents some inconveniences in terms of initializing and argument passing, but in a library as large as Lucy, it is justified on the security rationale alone.
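
As a rough illustration of the idea (not Lucy's or KinoSearch's actual layout: the SketchBuf name, its fields, and its functions are all hypothetical), a struct that carries size and capacity alongside the pointer lets an append routine check and grow the buffer instead of trusting the caller:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch only; not the real CharBuf definition. */
typedef struct SketchBuf {
    char   *ptr;   /* character data; NOT required to be NUL-terminated */
    size_t  size;  /* bytes currently in use */
    size_t  cap;   /* bytes allocated */
} SketchBuf;

/* Allocate an empty buffer with room for at least `cap` bytes.
 * Returns NULL if allocation fails. */
static SketchBuf*
SketchBuf_new(size_t cap)
{
    SketchBuf *self = malloc(sizeof(SketchBuf));
    if (self == NULL) { return NULL; }
    self->cap = cap ? cap : 1;
    self->ptr = malloc(self->cap);
    if (self->ptr == NULL) { free(self); return NULL; }
    self->size = 0;
    return self;
}

/* Append `len` bytes from `src`, growing the allocation first if needed.
 * Returns 0 on success, -1 on overflow or allocation failure. */
static int
SketchBuf_cat(SketchBuf *self, const char *src, size_t len)
{
    assert(self != NULL && (src != NULL || len == 0));
    if (len > SIZE_MAX - self->size) { return -1; }  /* size_t overflow */
    size_t needed = self->size + len;
    if (needed > self->cap) {
        size_t new_cap = self->cap;
        while (new_cap < needed) {
            if (new_cap > SIZE_MAX / 2) { new_cap = needed; break; }
            new_cap *= 2;
        }
        char *grown = realloc(self->ptr, new_cap);
        if (grown == NULL) { return -1; }
        self->ptr = grown;
        self->cap = new_cap;
    }
    if (len > 0) { memcpy(self->ptr + self->size, src, len); }
    self->size = needed;
    return 0;
}

/* Release the buffer and the struct itself. */
static void
SketchBuf_destroy(SketchBuf *self)
{
    if (self != NULL) { free(self->ptr); free(self); }
}
```

Because the length travels with the data, the append never has to scan for a terminator, and it cannot write past the allocation without first growing it.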

Problem #2: The various host languages that Lucy will be bound to have different ways of representing string data. In Java, strings use UTF-16; in Perl, strings are either UTF-8 or Latin-1, but we can force them to be UTF-8; and so on.

I believe the solution to this is to create an abstract CharBuf base class that represents a Unicode string, plus CharBuf8 and CharBuf16 subclasses. As much as possible, Lucy internals should deal only with the opaque parent class.
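
One way such a hierarchy might look in plain C is a parent struct holding a method table, with each subclass embedding the parent as its first member. The sketch below is an assumption for illustration only: the SketchStr names, the single length method, and the vtable layout are hypothetical and are not the proposed CharBuf API. It also assumes well-formed UTF-8 and UTF-16 input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct SketchStr SketchStr;

/* Method table shared by every instance of one subclass (hypothetical). */
typedef struct SketchStrVTable {
    size_t (*length)(const SketchStr *self);  /* length in code points */
} SketchStrVTable;

/* Abstract parent: callers see only this type. */
struct SketchStr {
    const SketchStrVTable *vtable;
};

/* UTF-8 subclass. `base` must come first so that a SketchStr8* can be
 * cast to a SketchStr*. */
typedef struct SketchStr8 {
    SketchStr      base;
    const uint8_t *ptr;
    size_t         size;  /* in bytes */
} SketchStr8;

/* UTF-16 subclass. */
typedef struct SketchStr16 {
    SketchStr       base;
    const uint16_t *ptr;
    size_t          size;  /* in 16-bit code units */
} SketchStr16;

static size_t
SketchStr8_length(const SketchStr *self)
{
    const SketchStr8 *s = (const SketchStr8*)self;
    size_t count = 0;
    for (size_t i = 0; i < s->size; i++) {
        /* Count every byte except continuation bytes (10xxxxxx). */
        if ((s->ptr[i] & 0xC0) != 0x80) { count++; }
    }
    return count;
}

static size_t
SketchStr16_length(const SketchStr *self)
{
    const SketchStr16 *s = (const SketchStr16*)self;
    size_t count = 0;
    for (size_t i = 0; i < s->size; i++) {
        /* Count every unit except low surrogates (0xDC00..0xDFFF). */
        if ((s->ptr[i] & 0xFC00) != 0xDC00) { count++; }
    }
    return count;
}

static const SketchStrVTable SKETCHSTR8_VT  = { SketchStr8_length };
static const SketchStrVTable SKETCHSTR16_VT = { SketchStr16_length };

/* Encoding-agnostic entry point: dispatches through the method table. */
static size_t
SketchStr_length(const SketchStr *self)
{
    assert(self != NULL && self->vtable != NULL);
    return self->vtable->length(self);
}
```

Code that calls only SketchStr_length() never needs to know which encoding sits behind the pointer; only the subclass methods touch raw bytes or code units.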

The KinoSearch code base currently contains a class named "CharBuf" which is basically CharBuf8. I've been going through the sections which accessed the internal char* string directly and changing them to use higher-level abstractions; I haven't run into any showstoppers yet.

There are 5 main ways that strings will be used in Lucy:

  * Analysis/Tokenizing/Indexing.
  * File path manipulation.
  * Query parsing and construction.
  * Field name specification.
  * Error messages.

Of those list items, the only one that requires intimate interaction with the low-level encoding of the string for efficiency reasons is analysis.

The others often involve a lot of interaction with the host. We want to avoid malloc'ing new temp copies of strings whenever possible in the binding code; in the KS binding code, I've avoided that by allocating a "ViewCharBuf" on the stack and pointing it at the Perl scalar's string buffer rather than copying the data. Here's the routine used to initialize a ViewCharBuf:

  static CHY_INLINE lucy_ViewCharBuf
  lucy_VCB_make_str(char *ptr, size_t size)
  {
      lucy_ViewCharBuf retval;
      retval.ref.count = 1;
      retval._     = (lucy_VirtualTable*)&LUCY_VIEWCHARBUF;
      retval.cap   = 0;
      retval.size  = size;
      retval.ptr   = ptr;
      return retval;
  }

It gets used like so:

  lucy_Obj*
  lucy_XSBind_hash_fetch(lucy_Hash *hash, SV *key_sv)
  {
      size_t size;
      char *ptr = SvPVutf8(key_sv, size);
      lucy_ViewCharBuf key = lucy_VCB_make_str(ptr, size);
      return Lucy_Hash_Fetch(hash, (lucy_CharBuf*)&key);
  }

I think this approach should work for other host languages as well.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
