String handling

Marvin Humphrey Thu, 04 Sep 2008 17:10:12 -0700

Greets,

Problem #1: C-style NULL-terminated strings are hateful. They present innumerable opportunities for buffer-overrun security holes and segfaults. You can't know the encoding. You can't know their allocated capacity.

The solution is to pass around string data using a struct-based string class that tightly associates character data with the object's length and allocated capacity. This presents some inconveniences in terms of initializing and argument passing, but in a library as large as Lucy, it is justified on the security rationale alone.
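To make the idea concrete, here is a minimal sketch of such a counted string in plain C. The SketchBuf name and its functions are hypothetical and are not part of Lucy or KinoSearch; the only point is that every write checks the recorded capacity before it copies.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string; the names are illustrative, not Lucy's API. */
typedef struct {
    char   *ptr;   /* character data, not necessarily NUL-terminated */
    size_t  size;  /* bytes in use */
    size_t  cap;   /* bytes allocated */
} SketchBuf;

/* Allocate an empty buffer.  Returns 0 on success, -1 on failure. */
static int
SketchBuf_init(SketchBuf *self, size_t cap)
{
    if (cap == 0) { cap = 1; }
    self->ptr  = malloc(cap);
    self->size = 0;
    self->cap  = self->ptr ? cap : 0;
    return self->ptr ? 0 : -1;
}

/* Append len bytes, growing the allocation first when needed, so the
 * copy can never run past the end of the buffer. */
static int
SketchBuf_cat(SketchBuf *self, const char *src, size_t len)
{
    size_t needed;
    assert(self->size <= self->cap);
    if (len > (size_t)-1 - self->size) { return -1; }  /* size_t overflow */
    needed = self->size + len;
    if (needed > self->cap) {
        size_t new_cap = needed > (size_t)-1 / 2 ? needed : needed * 2;
        char  *new_ptr = realloc(self->ptr, new_cap);
        if (new_ptr == NULL) { return -1; }  /* old buffer is still valid */
        self->ptr = new_ptr;
        self->cap = new_cap;
    }
    memcpy(self->ptr + self->size, src, len);
    self->size = needed;
    return 0;
}

/* Release the allocation and reset the fields. */
static void
SketchBuf_destroy(SketchBuf *self)
{
    free(self->ptr);
    self->ptr  = NULL;
    self->size = 0;
    self->cap  = 0;
}
```

The inconvenience mentioned above shows up right away: callers have to call an init routine and pass a length with every append, rather than handing around a bare char*.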

Problem #2: The various host languages that Lucy will be bound to have different ways of representing string data. With Java, strings use UTF-16; with Perl, strings are either UTF-8 or Latin-1 but we can force them to be UTF-8; etc.

I believe the solution to this is to create an abstract CharBuf base class that represents a Unicode string, plus CharBuf8 and CharBuf16 subclasses. As much as possible, we have Lucy internals deal with the opaque string parent class.
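One way such a hierarchy could be laid out in C is with a vtable pointer in the parent struct. The sketch below is only an illustration under that assumption: the SketchStr names are hypothetical, a single method (length in code points) stands in for the full interface, and the input is assumed to be well-formed UTF-8 or UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct SketchStr SketchStr;

/* One slot per operation that depends on the encoding. */
typedef struct {
    size_t (*length)(const SketchStr *self);  /* length in code points */
} SketchStrVTable;

/* Abstract parent: callers hold a SketchStr* and never see the encoding. */
struct SketchStr {
    const SketchStrVTable *vtable;
};

typedef struct {
    SketchStr      base;  /* first member, so a cast to SketchStr* is valid */
    const uint8_t *ptr;
    size_t         size;  /* in bytes */
} SketchStr8;

typedef struct {
    SketchStr       base;
    const uint16_t *ptr;
    size_t          size;  /* in 16-bit code units */
} SketchStr16;

/* UTF-8: every byte except a continuation byte (10xxxxxx) starts a
 * code point. */
static size_t
SketchStr8_length(const SketchStr *self)
{
    const SketchStr8 *s = (const SketchStr8*)self;
    size_t i, count = 0;
    for (i = 0; i < s->size; i++) {
        if ((s->ptr[i] & 0xC0) != 0x80) { count++; }
    }
    return count;
}

/* UTF-16: every unit except a low surrogate starts a code point. */
static size_t
SketchStr16_length(const SketchStr *self)
{
    const SketchStr16 *s = (const SketchStr16*)self;
    size_t i, count = 0;
    for (i = 0; i < s->size; i++) {
        if (s->ptr[i] < 0xDC00 || s->ptr[i] > 0xDFFF) { count++; }
    }
    return count;
}

static const SketchStrVTable SKETCHSTR8  = { SketchStr8_length };
static const SketchStrVTable SKETCHSTR16 = { SketchStr16_length };

static SketchStr8
SketchStr8_make(const uint8_t *ptr, size_t size)
{
    SketchStr8 retval;
    retval.base.vtable = &SKETCHSTR8;
    retval.ptr         = ptr;
    retval.size        = size;
    return retval;
}

static SketchStr16
SketchStr16_make(const uint16_t *ptr, size_t size)
{
    SketchStr16 retval;
    retval.base.vtable = &SKETCHSTR16;
    retval.ptr         = ptr;
    retval.size        = size;
    return retval;
}

/* Encoding-agnostic entry point: dispatch through the vtable. */
static size_t
SketchStr_length(const SketchStr *self)
{
    assert(self != NULL && self->vtable != NULL);
    return self->vtable->length(self);
}
```

Code that only ever calls SketchStr_length() works the same whether the host supplied UTF-8 or UTF-16, which is the property we want from the opaque parent class.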

The KinoSearch code base currently contains a class named "CharBuf" which is basically CharBuf8. I've been going through and modifying sections which accessed the char* internal string directly and changing them to use a higher-level abstraction; I haven't run into any showstoppers yet.


There are 5 main ways that strings will be used in Lucy:

  * Analysis/Tokenizing/Indexing.
  * File path manipulation.
  * Query parsing and construction.
  * Field name specification.
  * Error messages.

Of those list items, the only one that requires intimate interaction with the low-level encoding of the string for efficiency reasons is analysis.

The others often involve a lot of interaction with the host. We want to avoid malloc'ing new temp copies of strings whenever possible in the binding code; in the KS binding code, I've avoided that by allocating a "ViewCharBuf" on the stack and then pointing it at the Perl scalar's string data. Here's the routine used to initialize a ViewCharBuf:


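  /* Build a ViewCharBuf on the stack that refers to an existing buffer.
   * Nothing is allocated and no string data is copied. */
  static CHY_INLINE lucy_ViewCharBuf
  lucy_VCB_make_str(char *ptr, size_t size)
  {
      lucy_ViewCharBuf retval;
      retval.ref.count = 1;
      retval._         = (lucy_VirtualTable*)&LUCY_VIEWCHARBUF;
      retval.cap       = 0;     /* the view owns no allocation */
      retval.size      = size;
      retval.ptr       = ptr;
      return retval;
  }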

It gets used like so:

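  lucy_Obj*
  lucy_XSBind_hash_fetch(lucy_Hash *hash, SV *key_sv)
  {
      STRLEN size;  /* SvPVutf8 is a macro that stores the length here */
      char *ptr = SvPVutf8(key_sv, size);
      /* The view lives on the stack and is only valid during this call. */
      lucy_ViewCharBuf key = lucy_VCB_make_str(ptr, size);
      return Lucy_Hash_Fetch(hash, (lucy_CharBuf*)&key);
  }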

I think this approach should work for other host languages as well.
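As a rough illustration of why the pattern is not tied to Perl, here is a self-contained sketch with no host headers at all. All names (SketchView, sketch_field_lookup, sketch_bind_lookup) are hypothetical, and a fixed table of field names stands in for the hash. The only thing the binding needs from the host is a pointer and a length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning view; cap stays 0 because nothing was
 * allocated on its behalf. */
typedef struct {
    const char *ptr;
    size_t      size;
    size_t      cap;
} SketchView;

static SketchView
SketchView_make(const char *ptr, size_t size)
{
    SketchView retval;
    retval.ptr  = ptr;
    retval.size = size;
    retval.cap  = 0;
    return retval;
}

/* Stand-in for a hash: a fixed table of field names. */
static const char *const sketch_fields[] = { "title", "body", "url" };

/* Return the index of the matching field name, or -1 if none matches.
 * The key is compared by length and bytes, so it need not be
 * NUL-terminated. */
static int
sketch_field_lookup(const SketchView *key)
{
    size_t i;
    assert(key != NULL);
    for (i = 0; i < sizeof(sketch_fields) / sizeof(sketch_fields[0]); i++) {
        if (strlen(sketch_fields[i]) == key->size
            && memcmp(sketch_fields[i], key->ptr, key->size) == 0
        ) {
            return (int)i;
        }
    }
    return -1;
}

/* Binding-side glue: wrap the host's buffer in a stack-allocated view
 * for the duration of one call.  No malloc, no copy. */
static int
sketch_bind_lookup(const char *host_ptr, size_t host_len)
{
    SketchView key = SketchView_make(host_ptr, host_len);
    return sketch_field_lookup(&key);
}
```

Because the view carries its own length, the host can pass a slice of a larger buffer, e.g. the first four bytes of "body:lucene", without terminating or copying it first. The view must not outlive the host's buffer.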

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
