Re: [r6rs-discuss] Strings

Thomas Lord Sun, 25 Mar 2007 14:21:36 -0800

Per Bothner wrote:

Thomas Lord wrote:

Fast pattern matching.  Fast search of texts displaying
various kinds of sorting.  Fast fingerprinting.  Quite
a few more, I'm sure, even though those already broad
categories.


True, though those are pretty esoteric, written as
specialized low-level libraries.

More importantly note that in most cases you don't need to
move by N "characters" - you need to move by N *code
units*, which would be an O(1) operation.  For fast
string matching/searching, as long as the subject
string and the pattern use the same encoding, you can
treat multi-code-unit characters equivalent to a sequence
of code units: Searching a sequence of characters is
equivalent to searching for a sequence of code units.


I'm not sure what you are trying to argue or if you just
mean to pick my brain about implementing strings, buffers,
markers, properties, overlays etc.

It is true that if a particular UTF-* internal representation
is presumed then one way to get good performance is to
translate algorithms over code points into algorithms over
encoding units.  So?

With that set of assumptions (high locality of mutations and
a 25% waste of space) you've leant support to your formal
proposal for STRING-APPEND! but not any for markers.
On the contrary, you should be happy with a STRING-REPLACE!
that replaces an arbitrary substring, not necessarily preserving
length.   Your implementation is free to use gap buffers
internally and, if your use assumptions hold for a given
application, it will perform swell -- no markers needed.


The problem is with the API for specifying the substring:
Assume the buffer contains UTF-8 or UTF-16 code units.
If the position is specified as a character (scalar value)
index then we need to scan (or use some kind of index) to
map that index into a buffer offset.


In the assumptions you asked me to consider, performance
only matters for appending to a string and we can assume
that in most cases space has already been reserved for this.
Now you are revoking the assumptions you asked me to
reason under.     We are going in circles.


-t

If the position is
specified as code unit offset, or a unspecified "magic
cookie" then we can index directly into the buffer.

An advantage of using opaque "marker" objects rather than
indexes is that it's a way to leave the representation
up to the implementation, so it can use whatever is most
efficient.







_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss

Re: [r6rs-discuss] Strings

Reply via email to