On Fri, 03 Dec 2010 22:08:37 -0500, Jerry Quinn <[email protected]>
wrote:
> I'm actually working in C++ but keeping an eye on things going on in
> D-land. The kind of stuff we do is to normalize text in preparation for
> natural language processing.
>
> As a simple example, let's say you want to use a set of regexes to
> identify patterns in text. You want to return the offsets of each regex
> that matches. However, before the regexes run, you replace all html tags
> with a placeholder, so the regexes can easily span tags without worrying
> about the tags' contents.
I'm assuming you are not changing the length of the string, or is that not
correct?
> Before you return the results to the user, though, you need to
> translate those offsets back to ones for the original string.
Hm... I guess you must be changing the lengths if the offsets are
different. That seems odd; wouldn't you encounter performance issues when
processing large documents?
> Everything is Unicode of course, and we care about processing Unicode
> code points, but we want to maintain UTF-8 storage underneath to keep
> size down.
>
> In reality, we're often doing things like single-character
> normalizations as well as larger spans, but still need to maintain
> alignment to the original data.
>
> As long as this is reasonable to do, I'm fine. I just wasn't sure from
> the descriptions I was seeing.
What you will have is access to the underlying char[] array, which should
give you full edit access. I just don't want strings to be easily
editable, since doing so correctly can be very difficult.
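In C++ terms, the design amounts to something like a read-only string type
whose backing buffer is reachable only through an explicit escape hatch.
This toy class is purely my illustration of that shape (the names are
mine, and it says nothing about how the D type is actually built):

```cpp
#include <cstddef>
#include <string>
#include <utility>

// A read-only string wrapper: normal use can only inspect characters,
// while in-place edits require going through a deliberate, greppable
// accessor -- mirroring "not easily editable, but char[] access exists".
class istring {
    std::string buf_;
public:
    explicit istring(std::string s) : buf_(std::move(s)) {}
    char operator[](std::size_t i) const { return buf_[i]; }
    std::size_t size() const { return buf_.size(); }
    // explicit escape hatch for in-place edits
    char* mutable_data() { return buf_.data(); }  // non-const data(): C++17
};
```

The point of the pattern is that casual code paths stay read-only, and the
few places that really do edit in place are easy to find and audit.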
Any offsets to dchar code-points in the string will match offsets to char
code-units. In effect, you are always indexing by code-unit, even though
with the string type you get code-points back.
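In C++ terms (since that's what you're working in), "index by code-unit,
get code points back" amounts to decoding whatever code point starts at a
given byte offset. A minimal sketch of such a decoder, with no validation
of malformed UTF-8 (the function name is mine):

```cpp
#include <cstddef>
#include <string>

// Decode the UTF-8 code point starting at byte offset i, advancing i past
// it. Assumes well-formed UTF-8; no error handling -- a sketch only.
char32_t decode_at(const std::string& s, std::size_t& i) {
    unsigned char b = s[i++];
    if (b < 0x80) return b;                        // 1-byte (ASCII) sequence
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;  // continuation bytes
    char32_t cp = b & (0x3F >> extra);             // payload bits of the lead
    while (extra--) cp = (cp << 6) | (s[i++] & 0x3F);
    return cp;
}
```

The offsets you pass in and get back are always byte (code-unit) offsets,
even though each call hands you a whole code point, which is exactly the
indexing model described above.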
It should be as simple as accessing a member (like str.data) or casting
(e.g. cast(char[])str). I'm unsure yet whether it's dangerous enough to
require the cast.
-Steve