On Fri, 03 Dec 2010 22:08:37 -0500, Jerry Quinn <[email protected]> wrote:

> I'm actually working in C++ but keeping an eye on things going on
> in D-land.  The kind of stuff we do is to normalize text in preparation
> for natural language processing.

> As a simple example, let's say you want to use a set of regexes to identify
> patterns in text. You want to return the offset of each match. However,
> before the regexes run, you replace all HTML tags with a placeholder, so
> patterns can easily span tags without worrying about their contents.

I'm assuming you are not changing the length of the string, or is that not correct?

> Before you return the results to the user, though, you need to translate
> those offsets back to ones for the original string.

Hm... I guess you must be changing the lengths if the offsets differ. That seems odd; wouldn't you encounter performance issues when processing large documents?
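The round-trip Jerry describes (collapse each tag to a placeholder, match on the normalized text, then map match offsets back) can be sketched in C++, since that's his working language. This is a minimal illustration, not anyone's actual code: the names `Normalized` and `normalize`, the `'\x01'` placeholder, and the tag regex are all assumptions, and real HTML would need a proper parser rather than `<[^>]*>`.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Normalized text plus, for every character of it, the byte offset
// that character came from in the original string.
struct Normalized {
    std::string text;            // tag-free text with placeholders
    std::vector<size_t> origin;  // origin[i] = offset in the original
};

Normalized normalize(const std::string& original) {
    static const std::regex tag("<[^>]*>");  // naive tag matcher (assumption)
    Normalized out;
    size_t last = 0;
    for (auto it = std::sregex_iterator(original.begin(), original.end(), tag);
         it != std::sregex_iterator(); ++it) {
        // Copy the text before the tag, recording one origin per character.
        for (size_t i = last; i < static_cast<size_t>(it->position()); ++i) {
            out.text += original[i];
            out.origin.push_back(i);
        }
        // The whole tag collapses to a single placeholder character,
        // so regexes can span tags without seeing their contents.
        out.text += '\x01';
        out.origin.push_back(static_cast<size_t>(it->position()));
        last = static_cast<size_t>(it->position() + it->length());
    }
    for (size_t i = last; i < original.size(); ++i) {
        out.text += original[i];
        out.origin.push_back(i);
    }
    return out;
}
```

After running a regex over `n.text`, a match at position `p` translates back to the original string as `n.origin[p]`, which is where the offsets returned to the user would come from.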

> Everything is Unicode, of course, and we care about processing Unicode
> code points, but we want to maintain UTF-8 storage underneath to keep
> the size down.

> In reality, we're often doing things like single-character normalizations
> as well as larger spans, but we still need to maintain alignment to the
> original data.

> As long as this is reasonable to do, I'm fine. I just wasn't sure from
> the descriptions I was seeing.

What you will have is access to the underlying char[] array, which gives you full edit access. I just don't want strings to be easily editable, since editing them correctly can be very difficult.

Any offset to a dchar code point in the string will match the offset to its char code units. In effect, you are always indexing by code unit, even though the string type gives you code points back.
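That code-unit-offset convention can be illustrated with a tiny UTF-8 decoder in C++. `decodeAt` is a hypothetical helper, not an API from this discussion; it assumes `offset` points at the first byte of a well-formed code point, exactly the way the byte offsets above would.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode the code point starting at byte `offset` of a UTF-8 string,
// writing the offset of the next code point to *next.  Offsets index
// code units (bytes), while the returned value is the code point --
// mirroring the indexing scheme described above.  Assumes valid UTF-8
// with `offset` at a code-point boundary; no error handling.
uint32_t decodeAt(const std::string& s, size_t offset, size_t* next) {
    unsigned char b = static_cast<unsigned char>(s[offset]);
    // Sequence length from the lead byte: 0xxxxxxx, 110xxxxx,
    // 1110xxxx, or 11110xxx.
    size_t len = (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
    uint32_t cp = (len == 1) ? b : (b & (0x7F >> len));
    for (size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[offset + i]) & 0x3F);
    *next = offset + len;
    return cp;
}
```

So for "aéb" (bytes 61 C3 A9 62), the 'b' lives at byte offset 3, not character index 2: offsets skip over the middle of a multi-byte sequence rather than counting code points.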

It should be as simple as accessing a member (like str.data) or casting (e.g. cast(char[])str). I'm not sure yet whether it's dangerous enough to require a cast.

-Steve
