On Fri, 03 Dec 2010 22:08:37 -0500, Jerry Quinn <[email protected]> wrote:

> I'm actually working in C++ but keeping an eye on things going on
> in D-land.  The kind of stuff we do is to normalize text in preparation
> for natural language processing.

> As a simple example, let's say you want to use a set of regexes to identify
> patterns in text. You want to return the offset of each match. However,
> before the regexes run, you replace all HTML tags with a placeholder, so
> patterns can easily span tags without worrying about their contents.

I'm assuming you are not changing the length of the string, or is that not correct?

> Before you return the results to the user, though, you need to translate
> those offsets back to ones for the original string.

Hm... I guess you must be changing the lengths if the offsets differ. That seems odd; wouldn't you encounter performance issues when processing large documents?
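The round-trip Jerry describes (collapse each tag to a placeholder, match on the normalized text, then map match offsets back) can be sketched in C++, since that's his working language. This is a minimal illustration, not anyone's actual code: the names `Normalized` and `normalize`, the `'\x01'` placeholder, and the tag regex are all assumptions, and real HTML would need a proper parser rather than `<[^>]*>`.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Normalized text plus, for every character of it, the byte offset
// that character came from in the original string.
struct Normalized {
    std::string text;            // tag-free text with placeholders
    std::vector<size_t> origin;  // origin[i] = offset in the original
};

Normalized normalize(const std::string& original) {
    static const std::regex tag("<[^>]*>");  // naive tag matcher (assumption)
    Normalized out;
    size_t last = 0;
    for (auto it = std::sregex_iterator(original.begin(), original.end(), tag);
         it != std::sregex_iterator(); ++it) {
        // Copy the text before the tag, recording one origin per character.
        for (size_t i = last; i < static_cast<size_t>(it->position()); ++i) {
            out.text += original[i];
            out.origin.push_back(i);
        }
        // The whole tag collapses to a single placeholder character,
        // so regexes can span tags without seeing their contents.
        out.text += '\x01';
        out.origin.push_back(static_cast<size_t>(it->position()));
        last = static_cast<size_t>(it->position() + it->length());
    }
    for (size_t i = last; i < original.size(); ++i) {
        out.text += original[i];
        out.origin.push_back(i);
    }
    return out;
}
```

After running a regex over `n.text`, a match at position `p` translates back to the original string as `n.origin[p]`, which is where the offsets returned to the user would come from.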

> Everything is Unicode, of course, and we care about processing Unicode
> code points, but we want to maintain UTF-8 storage underneath to keep
> the size down.

> In reality, we're often doing things like single-character normalizations
> as well as larger spans, but we still need to maintain alignment to the
> original data.

> As long as this is reasonable to do, I'm fine. I just wasn't sure from
> the descriptions I was seeing.

What you will have is access to the underlying char[] array, which gives you full edit access. I just don't want strings to be easily editable, since editing them correctly can be very difficult.

Any offset to a dchar code point in the string will match the offset to its char code units. In effect, you are always indexing by code unit, even though the string type gives you code points back.
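That code-unit-offset convention can be illustrated with a tiny UTF-8 decoder in C++. `decodeAt` is a hypothetical helper, not an API from this discussion; it assumes `offset` points at the first byte of a well-formed code point, exactly the way the byte offsets above would.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode the code point starting at byte `offset` of a UTF-8 string,
// writing the offset of the next code point to *next.  Offsets index
// code units (bytes), while the returned value is the code point --
// mirroring the indexing scheme described above.  Assumes valid UTF-8
// with `offset` at a code-point boundary; no error handling.
uint32_t decodeAt(const std::string& s, size_t offset, size_t* next) {
    unsigned char b = static_cast<unsigned char>(s[offset]);
    // Sequence length from the lead byte: 0xxxxxxx, 110xxxxx,
    // 1110xxxx, or 11110xxx.
    size_t len = (b < 0x80) ? 1 : (b < 0xE0) ? 2 : (b < 0xF0) ? 3 : 4;
    uint32_t cp = (len == 1) ? b : (b & (0x7F >> len));
    for (size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (static_cast<unsigned char>(s[offset + i]) & 0x3F);
    *next = offset + len;
    return cp;
}
```

So for "aéb" (bytes 61 C3 A9 62), the 'b' lives at byte offset 3, not character index 2: offsets skip over the middle of a multi-byte sequence rather than counting code points.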

It should be as simple as accessing a member (like str.data) or casting (e.g. cast(char[])str). I'm not sure yet whether it's dangerous enough to require a cast.

-Steve
