Re: [bug-libunistring] _wordbreaks/_grapheme_breaks and break count?

Andrew Boling Tue, 02 Sep 2014 12:18:30 -0700

>
> I wrote the grapheme break functions.  It didn't occur to me that it would
> be useful to return anything, because usually the breakpoints are scanned
> to find good places to break, and usually those are pretty common.
>

It's probably not a common use case (otherwise someone would have said the
same thing about the _wordbreaks series already), but I'll elaborate a
little bit to help demonstrate an applicable scenario.

The strings my functions operate on are arrays in memory with associated
link counts. The original code used random access to perform string
manipulation, but that's not a valid approach when n_bytes != n_codepoints
(non-ASCII). The new approach I'm using is to pre-generate the grapheme
breaks when the string is instantiated (u8_grapheme_breaks). This way the
break positions are only calculated once across the life of that string.
Knowing the grapheme count is beneficial here, as an out-of-range operation
can be rejected immediately without an additional scan.
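
A minimal sketch of the second pass being discussed, assuming the break
array has the layout that u8_grapheme_breaks() produces (one char per
unit, nonzero where a cluster boundary falls immediately before that
unit, with entry 0 set for non-empty input). The names count_clusters,
cluster_offset and split_is_possible are hypothetical and not part of
libunistring; the rejection test mirrors the condition quoted further
down in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Count grapheme clusters by counting the nonzero entries of a break
   array (assumed layout: breaks[i] != 0 means a boundary before unit i). */
static size_t
count_clusters (const char *breaks, size_t n_units)
{
  assert (breaks != NULL || n_units == 0);
  size_t count = 0;
  for (size_t i = 0; i < n_units; i++)
    if (breaks[i])
      count++;
  return count;
}

/* Hypothetical helper: offset (in units) at which cluster number n
   (0-based) starts, or n_units if the string has fewer clusters. */
static size_t
cluster_offset (const char *breaks, size_t n_units, size_t n)
{
  size_t seen = 0;
  for (size_t i = 0; i < n_units; i++)
    if (breaks[i] && seen++ == n)
      return i;
  return n_units;
}

/* Rejection test: a split at cluster n is skipped entirely when
   n >= n_units or n >= n_clusters. */
static int
split_is_possible (size_t n, size_t n_units, size_t n_clusters)
{
  return !(n >= n_units || n >= n_clusters);
}
```

With the count cached at instantiation, split_is_possible() is a pair of
comparisons, and cluster_offset() only runs for requests that can succeed.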

If the string is modified, that instantiates a completely new string and
reduces the link count of the string that was operated on by one
(potentially freeing the old string and its associated grapheme breaks
array).
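
The lifetime described above could look roughly like the following
sketch. Everything here is an assumption made for illustration: the
struct layout, the gstring_* names, and the choice to pass in a
precomputed break array (which in practice would be filled by
u8_grapheme_breaks() at instantiation) are not taken from the actual
code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical immutable string with a link count and cached breaks. */
struct gstring
{
  size_t links;       /* number of holders */
  size_t n_units;     /* length in bytes */
  size_t n_clusters;  /* grapheme count, computed once */
  uint8_t *units;     /* UTF-8 data */
  char *breaks;       /* one flag per unit */
};

/* Copy the data and break flags, cache the cluster count, and start
   with a single link.  Returns NULL on allocation failure. */
static struct gstring *
gstring_new (const uint8_t *units, const char *breaks, size_t n_units)
{
  struct gstring *s = malloc (sizeof *s);
  if (s == NULL)
    return NULL;
  s->units = malloc (n_units ? n_units : 1);
  s->breaks = malloc (n_units ? n_units : 1);
  if (s->units == NULL || s->breaks == NULL)
    {
      free (s->units);
      free (s->breaks);
      free (s);
      return NULL;
    }
  if (n_units > 0)
    {
      memcpy (s->units, units, n_units);
      memcpy (s->breaks, breaks, n_units);
    }
  s->n_units = n_units;
  s->n_clusters = 0;
  for (size_t i = 0; i < n_units; i++)
    if (breaks[i])
      s->n_clusters++;
  s->links = 1;
  return s;
}

/* Drop one link.  Frees the string and its breaks array when the last
   link goes away; returns 1 in that case and 0 otherwise. */
static int
gstring_unlink (struct gstring *s)
{
  assert (s != NULL && s->links > 0);
  if (--s->links > 0)
    return 0;
  free (s->units);
  free (s->breaks);
  free (s);
  return 1;
}
```

Because a string is never modified in place, the cached count and break
array stay valid for as long as any link to the string remains.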

On Tue, Sep 2, 2014 at 2:09 PM, Ben Pfaff <[email protected]> wrote:

> On Mon, Sep 1, 2014 at 2:24 PM, Andrew Boling <[email protected]> wrote:
> > The _wordbreaks and _grapheme_breaks functions, while useful, currently
> > return void instead of the number of breaks written to the output array.
> > Is there a reason why it would be inappropriate to return the number of
> > breaks (or number of clusters) in this context? I'm not opposed to
> > scanning the result buffer to determine this information, but the second
> > pass strikes me as unnecessary.
>
> I wrote the grapheme break functions.  It didn't occur to me that it would
> be useful to return anything, because usually the breakpoints are scanned
> to find good places to break, and usually those are pretty common.
>
> > In my particular case I need to split strings at grapheme boundaries
> > based on user supplied integers, and it would make sense to skip the
> > operation entirely if (n >= array_units || n >= grapheme_clusters).
>
> I guess that if this is a common need (I do not really understand your
> application) then returning the number of breaks would make sense.
>
