On Tue, Sep 10, 2013 at 7:38 PM, Bennie Kloosteman <[email protected]> wrote:

>
>
>> Blowing 40-50% of heap space on 0x00 just leaves a bitter taste in my
>>> mouth.
>>>
>>
>> The data I've seen says it isn't that much. Do you have any actual data,
>> or is this speculative?
>>
>
> In between. You have posted some comments on how much string data is in
> heaps, but this comment is for the apps I have built/used (mainly web/server
> and some XAML). A lot of apps have 80% of their data in strings (you can
> easily see this in .NET with a heap inspector); images are normally stored
> outside the heap. How much numeric data do you have?
>

At the moment, none. But back when we were designing the HaL architecture
we collected a bunch of data. At that time, only 20% of heap data (overall)
was string data.

The catch is that the workloads have changed a lot since 1990, and I really
don't know where things stand today. I would certainly expect to see a much
higher proportion of the heap devoted to strings in current systems. Hell,
the web didn't exist then. Gore hadn't even been elected yet. :-)


Stepping out of the fray for a moment, it is my impression that there are *very
few* applications that require *random* string indexing. If we're *mainly*
concerned with sequential string indexing, then I think the encoding
matters a lot less. In sequential processing, it comes down to two
questions:


   1. How expensive is "fetch next code point"? Which boils down to: how
   often does a code point require a multibyte encoding? Given the amount of
   Western text that appears on Asian web pages, it seems to me that the UTF-8
   encoding may actually be the right answer.
   2. What's the memory cost we have to pay to reduce the number of
   complicated cases? The answer to that seems straightforward.
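For concreteness on question 1, here is a minimal sketch of what "fetch next
code point" costs under UTF-8 (the function name and error handling are
illustrative, not any particular library's API). The whole expense is one
branch on the lead byte, plus one shift-and-or per continuation byte; ASCII
is a single byte load:

```c
#include <stdint.h>
#include <stddef.h>

/* Decode the code point starting at s[*i] and advance *i past it.
 * Returns the code point, or U+FFFD on a malformed sequence.
 * The entire per-code-point cost is the branch on the lead byte:
 * one byte for ASCII, two to four bytes otherwise. */
static uint32_t next_code_point(const uint8_t *s, size_t len, size_t *i)
{
    uint8_t b = s[*i];
    size_t n;            /* total bytes in this sequence */
    uint32_t cp;

    if (b < 0x80)        { n = 1; cp = b; }          /* ASCII fast path */
    else if (b < 0xC0)   { (*i)++; return 0xFFFD; }  /* stray continuation */
    else if (b < 0xE0)   { n = 2; cp = b & 0x1F; }
    else if (b < 0xF0)   { n = 3; cp = b & 0x0F; }
    else                 { n = 4; cp = b & 0x07; }

    if (*i + n > len)    { (*i)++; return 0xFFFD; }  /* truncated sequence */

    for (size_t k = 1; k < n; k++) {
        uint8_t c = s[*i + k];
        if ((c & 0xC0) != 0x80) { (*i)++; return 0xFFFD; }
        cp = (cp << 6) | (c & 0x3F);                 /* fold in 6 bits */
    }
    *i += n;
    return cp;
}
```

So on mixed Western/Asian text, the common case stays on the one-byte or
two-byte arms of that branch, which is why sequential UTF-8 traversal is
cheaper than the worst case suggests.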

I'll go further. My impression is that the pattern in most applications is to
do a "string seek" to an initial random position followed by sequential
traversal. In something like a regexp engine you may backtrack, but the
backtracked positions are cacheable. If I'm correct, then my sense is that
an O(log s) seek, where s is the number of "segments" in the string, is
perfectly OK. And when that's the case, I think that a FancyString
implementation consisting of a sequence of segments, where each segment has
a homogeneous encoding, works fine. It would certainly be fine for things
like XPath and XQuery, which would seem to be the major customers these
days.
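A minimal sketch of that seek, assuming a hypothetical FancyString laid out
as an array of homogeneous fixed-width segments (all the names here are
illustrative, not a real implementation): the random seek is a binary search
over per-segment starting indices, which is the O(log s) step, and everything
after it is constant-width sequential traversal inside one segment:

```c
#include <stddef.h>
#include <stdint.h>

/* One homogeneous segment of a hypothetical "FancyString".
 * Because the width is fixed within a segment, indexing inside a
 * segment is O(1); only finding the segment costs anything. */
struct segment {
    const uint8_t *bytes;   /* segment payload */
    size_t width;           /* bytes per code point: 1, 2, or 4 */
    size_t start;           /* index of first code point in this segment */
    size_t count;           /* code points in this segment */
};

/* Binary search for the segment containing code-point index `pos`.
 * O(log s) in the number of segments; `pos` must be in range. */
static const struct segment *
seek(const struct segment *segs, size_t nsegs, size_t pos)
{
    size_t lo = 0, hi = nsegs;
    while (lo + 1 < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (segs[mid].start <= pos)
            lo = mid;
        else
            hi = mid;
    }
    return &segs[lo];
}
```

Once `seek` lands in a segment, the code point lives at
`bytes + (pos - start) * width`, and forward traversal within the segment is
just pointer arithmetic, which is the cheap sequential case above.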

Unfortunately, this is one of those cases where benchmarks matter, and the
benchmarks are driven by applications that don't do international string
processing correctly.


shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev