On Tue, Sep 10, 2013 at 7:38 PM, Bennie Kloosteman <[email protected]> wrote:
>>> Blowing 40-50% of heap space on 0x00 just leaves a bitter taste
>>> in my mouth.
>>
>> The data I've seen says it isn't that much. Do you have any actual
>> data, or is this speculative?
>
> In between. You have posted some comments on how much string data
> is in heaps, but this comment is for the apps I have built/used
> (mainly web/server and some XAML). A lot of apps have 80% of their
> data in strings (you can easily see this in .NET with a heap
> inspector); images are normally stored outside the heap. How much
> numeric data do you have?

At the moment, none. But back when we were designing the HaL
architecture we collected a bunch of data. At that time, only 20% of
heap data (overall) was string data. The catch is that workloads have
changed a lot since 1990, and I really don't know where things stand
today. I would certainly expect to see a much higher proportion of the
heap devoted to strings in current systems. Hell, the web didn't exist
then. Gore hadn't even been elected yet. :-)

Stepping out of the fray for a moment, it is my impression that there
are *very few* applications that require *random* string indexing. If
we're mainly concerned with sequential string indexing, then I think
the encoding matters a lot less. In sequential processing, it comes
down to two questions:

1. How expensive is "fetch next code point"? Which boils down to: how
often does a code point require a multibyte encoding? Given the amount
of western text that appears on Asian web pages, it seems to me that
the UTF-8 encoding may actually be the right answer. (A sketch of such
a decode loop appears below.)

2. What's the memory cost we have to pay to reduce the number of
complicated cases? The answer to that seems straightforward.

I'll go further. My impression is that the pattern in most
applications is to do a "string seek" to an initial random position,
followed by sequential traversal. In something like a regexp engine
you may backtrack, but the backtracked positions are cacheable. If I'm
correct, then my sense is that an O(log s) seek, where s is the number
of "segments" in the string, is perfectly OK. And when that's the
case, I think that a FancyString implementation consisting of a
sequence of segments, where each segment has a homogeneous encoding,
works fine (a sketch of that follows below as well). It would
certainly be fine for things like XPath and XQuery, which would seem
to be the major customers these days.

Unfortunately, this is one of those cases where benchmarks matter, and
the benchmarks are driven by applications that don't do international
string processing correctly.

shap
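A minimal sketch, in C, of what question 1's "fetch next code point"
loop looks like over UTF-8. The name utf8_next is purely illustrative,
not from any particular library; malformed sequences are mapped to
U+FFFD, and overlong/surrogate rejection is omitted for brevity:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Decode the code point at *pos, advancing *pos past it.
     * Malformed input yields U+FFFD and advances one byte. */
    static uint32_t utf8_next(const unsigned char *s, size_t len,
                              size_t *pos)
    {
      unsigned char b0 = s[(*pos)++];
      if (b0 < 0x80)
        return b0;                    /* 1-byte ASCII fast path */

      int extra; uint32_t cp;
      if      ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
      else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
      else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
      else return 0xFFFD;             /* stray continuation byte */

      if (*pos + extra > len)
        return 0xFFFD;                /* truncated sequence */

      for (int i = 0; i < extra; i++) {
        unsigned char b = s[*pos];
        if ((b & 0xC0) != 0x80)
          return 0xFFFD;              /* missing continuation byte */
        cp = (cp << 6) | (b & 0x3F);
        (*pos)++;
      }
      return cp;
    }

    int main(void)
    {
      /* "hi " plus two CJK code points at 3 bytes each. */
      const unsigned char text[] = "hi \xE4\xBD\xA0\xE5\xA5\xBD";
      size_t len = sizeof(text) - 1;  /* drop the trailing NUL */
      size_t pos = 0, n = 0;
      while (pos < len) {
        printf("U+%04X\n", (unsigned)utf8_next(text, len, &pos));
        n++;
      }
      printf("%zu code points in %zu bytes\n", n, len);
      return 0;
    }

The cost argument is visible in the branch structure: the ASCII case
is a single compare, so western-heavy text pays almost nothing for the
variable-width encoding.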
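And a sketch, under the same caveats, of the segmented FancyString
idea: each segment carries one homogeneous encoding, and a prefix sum
of per-segment code-point counts supports the O(log s) seek by binary
search. All names here are hypothetical:

    #include <stddef.h>
    #include <stdio.h>

    typedef enum { SEG_LATIN1, SEG_UTF16, SEG_UTF32 } seg_encoding;

    typedef struct {
      seg_encoding enc;     /* homogeneous encoding of this segment */
      const void  *bytes;   /* payload (unused in this sketch) */
      size_t       cum_cps; /* code points in segments 0..this one,
                               inclusive (prefix sum) */
    } segment;

    typedef struct {
      const segment *segs;
      size_t         n_segs;
    } fancy_string;

    /* Binary search for the segment containing code-point index idx;
     * writes the offset within that segment and returns the segment,
     * or NULL if idx is out of range. */
    static const segment *fs_seek(const fancy_string *fs, size_t idx,
                                  size_t *off)
    {
      size_t lo = 0, hi = fs->n_segs;
      while (lo < hi) {               /* first seg with cum_cps > idx */
        size_t mid = lo + (hi - lo) / 2;
        if (fs->segs[mid].cum_cps <= idx)
          lo = mid + 1;
        else
          hi = mid;
      }
      if (lo == fs->n_segs)
        return NULL;
      *off = idx - (lo ? fs->segs[lo - 1].cum_cps : 0);
      return &fs->segs[lo];
    }

    int main(void)
    {
      /* 10 Latin-1 code points, then 5 UTF-32, then 20 Latin-1. */
      const segment segs[] = {
        { SEG_LATIN1, NULL, 10 },
        { SEG_UTF32,  NULL, 15 },
        { SEG_LATIN1, NULL, 35 },
      };
      const fancy_string fs = { segs, 3 };

      size_t off;
      const segment *seg = fs_seek(&fs, 12, &off);
      if (seg)
        printf("code point 12: segment %ld, offset %zu\n",
               (long)(seg - segs), off);
      return 0;
    }

After the seek, traversal is sequential within a segment: a
fixed-width segment (Latin-1, UTF-32) gives O(1) offset arithmetic,
and a variable-width one needs only a bounded forward scan, which
matches the seek-then-scan pattern described above.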
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
