On Sep 11, 2013, at 12:52 PM, "Jonathan S. Shapiro" <[email protected]> wrote:
> On Tue, Sep 10, 2013 at 7:38 PM, Bennie Kloosteman <[email protected]> wrote:
>
>>>> Blowing 40-50% of heap space on 0x00 just leaves a bitter taste in my mouth.
>>>
>>> The data I've seen says it isn't that much. Do you have any actual data, or
>>> is this speculative?
>>
>> In between. You have posted some comments on how much string data is in
>> heaps, but this comment is for the apps I have built/used (mainly web/server
>> and some XAML). A lot of apps have 80% of their data in strings (you can
>> easily see this in .NET with a heap inspector); images are normally stored
>> outside the heap. How much numeric data do you have?
>
> At the moment, none. But back when we were designing the HaL architecture we
> collected a bunch of data. At that time, only 20% of heap data (overall) was
> string data.
>
> The catch is that the workloads have changed a lot since 1990, and I really
> don't know where things stand today. I would certainly expect to see a much
> higher proportion of the heap devoted to strings in current systems. Hell,
> the web didn't exist then. Gore hadn't even been elected yet. :-)
>
> Stepping out of the fray for a moment, it is my impression that there are
> very few applications that require random string indexing. If we're mainly
> concerned with sequential string indexing, then I think the encoding matters
> a lot less. In sequential processing, it comes down to two questions:
>
> • How expensive is "fetch next code point"? Which boils down to: how often
>   did the code point require a multibyte encoding? Given the amount of
>   western text that appears on Asian web pages, it seems to me that the
>   UTF-8 encoding may actually be the right answer.
> • What's the memory cost we have to pay to reduce the number of complicated
>   cases? The answer to that seems straightforward.
>
> I'll go further. My impression is that the pattern in most applications is
> to do a "string seek" at an initial random position followed by sequential
> traversal.
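Concretely, "fetch next code point" under UTF-8 is one branch in the ASCII fast path and one extra branch per continuation byte, which is why western-heavy text decodes cheaply. A minimal sketch, with illustrative names and simplified error handling (malformed input yields U+FFFD rather than a diagnostic):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at byte index *i, advancing *i past it.
 * ASCII (the common case) costs a single compare; each additional byte
 * of a multibyte sequence costs one more compare-and-accumulate.
 * Malformed or truncated sequences return U+FFFD (simplified handling). */
static uint32_t next_code_point(const uint8_t *s, size_t len, size_t *i)
{
    uint8_t b = s[(*i)++];
    if (b < 0x80)                                    /* 1 byte: U+0000..U+007F */
        return b;

    int extra;
    uint32_t cp;
    if      ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }  /* 2 bytes */
    else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }  /* 3 bytes */
    else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }  /* 4 bytes */
    else return 0xFFFD;                   /* stray continuation / invalid lead */

    while (extra-- > 0) {
        if (*i >= len || (s[*i] & 0xC0) != 0x80)
            return 0xFFFD;                /* truncated or bad continuation */
        cp = (cp << 6) | (s[(*i)++] & 0x3F);
    }
    return cp;
}
```

Sequential traversal is then just a loop over `next_code_point`; the per-character cost degrades gracefully with the fraction of multibyte characters rather than falling off a cliff.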
> In something like a regexp engine you may back-track, but the backtracked
> positions are cacheable. If I'm correct, then my sense is that an O(log s)
> lookup, where s is the number of "segments" in the string, is perfectly OK.
> And when that's the case, I think that a FancyString implementation
> consisting of a sequence of segments, where each segment has a homogeneous
> encoding, works fine. It would certainly be fine for things like XPath and
> XQuery, which would seem to be the major customers these days.
>
> Unfortunately, this is one of those cases where benchmarks matter, and the
> benchmarks are driven by applications that don't do international string
> processing correctly.

If you need better than O(log s) for large strings, you can do a hybrid
interpolation/binary search, which will be O(log log s) for nearly all
strings and O(log s) in the worst case (with an admittedly worse constant).

Geoffrey
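P.S. A sketch of the hybrid probe, in case it helps. Assumptions are mine, not from any real FancyString implementation: the segment table is a sorted array of byte offsets with `starts[0] == 0` and `starts[n]` equal to the total length. Alternating an interpolation probe with a bisection probe gives the O(log log s) behavior on uniform-ish segment sizes while the bisection steps bound the worst case at O(log s):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find the segment whose [starts[k], starts[k+1]) range contains pos,
 * given n segments described by n+1 sorted start offsets.
 * Invariant maintained: starts[lo] <= pos < starts[hi+1].
 * Interpolation guesses proportionally to the offset; every other
 * iteration falls back to bisection so adversarial segment-size
 * distributions cannot push us past O(log s) probes. */
static size_t find_segment(const uint64_t *starts, size_t n, uint64_t pos)
{
    size_t lo = 0, hi = n - 1;
    int use_interp = 1;
    while (lo < hi) {
        size_t probe;
        if (use_interp && starts[hi + 1] > starts[lo]) {
            /* interpolation step */
            uint64_t span = starts[hi + 1] - starts[lo];
            probe = lo + (size_t)(((pos - starts[lo]) * (uint64_t)(hi - lo)) / span);
        } else {
            /* bisection step */
            probe = lo + (hi - lo) / 2;
        }
        if (probe > hi) probe = hi;       /* defensive clamp */
        if (starts[probe + 1] <= pos)
            lo = probe + 1;               /* pos lies in a later segment */
        else if (starts[probe] > pos)
            hi = probe - 1;               /* pos lies in an earlier segment */
        else
            return probe;                 /* starts[probe] <= pos < starts[probe+1] */
        use_interp = !use_interp;
    }
    return lo;
}
```

The "string seek then sequential traversal" pattern then pays this cost once per seek, after which traversal proceeds within a segment at the segment's native encoding.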
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
