On Sep 11, 2013, at 12:52 PM, "Jonathan S. Shapiro" <[email protected]> wrote:

> On Tue, Sep 10, 2013 at 7:38 PM, Bennie Kloosteman <[email protected]> wrote:
>  
> Blowing 40-50% of heap space on 0x00 just leaves a bitter taste in my mouth.
> 
> The data I've seen says it isn't that much. Do you have any actual data, or 
> is this speculative? 
> 
> In between. You have posted some comments on how much string data is in 
> heaps, but this comment is for the apps I have built or used (mainly web/
> server and some XAML). A lot of apps have 80% of their data in strings (you 
> can easily see this in .NET with a heap inspector); images are normally 
> stored outside the heap. How much numeric data do you have?
> 
> At the moment, none. But back when we were designing the HaL architecture we 
> collected a bunch of data. At that time, only 20% of heap data (overall) was 
> string data.
> 
> The catch is that the workloads have changed a lot since 1990, and I really 
> don't know where things stand today. I would certainly expect to see a much 
> higher proportion of the heap devoted to strings in current systems. Hell, 
> the web didn't exist then. Gore hadn't even been elected yet. :-)
> 
> 
> Stepping out of the fray for a moment, it is my impression that there are 
> very few applications that require random string indexing. If we're mainly 
> concerned with sequential string indexing, then I think the encoding matters 
> a lot less. In sequential processing, it comes down to two questions:
> 
>       • How expensive is "fetch next code point"? That boils down to how 
> often a code point requires a multibyte encoding. Given the amount of 
> Western text that appears on Asian web pages, it seems to me that the UTF-8 
> encoding may actually be the right answer (see the decoder sketch after 
> this list).
>       • What's the memory cost we have to pay to reduce the number of 
> complicated cases? The answer to that seems straightforward.
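
To make the cost of the first question concrete, here is a minimal sketch of 
a UTF-8 "fetch next code point" in C. This is a generic decoder, not code 
from any particular tree, and it does only minimal validation:

#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at s into *cp and return the number
 * of bytes consumed (1-4), or 0 on a byte that cannot start a
 * sequence.  Validation is deliberately minimal: overlong forms and
 * bad continuation bytes are not checked. */
static size_t utf8_next(const uint8_t *s, uint32_t *cp)
{
    if (s[0] < 0x80) {                   /* 1 byte: the ASCII fast path */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {         /* 2 bytes */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {         /* 3 bytes */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            |  (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0) {         /* 4 bytes */
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6)
            |  (uint32_t)(s[3] & 0x3F);
        return 4;
    }
    return 0;                            /* stray continuation byte */
}

The one-byte case is a single compare and load; the multibyte cases cost a 
few shifts and masks each. If the Western-text observation holds, the fast 
path dominates even on Asian-language pages.
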
> I'll go further. My impression is the pattern in most applications is to do a 
> "string seek" at an initial random position followed by sequential traversal. 
> In something like a regexp engine you may back-track, but the backtracked 
> positions are cacheable. If I'm correct, then my sense is that an O(log s) seek, 
> where s is the number of "segments" in the string, is perfectly OK. And when 
> that's the case, I think that a FancyString implementation consisting of a 
> sequence of segments, where each segment has homogeneous encoding, works 
> fine. It would certainly be fine for things like XPath and XQuery, which 
> would seem to be the major customers these days.
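
A minimal sketch of the segment seek described above, assuming a hypothetical 
segment record sorted by the code-point index at which each segment starts. 
The names and layout are illustrative, not an actual FancyString 
implementation:

#include <stddef.h>

/* Hypothetical FancyString layout: a table of segments sorted by
 * the code-point index at which each segment starts, each segment
 * carrying a single (homogeneous) encoding. */
typedef enum { ENC_ASCII, ENC_UTF8, ENC_UTF16 } seg_enc;

typedef struct {
    size_t      first;  /* index of this segment's first code point */
    seg_enc     enc;    /* one encoding per segment */
    const void *bytes;  /* segment payload */
} segment;

/* Classic binary search for the segment containing code-point index
 * pos: O(log s) in the number of segments.  Assumes nsegs >= 1 and
 * pos is a valid index (so segs[0].first <= pos). */
static const segment *seg_find(const segment *segs, size_t nsegs, size_t pos)
{
    size_t lo = 0, hi = nsegs;          /* half-open range [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (segs[mid].first <= pos)
            lo = mid;                   /* pos is at or after segs[mid] */
        else
            hi = mid;                   /* pos is before segs[mid] */
    }
    return &segs[lo];
}

Each probe touches only per-segment metadata, so the cost is O(log s) 
regardless of the encodings inside the segments; sequential traversal then 
proceeds within a single homogeneous segment until it crosses a boundary.
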
> 
> Unfortunately, this is one of those cases where benchmarks matter, and the 
> benchmarks are driven by applications that don't do international string 
> processing correctly.

If you need better than O(log s) for large strings, you can do hybrid 
interpolation/binary search, which will be O(log log s) for nearly all strings 
and O(log s) worst case (with an admittedly worse constant).
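
Concretely, one easy way to get that bound is to alternate interpolation 
probes with plain bisection probes, so a pathological segment-size 
distribution can at worst roughly double the bisection step count. A sketch 
against the same hypothetical segment table as above:

#include <stddef.h>

/* Same hypothetical segment table as in the earlier sketch:
 * sorted by the code-point index of each segment's first character. */
typedef struct {
    size_t first;   /* code-point index where this segment starts */
    /* encoding tag and payload omitted; only 'first' matters here */
} segment;

/* Hybrid seek: interpolation probes give O(log log s) when segment
 * lengths are reasonably uniform; alternating with bisection probes
 * caps the worst case at O(log s), at roughly twice the constant.
 * Assumes nsegs >= 1 and segs[0].first <= pos. */
static const segment *seg_find_hybrid(const segment *segs, size_t nsegs,
                                      size_t pos)
{
    size_t lo = 0, hi = nsegs - 1;      /* answer lies in [lo, hi] */
    int use_interp = 1;

    while (lo < hi) {
        size_t span = segs[hi].first - segs[lo].first;
        size_t mid;

        if (use_interp && span > 0)     /* interpolated guess */
            mid = lo + (size_t)((double)(pos - segs[lo].first)
                                * (hi - lo) / span);
        else                            /* bisection step */
            mid = lo + (hi - lo) / 2;

        if (mid <= lo) mid = lo + 1;    /* force progress */
        if (mid > hi)  mid = hi;        /* keep the probe in range */

        if (segs[mid].first <= pos)
            lo = mid;                   /* answer is at or right of mid */
        else
            hi = mid - 1;               /* answer is left of mid */

        use_interp = !use_interp;
    }
    return &segs[lo];
}

Whether the extra divide per probe pays for itself depends on how large s 
actually gets; for small segment counts, plain bisection will win on the 
constant.
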

Geoffrey
