>
>
>> At the moment, none. But back when we were designing the HaL architecture
> we collected a bunch of data. At that time, only 20% of heap data (overall)
> was string data.
>
> The catch is that the workloads have changed a lot since 1990, and I
> really don't know where things stand today. I would certainly expect to see
> a much higher proportion of the heap devoted to strings in current systems.
> Hell, the web didn't exist then. Gore hadn't even been elected yet. :-)
>
It has certainly changed (except for 3D games and a few other things).
Heap data is integer data, references, strings, and structure overhead
(e.g. object headers). byte[] data is rarer than before, and in C# and
Java it may be much lower still, as the byte[] stays in native heaps with
only blocks being processed. How much integer data is there, especially
when each int consumes just a couple of characters?
It's not just web data, though; ASCII is everywhere: source code, XML,
PostScript files, etc.
>
> Stepping out of the fray for a moment, it is my impression that there are
> *very few* applications that require *random* string indexing. If we're
> mainly
> concerned with sequential string indexing, then I think the encoding
> matters a lot less. In sequential processing, it comes down to two
> questions:
>
>
> 1. How expensive is "fetch next code point"? Which boils down to: how
> often did the code point require a multibyte encoding. Given the amount of
> western text that appears on asian web pages, it seems to me that the utf-8
> encoding may actually be the right answer.
> 2. What's the memory cost we have to pay to reduce the number of
> complicated cases? The answer to that seems straightforward.
>
> I'll go further. My impression is the pattern in most applications is to
> do a "string seek" at an initial random position followed by sequential
> traversal. In something like a regexp engine you may back-track, but the
> backtracked positions are cacheable. IF I'm correct, then my sense is that
> an O(log s), where s is the number of "segments" in the string, is
> perfectly OK. And when that's the case, I think that a FancyString
> implementation consisting of a sequence of segments, where each segment has
> homogeneous encoding, works fine. It would certainly be fine for things
> like XPath and XQuery, which would seem to be the major customers these
> days.
>
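For concreteness, question 1's "fetch next code point" is just the
standard sequential UTF-8 decode loop; a minimal sketch in C# (assuming
well-formed input, no error handling), where the ASCII fast path is a
single compare:

// Standard sequential UTF-8 decode: returns the next code point and
// advances pos past however many bytes encoded it.
static int NextCodePoint(byte[] buf, ref int pos)
{
    byte b = buf[pos++];
    if (b < 0x80) return b;                                    // 1 byte, ASCII
    int extra, cp;
    if ((b & 0xE0) == 0xC0)      { extra = 1; cp = b & 0x1F; } // 2 bytes
    else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; } // 3 bytes
    else                         { extra = 3; cp = b & 0x07; } // 4 bytes
    while (extra-- > 0)
        cp = (cp << 6) | (buf[pos++] & 0x3F);                  // continuation bytes
    return cp;
}

So the per-character cost is one branch for western text and a short loop
for multibyte sequences, which is what makes the sequential case cheap.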
I don't mind the cord implementations, but to me they need benchmarks
before becoming the default type, because of the reference/next-pointer
cost and the multiple length fields. I'm pretty convinced UTF-8 with
slices will give better performance than C# strings in, say, 4 out of 5
microbenchmarks, plus significant whole-application benefits. The
reference/type-check cost can grow, and for runtime variable-length data
you're relying on a string underneath (on the CLR).
Still, compare parsing a web page or XML: with UTF-8 source and nodes
created with embedded slices, the allocation cost is just the nodes; with
a corded system you have a lot more complex string creation. How often do
you have large strings? In C#, large strings tend to be byte[], e.g. a
UTF-8 web page. Most large amounts of string data occur in user-designed
structures: nodes, document structure, etc.
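To make the node-with-embedded-slices point concrete, a rough sketch
(Slice and XmlNode here are hypothetical types, not anything that exists
in the CLR): the parser allocates only node objects, and every name and
value is just a view into the one UTF-8 byte[] the document arrived in.

// Hypothetical slice: a (buffer, offset, length) view, no copying.
struct Slice
{
    public byte[] Buffer;   // the original UTF-8 page/document bytes
    public int Offset;
    public int Length;
}

// A parsed node holds slices, so parsing allocates just the nodes; the
// character data stays where it was received.
class XmlNode
{
    public Slice Name;
    public Slice Value;
    public XmlNode[] Children;
}

A corded system, by contrast, has to build a string object per name and
value before the node can point at it.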
That said, knowing a slice is ASCII allows great optimization, and most
strings are small and will be a single encoding. For larger strings it's
less of an issue; just occasionally the next fetch will follow a reference
and drop into a different code loop.
Another possibility is that with ubiquitous use of slices we may not need
variable-length string creation at all: strings are often built up in
char[] buffers, so just hand out a slice and pass that around. The other
frequent case is things like Trim(), SubStr(), etc., and again a slice
deals with nearly all of these. Not sure if it covers all cases.
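As a sketch of how far slices go for those cases (again a hypothetical
Slice type, ASCII-only Trim shown for simplicity): Trim() and SubStr()
are just arithmetic on the offset and length, with no allocation beyond
the slice value itself.

struct Slice
{
    public byte[] Buffer;
    public int Offset;
    public int Length;

    // SubStr is a narrower view of the same buffer.
    public Slice SubStr(int start, int length)
        => new Slice { Buffer = Buffer, Offset = Offset + start, Length = length };

    // Trim of ASCII spaces just moves the bounds inward; nothing is copied.
    public Slice Trim()
    {
        int lo = Offset, hi = Offset + Length;
        while (lo < hi && Buffer[lo] == (byte)' ') lo++;
        while (hi > lo && Buffer[hi - 1] == (byte)' ') hi--;
        return new Slice { Buffer = Buffer, Offset = lo, Length = hi - lo };
    }
}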
>
> Unfortunately, this is one of those cases where benchmarks matter, and the
> benchmarks are driven by applications that don't do international string
> processing correctly.
>
Agreed; these projects live and die by arbitrary benchmarks, hence my
obsession with the runtime string implementation.
Hmm, why do we even need "string"? Why not separate the logic from the
storage: the data could be char[], fixed char[], string, byte[],
cordString, or a native byte[] with dispose, and a slice would provide
the uniform string operations. Obviously the motivation here is unsafe
interop and reducing the huge number of small strings that get created,
but it would also make the API less dependent on storage, so you can
benchmark and change it. In addition, interop with existing C# code
would be easy.
Is this bad (maybe with some better sugar)?
Slice name = myObj->Name;            // with an implicit cast
if (Slice.IsEmpty(myObj->Name))
if (myObj->Name.Contains("tes"))     // "tes" is a slice, so implicit
    // conversion; is it smart enough to convert Name to a slice to find
    // the Contains method, which C# can't?
if (myObj->Name == "tes")
but
myObj->Name.Trim()                   // illegal
becomes
Slice name = myObj->Name;
var trimmedName = name.Trim();
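For what it's worth, the implicit conversion the sketch relies on can be
written in C# today (Slice still being a hypothetical type); a real
implementation would presumably intern the literal's bytes rather than
re-encode on every call, but the shape is:

struct Slice
{
    public byte[] Buffer;
    public int Offset;
    public int Length;

    public static bool IsEmpty(Slice s) => s.Length == 0;

    // Lets a string literal like "tes" be used wherever a Slice is expected.
    public static implicit operator Slice(string s)
    {
        var bytes = System.Text.Encoding.UTF8.GetBytes(s);
        return new Slice { Buffer = bytes, Offset = 0, Length = bytes.Length };
    }
}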
Has anyone seen an API that used string slices almost exclusively?
So we incur the cost of a value type (the slice), both in terms of
creation and pass-by-value, plus some confusion/learning on the part of
developers. We gain flexibility, non-nullable "strings" (since the slice
will be unboxed), great native interop (and don't underestimate this),
far less copying, less allocation and hence improved memory performance
for string-heavy work, and we keep the C# libraries' "string" from
becoming our de facto string.
Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev