On Sun, Sep 8, 2013 at 2:38 PM, Jonathan S. Shapiro <[email protected]> wrote:

>
>> All true... However, a separate payload incurs double the cache-misses on
>> access...
>>
>
> Actually, it usually doesn't. The touch in the object header is mandatory,
> and loading the indirection pointer can be done in the same cache
> reference. The touches to access the string payload are also mandatory.
>

If the payload is embedded, the start of it is in the same cache line as
the object header and length, so a single cache-line load covers both. If
you move the payload elsewhere in memory, a second cache line has to be
loaded for the start of the payload. Whether this matters depends on the
string length: the extra load is a large fraction of the cost for a short
string and a small one for a long string.
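
A rough sketch of that counting argument follows. The 64-byte cache line,
the 16-byte header (object header plus length, and in the separate case the
payload pointer) and the addresses are assumptions picked for illustration,
not figures from this thread.

```python
CACHE_LINE = 64   # assumed cache-line size in bytes
HEADER_SIZE = 16  # assumed size of object header + length (+ payload pointer)


def lines_touched(ranges, line_size=CACHE_LINE):
    """Count distinct cache lines covered by a list of (address, size) ranges."""
    lines = set()
    for addr, size in ranges:
        if size <= 0:
            continue
        first = addr // line_size
        last = (addr + size - 1) // line_size
        lines.update(range(first, last + 1))
    return len(lines)


def embedded(obj_addr, read_len):
    # Header and payload are contiguous, so they form one range.
    return lines_touched([(obj_addr, HEADER_SIZE + read_len)])


def separate(obj_addr, payload_addr, read_len):
    # Header and payload live at unrelated addresses.
    return lines_touched([(obj_addr, HEADER_SIZE), (payload_addr, read_len)])


short_embedded = embedded(0, 8)          # 8-byte string, embedded payload
short_separate = separate(0, 4096, 8)    # same string, payload elsewhere
long_embedded = embedded(0, 1000)        # 1000-byte string, embedded payload
long_separate = separate(0, 4096, 1000)  # same string, payload elsewhere
```

Under these assumptions the short string touches 1 line embedded versus 2
separate, while the long string touches 16 versus 17, which is why the
string length decides whether the extra load is relevant.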

>>> The major problem in my mind is having mountains of useful library code
>>> suddenly become unusable when you decide the default UCS2 binary
>>> representation is too much overhead for your application. It's easy enough
>>> to make your own UTF-8 string (separate vs embedded payload issues aside),
>>> however, it can't be worked on by the regex library, despite being able to
>>> produce a compatible stream of "char".
>>>
>>
> No. The major problem is that you're still operating on the wrong
> definition of "character", and therefore failing to acknowledge that
> changing between an 8 bit or a 16 bit representation of characters is
> addressing the wrong issue entirely.
>

I can do plenty of useful things with regex-on-bytes, regardless of the
string charset/encoding. However, I can't use it efficiently if the regex
library takes only the native string type, which is UCS2, and has no type
parameter that would let me use a byte stream instead.
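
To illustrate what I mean, here is a small sketch using Python's `re`
module, which accepts either text or bytes patterns. It is not the library
under discussion and is only an analogy; the helper `find_value` and the
sample data are made up for the example.

```python
import re


def find_value(haystack, key):
    """Return what lies between 'key=' and the next ';' for str or bytes input.

    `key` must have the same type as `haystack` (both str or both bytes).
    """
    if isinstance(haystack, bytes):
        pattern = re.escape(key) + rb"=([^;]*);"
    else:
        pattern = re.escape(key) + r"=([^;]*);"
    match = re.search(pattern, haystack)
    return match.group(1) if match else None


text = "caf\u00e9; key=v\u00e4rde; other=42"
utf8 = text.encode("utf-8")

from_text = find_value(text, "key")    # matches on decoded characters
from_bytes = find_value(utf8, b"key")  # matches on raw UTF-8 bytes, no decoding
```

The byte-level match works here because the delimiters are ASCII and UTF-8
never uses ASCII byte values inside a multi-byte sequence, so the same
pattern logic runs over the compact representation without converting it to
the native string type first.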
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
