On Sun, Sep 8, 2013 at 2:38 PM, Jonathan S. Shapiro <[email protected]> wrote:
>> All true... However, a separate payload incurs double the cache-misses
>> on access...
>
> Actually, it usually doesn't. The touch in the object header is mandatory,
> and loading the indirection pointer can be done in the same cache
> reference. The touches to access the string payload are also mandatory.

If the payload is embedded, then the start of it is in the same cache line
as the object header and length, so a single cache-line load covers all
three. If you move the payload elsewhere in memory, a second cache line has
to be loaded for the start of the payload. Whether this matters depends on
string length: for short strings the extra load can dominate, while for
long strings it is amortized over the payload lines that have to be touched
anyway.

>> The major problem in my mind is having mountains of useful library code
>> suddenly become unusable when you decide the default UCS2 binary
>> representation is too much overhead for your application. It's easy
>> enough to make your own UTF-8 string (separate vs embedded payload
>> issues aside), however, it can't be worked on by the regex library,
>> despite being able to produce a compatible stream of "char".
>
> No. The major problem is that you're still operating on the wrong
> definition of "character", and therefore failing to acknowledge that
> changing between an 8 bit or a 16 bit representation of characters is
> addressing the wrong issue entirely.

I can do plenty of useful things with regex-on-bytes, regardless of the
string charset/encoding. However, I can't use it efficiently if the regex
library only takes the native string type, which is UCS2, and offers no
type parameter that would let me pass a byte stream instead.
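
To make the regex-on-bytes point concrete, here is a minimal sketch in
Python, whose `re` module happens to accept byte patterns as well as string
patterns. This is only an illustration of the idea, not a proposal for
BitC; the sample data and the names `PAIR`, `parse_pairs` and `data` are
made up for the example.

```python
import re

# Hypothetical pattern: "key=value" pairs separated by ';'.
# The rb"" prefix makes this a bytes pattern, so the engine matches
# over 8-bit units and never decodes the input.
PAIR = re.compile(rb"([a-z]+)=([^;]+)")


def parse_pairs(data: bytes) -> dict:
    """Split a UTF-8 byte stream into key/value pairs without decoding it."""
    return {key: value for key, value in PAIR.findall(data)}


# Hypothetical sample input, held as UTF-8 bytes rather than a native string.
data = "name=Zo\u00eb;id=42;city=Malm\u00f6".encode("utf-8")

pairs = parse_pairs(data)

# The multi-byte sequences pass through untouched; decode only where needed.
city = pairs[b"city"].decode("utf-8")

# The same engine refuses to mix unit types: a bytes pattern cannot be
# applied to a native string, which is the library-compatibility problem
# in miniature.
try:
    PAIR.search("id=42")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The delimiters are all ASCII, so the match is correct on UTF-8 input even
though the engine knows nothing about the encoding. What I am asking for is
the statically typed equivalent: one regex implementation parameterized
over the unit type, instead of one hard-wired to the native string.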
