On string data types

Iustin Pop Sun, 05 Jun 2016 01:32:44 -0700

Hi all,

As discussed previously, I took a look at the feasibility of changing
UUID references to bytestrings or short bytestrings.  While doable,
there are two problems with this.


The first is that we don't have performance/load benchmarks that would
tell us how the overall behaviour changes after such a change.  I've
tried to use the unit tests (test/hs/htest) as a proxy, but they don't
give any useful measurements (the difference is ± a few percent).  The
individual benchmarks (e.g. getNodeInstances) do show significant
improvement, but how relevant are these?

The second one is that there are two kinds of inefficiencies: basic
costs (String uses more memory than the alternatives) and conversion
costs (e.g. needing to go back to String from ByteString).  While basic
costs are, or can be, high, a conversion cost in a tight loop can be
even higher (which is why getNodeInstances is very slow right now).  So
whatever data type change we make, we should make sure that the switch
is complete, so that as far as possible we don't pay any conversion
costs at runtime.
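To illustrate the conversion cost in a loop, here is a minimal sketch.
The Instance record, its field names and both functions are
hypothetical stand-ins, not the actual Ganeti code: the first variant
unpacks the stored ByteString on every comparison, the second keeps
both sides in the same representation.

```haskell
import qualified Data.ByteString.Char8 as BC

-- Hypothetical stand-ins for illustration, not the real Ganeti types.
type NodeUuid = BC.ByteString

data Instance = Instance
  { instName    :: String
  , instPrimary :: NodeUuid  -- primary node reference, stored as ByteString
  }

-- Mixed representation: the caller still holds a String, so every
-- element of the list pays for an unpack back to String.
instancesOfSlow :: String -> [Instance] -> [String]
instancesOfSlow node insts =
  [ instName i | i <- insts, BC.unpack (instPrimary i) == node ]

-- Uniform representation: the key is already a ByteString, so the
-- loop body is a plain ByteString comparison with no conversion.
instancesOfFast :: NodeUuid -> [Instance] -> [String]
instancesOfFast node insts =
  [ instName i | i <- insts, instPrimary i == node ]

-- Small sample data set used to compare the two variants.
sample :: [Instance]
sample =
  [ Instance "inst1" (BC.pack "node-a")
  , Instance "inst2" (BC.pack "node-b")
  , Instance "inst3" (BC.pack "node-a")
  ]
```

Both variants return the same result; the difference is only where the
conversion happens (once per element versus not at all).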

Given both of these, I think that fixing stable 2.16 is risky.  A
better approach would be to forgo any current release (2.17 seems to be
in beta and has a stable branch, so it is not feasible either), and
focus on large-scale changes in master:

- switch parsing from String+JSON to ByteString+Aeson; this should
  give nontrivial speed and memory improvements
- decide on whether to use a single data type for both UUIDs and object
  names (e.g. Text) or use a split model (ByteStrings vs. Text or
  ShortByteString vs. Text)
- convert all object fields according to the above
- convert all internal data paths to not use String anymore
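
As a rough sketch of what the split model could look like (all type
and function names below are made up for illustration and are not
existing Ganeti definitions), UUIDs and names get distinct newtypes so
that they cannot be mixed up, and String only appears at the edges:

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Short as SBS
import qualified Data.Text as T

-- Hypothetical wrappers: UUIDs are ASCII, so a compact
-- ShortByteString is enough; object names are kept as Text.
newtype Uuid = Uuid SBS.ShortByteString
  deriving (Eq, Ord, Show)

newtype ObjName = ObjName T.Text
  deriving (Eq, Ord, Show)

-- Conversions from/to String, meant to be used only at the edges
-- (parsing, printing).  BC.pack truncates characters to 8 bits,
-- which is fine under the assumption that UUIDs are ASCII.
mkUuid :: String -> Uuid
mkUuid = Uuid . SBS.toShort . BC.pack

uuidToString :: Uuid -> String
uuidToString (Uuid s) = BC.unpack (SBS.fromShort s)

mkObjName :: String -> ObjName
mkObjName = ObjName . T.pack

objNameToString :: ObjName -> String
objNameToString (ObjName t) = T.unpack t
```

With the single-type alternative, both newtypes would wrap Text
instead; the rest of the code would not need to know which choice was
made.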

This would be orthogonal to any algorithmic changes (e.g. hash consing
or similar), which are needed for overall memory use (whereas string
type changes would be useful for localised memory usage and lower CPU
usage due to fewer conversions).

What do you think?

regards,
iustin
