On 06/19/2012 09:08 PM, Eric Blake wrote:
> On 06/19/2012 09:43 AM, Orit Wasserman wrote:
>> Signed-off-by: Benoit Hudzia <benoit.hud...@sap.com>
>> Signed-off-by: Petter Svard <pett...@cs.umu.se>
>> Signed-off-by: Aidan Shribman <aidan.shrib...@sap.com>
>> Signed-off-by: Orit Wasserman <owass...@redhat.com>
>> ---
>>  savevm.c |   91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 91 insertions(+), 0 deletions(-)
>>
>
>> +int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
>> +                         uint8_t *dst, int dlen)
>> +{
>> +    uint32_t zrun_len = 0, nzrun_len = 0;
>> +    int d = 0, i = 0;
>> +    uint8_t *nzrun_start = NULL;
>> +
>> +    while (i < slen) {
>> +        /* overflow */
>> +        if (d + 2 > dlen) {
>> +            return -1;
>> +        }
>> +
>> +        while (!(old_buf[i] ^ new_buf[i]) && ++i <= slen) {
>> +            zrun_len++;
>> +        }
>
> Can we vectorize this to compare a word at a time for improved speed?
> It would be really cool if glibc had an interface like memcmp(), but
> which returned the offset of the first difference instead of a
> comparison of the overall memory regions.  But even without such an
> interface, pseudocode such as:
>
> while (i < slen) {
>     check for overflow
>     if (not aligned)
>         up to sizeof(long) byte comparisons
>     while (long comparisons && not end)
>         increment i by sizeof(long)
>     if (end of buffer) {
>         either buffer unchanged, or we have trailing zrun
>         break;
>     }
>     // found a word that differed
>     do byte-wise comparisons on that word
>     then back to word-size xor, checking for no zero bytes in the xor (or
>     anywhere that a zero byte occurs in isolation, it is more compact to
>     pass that lone byte as part of the xor encoding instead of breaking
>     out to a zrun)
> }

I will add it.
>
>> +
>> +        zrun_len = 0;
>> +        nzrun_start = new_buf + i;
>> +        while ((old_buf[i] ^ new_buf[i]) != 0 && ++i <= slen) {
>> +            nzrun_len++;
>> +        }
>
> Note that for a single unchanged byte with bytes changed on both sides,
> the encoding is more compact if we pass the unchanged byte as part of
> the nzrun, instead of breaking out into a zrun and starting a second
> nzrun.  If I'm analyzing things correctly, that's also true for a run
> of two bytes, and it's not until we get to a run of three or more
> unchanged bytes where we start to get compression (except for the
> corner case of ending on a 1- or 2-byte zrun).  Should we be altering
> this algorithm to take advantage of this?
I used to have this optimization but didn't see any noticeable
performance improvement.  It made the code much more complicated and
less readable, so I chose to remove it.  I think adding SSE support has
a better chance of improving the compression performance.

>
>> +int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen)
>> +{
>> +    int i = 0, d = 0;
>> +    uint32_t count = 0;
>> +
>> +    while (i < slen) {
>> +
>> +        /* zrun */
>> +        i += uleb128_decode_small(src + i, &count);
>> +        d += count;
>> +
>> +        /* overflow */
>> +        g_assert(d <= dlen);
>
> Again, is g_assert() appropriate for parsing user-supplied input, or
> are there more appropriate ways for gracefully failing an incoming
> migration attempt from corrupt data?

I will change it to return -1.

>
>> +
>> +        /* completed decoding */
>> +        if (i == slen - 1) {
>> +            return d;
>> +        }
>> +
>> +        /* nzrun */
>> +        i += uleb128_decode_small(src + i, &count);
>> +
>> +        g_assert(d + count <= dlen);
>> +
>> +        memcpy(dst + d, src + i, count);
>
> Am I correct that you are using XOR to compute the locations of the
> nzrun, but then transferring the original data (and not the XOR data)
> over the wire?  This was not made clear in the doc patch of 3/13.  I
> would suggest adding an example to the docs of a 4k page that consists
> only of a short string at the beginning of the page (and NUL bytes
> thereafter), then show both the before and after version of that page
> to demonstrate sparse memory changes, as well as the XBZRLE encoding
> of that page.

ok.

Orit