> On 12/11/2015 03:49, Li, Liang Z wrote:
> > I am very surprised about the live migration performance result when
> > I use your 'memeqzero4_paolo' instead of these SSE2 intrinsics to
> > check the zero pages.
>
> What code were you using? Remember I suggested using only unsigned long
> checks, like
>
>     unsigned long *p = ...
>     if (p[0] || p[1] || p[2] || p[3]
>         || memcmp(p + 4, p, size - 4 * sizeof(unsigned long)) != 0)
>         return BUFFER_NOT_ZERO;
>     else
>         return BUFFER_ZERO;
>
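[Editor's note: the quoted snippet above is a fragment. A self-contained sketch of the check Paolo describes might look as follows; the function name and the bool return value (instead of BUFFER_ZERO/BUFFER_NOT_ZERO) are assumptions, not the actual QEMU code, and it assumes a word-aligned buffer whose size is a multiple of sizeof(unsigned long) and at least four words, as holds for page-sized buffers.]

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Sketch of Paolo's suggestion: test the first four words by hand, then
 * let memcmp compare the buffer against itself, shifted by four words.
 * Once the first four words are known to be zero, the overlapping
 * comparison succeeds iff every remaining word is zero as well. */
static bool buffer_is_zero_sketch(const void *buf, size_t size)
{
    const unsigned long *p = buf;

    if (p[0] || p[1] || p[2] || p[3]) {
        return false;   /* a non-zero word up front: bail out early */
    }
    /* Bytes [4 words, size) must equal bytes [0, size - 4 words). */
    return memcmp(p + 4, p, size - 4 * sizeof(unsigned long)) == 0;
}
```

The self-memcmp trick avoids keeping a separate zero page around for comparison while still using the libc's optimized memcmp for the bulk of the scan.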
I use the following code:

    bool memeqzero4_paolo(const void *data, size_t length)
    {
        const unsigned char *p = data;
        unsigned long word;

        if (!length)
            return true;

        /* Check len bytes not aligned on a word. */
        while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
            if (*p)
                return false;
            p++;
            length--;
            if (!length)
                return true;
        }

        /* Check up to 16 bytes a word at a time. */
        for (;;) {
            memcpy(&word, p, sizeof(word));
            if (word)
                return false;
            p += sizeof(word);
            length -= sizeof(word);
            if (!length)
                return true;
            if (__builtin_expect(length & 15, 0) == 0)
                break;
        }

        /* Now we know that's zero, memcmp with self. */
        return memcmp(data, p, length) == 0;
    }

> > The total live migration time increased by about
> > 8%, not decreased, although in the unit test your
> > 'memeqzero4_paolo' has better performance. Any idea?
>
> You only tested the case of zero pages. But real pages usually are not
> zero, even if they have a few zero bytes at the beginning. It's very
> important to optimize the initial check before the memcmp call.

In the unit test I only tested zero pages too, and the performance of
'memeqzero4_paolo' was better. But when merged into QEMU, it caused a
performance drop. Why?

> Paolo
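[Editor's note: Paolo's point can be made concrete with a small instrumented sketch; the words_checked counter and the function are hypothetical illustration, not code from the thread. A zero page forces a scan of every word, while a typical non-zero page fails on the very first check, so a zero-page-only unit test measures a cost profile that real migration traffic never sees.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counter: how many words the check actually inspected. */
static size_t words_checked;

/* Word-at-a-time zero check with an early exit, instrumented to show
 * how much work each kind of page costs.  Assumes buf is word-aligned
 * and len is a multiple of sizeof(unsigned long). */
static bool is_zero_counting(const void *buf, size_t len)
{
    const unsigned long *p = buf;
    size_t i, n = len / sizeof(unsigned long);

    for (i = 0; i < n; i++) {
        words_checked++;
        if (p[i]) {
            return false;   /* typical for real pages: stop immediately */
        }
    }
    return true;            /* zero page: every word was scanned */
}
```

For a 4 KiB zero page this inspects all len / sizeof(unsigned long) words, but a page whose first word is non-zero stops after one; any per-call overhead added before that first check is therefore paid on every page, which is one plausible way a function that wins a zero-page microbenchmark can still slow migration down.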