> 11 aug. 2016 kl. 11:15 skrev Palle Girgensohn <[email protected]>: > >> >> 11 aug. 2016 kl. 03:05 skrev Peter Geoghegan <[email protected]>: >> >> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <[email protected]> >> wrote: >>> They've been used for the FreeBSD ports since 2005, and have served us >>> well. I have of course updated them regularly. In this latest version, I've >>> removed support for other encodings beside UTF-8, mostly since I don't know >>> how to test them, but also, I see little point in supporting anything else >>> using ICU. >> >> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5 >> is not the release that introduced its use, it did expand it >> significantly. I think you need to fix this, even though it isn't >> actually used to sort text at present, since presumably FreeBSD builds >> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could >> reasonably trust the ICU equivalent of strxfrm(), so that's a missed >> opportunity. (See the wiki page on the abbreviated keys issue [1] if >> you don't know what I'm talking about.) > > My plan was to get it working without TRUST_STRXFRM first, and then add that > functinality. I've made some preliminary tests using ICU:s ucol_getSortKey > but I will have to test it a bit more. For now, I just expect not to trust > strxfrm. It is the first iteration wrt strxfrm, the plan is to use that code > base.
Here are some preliminary results running 10000 times comparing the same two
strings in a tight loop.
ucol_strcollUTF8: -1 0.002448
strcoll: 1 0.060711
ucol_strcollIter: -1 0.009221
direct memcmp: 1 0.000457
memcpy memcmp: 1 0.001706
memcpy strcoll: 1 0.068425
nextSortKeyPart: -1 0.041011
ucnv_toUChars + getSortKey: -1 0.050379
correct answer is -1, but since we compare åasdf and äasdf with a Swedish
locale, memcmp and strcoll fails of course, as espected. Direct memcmp is 5
times faster than ucol_strcollUTF8 (used in my patch), but sadly the best
implementation using sort keys with ICU, nextSortKeyPart, is way slower.
startTime = getRealTime();
for ( int i = 0; i < loop; i++) {
result = ucol_strcollUTF8(collator, arg1, len1, arg2, len2,
&status);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime -
startTime);
vs
int sortkeysize=8;
startTime = getRealTime();
uint8_t key1[sortkeysize], key2[sortkeysize];
uint32_t sState[2], tState[2];
UCharIterator sIter, tIter;
for ( int i = 0; i < loop; i++) {
uiter_setUTF8(&sIter, arg1, len1);
uiter_setUTF8(&tIter, arg2, len2);
sState[0] = 0; sState[1] = 0;
tState[0] = 0; tState[1] = 0;
ucol_nextSortKeyPart(collator, &sIter, sState, key1,
sortkeysize, &status);
ucol_nextSortKeyPart(collator, &tIter, tState, key2,
sortkeysize, &status);
result = memcmp (key1, key2, sortkeysize);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime -
startTime);
But in your strxfrm code in PostgreSQL, the keys are cached, and represented as
int64:s if I remember correctly, so perhaps there is still a benefit using the
abbreviated keys? More testing is required, I guess...
Palle
signature.asc
Description: Message signed with OpenPGP using GPGMail
