On 20/09/10 01:55, Bruno Haible wrote: > [CCing bug-libunistring. This is a reply to > <http://lists.gnu.org/archive/html/coreutils/2010-09/msg00032.html>.] > > Hi Pádraig, > >> I was doing some performance analysis of the above patch >> and noticed it performed very well usually but not when >> ulc_casecoll or ulc_casecmp were called, when not in a >> UTF-8 locale. In fact it seemed to slow down everything >> by about 5 times? >> >> $ seq -f%010.0f 30000 | tee f1 > f2 >> >> # with ulc_casecmp >> $ time LANG=en_US join -i f1 f2 >/dev/null >> real 0m1.281s >> >> # with ulc_casecoll >> $ time LANG=en_US join -i f1 f2 >/dev/null >> real 0m2.120s >> >> # with ulc_casecmp in UTF8 locale >> $ time LANG=en_US.utf8 join -i f1 f2 >/dev/null >> real 0m0.260s >> >> # with ulc_casecoll in UTF8 locale >> $ time LANG=en_US.utf8 join -i f1 f2 >/dev/null >> real 0m0.437s > > Thanks for the test case. I've produced a similar test case that operates > on the same data, in a way similar to 'join', and profiled it: > $ valgrind --tool=callgrind ./a.out > $ kcachegrind callgrind.out.* > >> # Doing encoding outside gives an indication >> # what the performance should be like: >> # with ulc_casecoll in UTF8 locale >> $ time LANG=en_US.utf8 join -i <(iconv -fiso-8859-1 -tutf8 f1) \ >> <(iconv -fiso-8859-1 -t utf8 f2) >/dev/null >> real 0m0.462s > > This is an unfair comparison, because it throws overboard the requirement > that 'join' has to be able to deal with input that 'iconv' would not accept. > >> A quick callgraph of ulc_casecoll gives: >> >> ulc_casecoll(s1,s2) >> ulc_casexfrm(s1) >> u8_conv_from_encoding() >> mem_iconveha(from,"UTF8",translit=true) >> mem_iconveh() >> iconveh_open() >> mem_cd_iconveh() >> mem_cd_iconveh_internal() >> iconv() >> iconveh_close() >> u8_casexfrm >> u8_ct_casefold() >> u8_casemap() >> u8_conv_to_encoding() >> ... >> memxfrm() >> ulc_casexfrm(s2) >> memcmp(s1,s2) > > Unfortunately such a callgraph is not quantitative. I needed a > profiler to really make sense out of it. > > The profiling showed me three things: > - The same string is being converted to UTF-8, then case-folded > according to Unicode and locale rules, then converted back to locale > encoding, and then strxfrm'ed, more than once. CPU time would be > minimized (at the expense of more memory use) if this was done only > once for each input line. It is not clear at this point how much > it must be done in join.c and how much help for this can be added > to libunistring.
Well my first thought is that libunistring might expose a version of these "conv" functions with a context parameter to reference the state. > - There are way too many iconv_open calls. In particular, while > ulc_casecmp and ulc_casecoll have to convert two string arguments > and could use the same iconv_t objects for both arguments, they > open fresh iconv_t objects for each argument. > - Additionally, every call to iconveh_open with target encoding UTF-8 > calls iconv_open twice, where once would be sufficient. Good stuff. >> So one can see the extra overhead involved when not in UTF-8. >> Seems like I'll have to efficiently convert fields to utf8 >> internally first before calling u8_casecoll()? > > If there was a guarantee that all strings can be converted to UTF-8, yes. > But there is no such guarantee for 'join'. The keycmp function in > <http://lists.gnu.org/archive/html/coreutils/2010-09/msg00029.html> is > therefore actually wrong: It does not always satisfy the requirement > keycmp(A,B) < 0 && keycmp(B,C) < 0 ==> keycmp(A,C) < 0 > The required handling of input strings that are not valid in the locale > encoding is more elaborate that I thought. I need to think more about > this issue. These are the sort of edge cases I was worrying about when looking at normalization and punted until later. cheers, Pádraig.