On Tue, 2010-05-04 at 22:13 +0200, Aleksander Morgado wrote:

> But apart from that, the performance difference between the
> glib-parser tests and the unicode-based parsers are really not
> comparable: If all processed the same number of words, it really
> seems that both libunistring-based one and libicu-based one would
> behave better even for ASCII, and all the normalization and
> case-folding issues would be solved, and using a single
> implementation for any kind of input string (even with mixed CJK
> and non-CJK).
It's more likely that Tracker pre-checking the encoding to decide
whether to use Pango or not is causing too much overhead, especially
if the input string is small.

The ideal solution IMO would still be for Tracker to remove the
pre-check, iterate, and use either the current ASCII path or the
libunistring/libicu one depending on the encoding of the current
word. It should be easy to remove Pango and pass the non-ASCII stuff
off to be treated differently.

For ASCII, we just do what it currently does (iterate, break, convert
to lowercase and validate, without any further iterations). There's
no need for normalization or any other treatment, so it should be as
optimal as can be. It would indeed be interesting to see how that
benchmarks against your unicode stuff.

For non-ASCII, we can easily add your libunistring-based stuff: the
parser, upon hitting a non-ASCII character, simply rolls back and
passes the start of the word to your unicode libs, which then do the
additional normalization and UNAC steps. (Rough sketch of what I mean
below my sig.)

That's probably the easiest way to get the best of both worlds
(assuming there's a significant difference between Tracker and the
unicode libs for ASCII).

Obviously if everything could be done in a single iteration then it
would rock, but as you say that might be a lot of work.

jamie
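P.S. Here's a rough, untested sketch of the two-path loop I have in
mind, in C against libunistring. The helper names (parse_text,
process_word_unicode, emit_word) and the word-length cap are made up
for illustration; u8_casefold and UNINORM_NFC are the real
libunistring API (one call can case-fold and normalize together), but
don't take the rest as gospel:

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unicase.h>   /* u8_casefold() */
#include <uninorm.h>   /* UNINORM_NFC */

/* Stand-in for whatever Tracker does with a finished word. */
static void
emit_word (const char *word, size_t len)
{
  printf ("word: %.*s\n", (int) len, word);
}

/* Slow path: hand the whole rolled-back word to libunistring.
 * Passing UNINORM_NFC makes u8_casefold fold case *and* normalize in
 * one call; the UNAC accent-stripping step would slot in after this. */
static void
process_word_unicode (const uint8_t *word, size_t len)
{
  size_t folded_len;
  uint8_t *folded;

  folded = u8_casefold (word, len, NULL, UNINORM_NFC, NULL, &folded_len);
  if (folded != NULL)
    {
      emit_word ((const char *) folded, folded_len);
      free (folded);
    }
}

/* Single iteration over the input: ASCII words are broken, lowercased
 * and emitted on the fly; the moment a non-ASCII byte shows up we mark
 * the word and, once it ends, roll back to its start and reprocess it
 * through the unicode path. */
static void
parse_text (const char *text, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      /* Skip ASCII separators (space, punctuation, ...). */
      while (i < len && (unsigned char) text[i] < 0x80
             && !isalnum ((unsigned char) text[i]))
        i++;
      if (i >= len)
        break;

      size_t start = i;
      int ascii_only = 1;
      char buf[64];             /* arbitrary cap, sketch only */
      size_t w = 0;

      /* Consume the word, lowercasing as we go. */
      while (i < len && ((unsigned char) text[i] >= 0x80
                         || isalnum ((unsigned char) text[i])))
        {
          if ((unsigned char) text[i] >= 0x80)
            ascii_only = 0;
          else if (w < sizeof buf)
            buf[w++] = (char) tolower ((unsigned char) text[i]);
          i++;
        }

      if (ascii_only)
        emit_word (buf, w);     /* fast path: done in one pass */
      else
        process_word_unicode ((const uint8_t *) text + start,
                              i - start);  /* rollback */
    }
}

int
main (void)
{
  const char *s = "Hello WORLD \xc3\x89cole na\xc3\xafve";
  parse_text (s, strlen (s));
  return 0;
}

For proper word breaking on the non-ASCII side you'd really want
u8_wordbreaks from <uniwbrk.h> rather than my naive "anything >= 0x80
belongs to the word" rule, but that's the general shape of it.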