Hi all,

I sent a patch for GB#619244 (https://bugzilla.gnome.org/show_bug.cgi?id=619244) in order to drop the libunac dependency in tracker.
The patch provides a new per-parser unaccenting method. It uses the same logic as the one in libunac, but does not need the explicit conversion to/from UTF-16BE.

In brief, the results show that in the best case parsing is up to 73% faster, while in the worst case, where no unaccenting needs to be done, the times stay pretty much the same.

The times given are the best ones after 10-15 runs with the same file. The test files are available at:
http://www.lanedo.com/~aleksander/gnome-tracker/tracker-unaccent-tests/unaccent-tests.tgz

1) File: accents-big.txt
 * unaccenting needed in ALL words
 * contains NFC and NFD text
 * size = 2.7MBytes

 libunac + glib/pango              --> 1.383s
 libunac + libunistring            --> 2.295s
 libunac + libicu                  --> 1.877s
 custom-unaccent + glib/pango      --> 0.587s (58% faster than libunac)
 custom-unaccent + libunistring    --> 0.822s (65% faster than libunac)
 custom-unaccent + libicu          --> 0.519s (73% faster than libunac)

So, if unaccenting needs to be done in ALL words, the libicu parser with the custom unaccenting method is the fastest one. This is a corner case anyway, as unaccenting is never really needed in all words of a given file, but at least it shows how much faster it goes.

2) File: mixed-big.txt
 * unaccenting needed only in some words
 * contains mixed languages
 * size = 2.7MBytes

 libunac + glib/pango              --> ...several minutes
 libunac + libunistring            --> 0.648s
 libunac + libicu                  --> 0.929s
 custom-unaccent + glib/pango      --> ...several minutes
 custom-unaccent + libunistring    --> 0.386s (41% faster than libunac)
 custom-unaccent + libicu          --> 0.636s (32% faster than libunac)

In this case, where only some words need unaccenting (a more general case than the previous one), the libunistring parser with the custom unaccenting method is the fastest one. The glib/pango parser does not perform well with this file.
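For anyone not familiar with what "unaccenting" means in the tests above, here is a minimal sketch of the general idea in Python. It is not the code from the patch (which is C and works inside each parser); the function name `unaccent` and the choice of NFKD decomposition followed by dropping combining marks are my assumptions for illustration only. It also shows why no UTF-16BE round trip is needed when the string is handled directly as code points.

```python
import unicodedata


def unaccent(word: str) -> str:
    """Illustrative only: strip accents by decomposing and dropping
    combining marks. Hypothetical helper, not the tracker patch."""
    # Fast path (assumption): an ASCII-only word has nothing to unaccent.
    if word.isascii():
        return word
    # Decompose so that base letters and accents become separate code points.
    # This makes NFC and NFD input behave the same way.
    decomposed = unicodedata.normalize("NFKD", word)
    # Keep only code points whose canonical combining class is 0,
    # i.e. drop the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # NFC input (precomposed e-acute) and NFD input (e + combining acute).
    print(unaccent("caf\u00e9"))
    print(unaccent("cafe\u0301"))
    print(unaccent("plain"))
```

With a helper like this, precomposed (NFC) and decomposed (NFD) spellings of the same word end up as the same unaccented string, which is what a file like accents-big.txt exercises.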
3) File: ascii-big.txt
 * no unaccenting needed
 * contains ASCII only
 * size = 2.7MBytes

 libunac + glib/pango              --> 0.545s
 libunac + libunistring            --> 0.253s
 libunac + libicu                  --> 0.630s
 custom-unaccent + glib/pango      --> 0.495s (10% faster than libunac)
 custom-unaccent + libunistring    --> 0.252s (almost same)
 custom-unaccent + libicu          --> 0.628s (almost same)

In theory, this test should give more or less the same times for both the libunac and the custom unaccenting methods, and that is the case for the libicu and libunistring parsers; but the glib/pango one seems 10% faster when using the custom unaccenting method. The reason seems to be that the glib/pango implementation does not skip unaccenting when the string is ASCII-only.

Cheers,

--
Aleksander
