On Tue, 2010-05-04 at 22:13 +0200, Aleksander Morgado wrote:

> But apart from that, the performance difference between the
> glib-parser tests and the unicode-based parsers are really not
> comparable: If all processed the same number of words, it really
> seems that both libunistring-based one and libicu-based one would
> behave better even for ASCII, and all the normalization and
> case-folding issues would be solved, and using a single
> implementation for any kind of input string (even with mixed CJK
> and non-CJK).
It's more likely that Tracker pre-checking the encoding to decide
whether to use Pango or not is causing too much overhead, especially
if the input string is small.

The ideal solution IMO would still be for Tracker to remove the
pre-check, iterate, and use either the current ASCII path or the
libunistring/libicu one depending on the encoding of the current
word. It should be easy to remove Pango and pass the non-ASCII stuff
off to be treated differently.

For ASCII, we just do what it currently does (iterate, break, convert
to lowercase and validate, without any further iterations). There's
no need for normalization or any other treatment, so it should be as
optimal as can be. It would indeed be interesting to see how that
benchmarks against your unicode stuff.

For non-ASCII, we can easily add your libunistring-based stuff: the
parser, upon hitting a non-ASCII character, simply rolls back and
passes the start of the word to your unicode libs, which then do the
additional normalization and UNAC steps. (Rough sketch of what I mean
below my sig.)

That's probably the easiest way to get the best of both worlds
(assuming there's a significant difference between Tracker and the
unicode libs for ASCII).

Obviously if everything could be done in a single iteration then it
would rock, but as you say that might be a lot of work.

jamie
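P.S. Here's a rough, untested sketch of the two-path loop I have in
mind, in C against libunistring. The helper names (parse_text,
process_word_unicode, emit_word) and the word-length cap are made up
for illustration; u8_casefold and UNINORM_NFC are the real
libunistring API (one call can case-fold and normalize together), but
don't take the rest as gospel:

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unicase.h>   /* u8_casefold() */
#include <uninorm.h>   /* UNINORM_NFC */

/* Stand-in for whatever Tracker does with a finished word. */
static void
emit_word (const char *word, size_t len)
{
  printf ("word: %.*s\n", (int) len, word);
}

/* Slow path: hand the whole rolled-back word to libunistring.
 * Passing UNINORM_NFC makes u8_casefold fold case *and* normalize in
 * one call; the UNAC accent-stripping step would slot in after this. */
static void
process_word_unicode (const uint8_t *word, size_t len)
{
  size_t folded_len;
  uint8_t *folded;

  folded = u8_casefold (word, len, NULL, UNINORM_NFC, NULL, &folded_len);
  if (folded != NULL)
    {
      emit_word ((const char *) folded, folded_len);
      free (folded);
    }
}

/* Single iteration over the input: ASCII words are broken, lowercased
 * and emitted on the fly; the moment a non-ASCII byte shows up we mark
 * the word and, once it ends, roll back to its start and reprocess it
 * through the unicode path. */
static void
parse_text (const char *text, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      /* Skip ASCII separators (space, punctuation, ...). */
      while (i < len && (unsigned char) text[i] < 0x80
             && !isalnum ((unsigned char) text[i]))
        i++;
      if (i >= len)
        break;

      size_t start = i;
      int ascii_only = 1;
      char buf[64];             /* arbitrary cap, sketch only */
      size_t w = 0;

      /* Consume the word, lowercasing as we go. */
      while (i < len && ((unsigned char) text[i] >= 0x80
                         || isalnum ((unsigned char) text[i])))
        {
          if ((unsigned char) text[i] >= 0x80)
            ascii_only = 0;
          else if (w < sizeof buf)
            buf[w++] = (char) tolower ((unsigned char) text[i]);
          i++;
        }

      if (ascii_only)
        emit_word (buf, w);     /* fast path: done in one pass */
      else
        process_word_unicode ((const uint8_t *) text + start,
                              i - start);  /* rollback */
    }
}

int
main (void)
{
  const char *s = "Hello WORLD \xc3\x89cole na\xc3\xafve";
  parse_text (s, strlen (s));
  return 0;
}

For proper word breaking on the non-ASCII side you'd really want
u8_wordbreaks from <uniwbrk.h> rather than my naive "anything >= 0x80
belongs to the word" rule, but that's the general shape of it.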