Re: [Tracker] libicu & libunistring based parsers (was:Re: libunistring-based parser in libtracker-fts)

Jamie McCracken Wed, 05 May 2010 08:12:49 -0700

On Wed, 2010-05-05 at 12:53 +0200, Aleksander Morgado wrote:
> Hi Jamie & all,
> 
> > 
> > I will modify the libunistring and libicu based algorithms tomorrow so
> > that if a word is 7-bit ASCII only, normalization and casefolding are
> > not done, just a tolower() of each character. That would bring the
> > values closer to those of the glib/custom parser.
> > 
> 
> Just finished the ASCII-only improvement in both libunistring and
> libicu, and here are the new results. This time instead of the mean
> value of several tests, I took the minimum one.
> 
> For the 50k ASCII-only file:
>  * glib/pango:   0.062
>  * libicu:       0.060
>  * libunistring: 0.057
> 
> For the 200k ASCII-only file:
>  * glib/pango:   0.189
>  * libicu:       0.200
>  * libunistring: 0.119
> 
> And for the 182k mixed English/Chinese/Japanese file:
>  * glib/pango:   21.4
>  * libicu:        0.220
>  * libunistring:  0.175
> 
> So, with this improvement considering ASCII-only words a special case,
> libunistring really beats them all.
> 
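
For reference, the ASCII-only special case described above could look
roughly like the sketch below. This is only an illustration, not the code
from either parser: the function names are hypothetical, and the manual
A-Z mapping is used as a locale-independent stand-in for the tolower()
call mentioned in the quoted mail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte of word[0..len) is 7-bit ASCII. */
static bool word_is_ascii(const char *word, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char) word[i] & 0x80)
            return false;
    }
    return true;
}

/*
 * Hypothetical fast path: lowercases the word in place if it is ASCII-only
 * and returns true. Returns false and leaves the word untouched otherwise,
 * so the caller can fall back to full normalization and casefolding.
 */
static bool lowercase_if_ascii(char *word, size_t len)
{
    if (!word_is_ascii(word, len))
        return false;

    for (size_t i = 0; i < len; i++) {
        if (word[i] >= 'A' && word[i] <= 'Z')
            word[i] = (char) (word[i] - 'A' + 'a');
    }
    return true;
}
```

The caller would run the libunistring or libicu normalization and
casefolding only when lowercase_if_ascii() returns false.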
 
Yeah, libunistring looks like good stuff - I must check the source!


I'll note that you still need to apply the word filtering rules to words
beginning with numbers or symbols - I'm sure that's easy to do?
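
Something along these lines is what I have in mind, as a rough sketch
only. The function name is made up, and two things here are assumptions
rather than the existing rules: only the first byte is checked, and any
non-ASCII leading byte is let through, where a real parser would classify
that character with the Unicode library instead.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical filter: rejects empty words and words whose first character
 * is an ASCII digit or symbol. Non-ASCII leading bytes are accepted here
 * (assumption); proper classification is left to the Unicode library.
 */
static bool word_passes_start_filter(const char *word, size_t len)
{
    if (len == 0)
        return false;

    unsigned char c = (unsigned char) word[0];

    if (c >= 0x80)
        return true;

    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
```

With this, "tracker" passes while "2010" and "_tmp" are dropped.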

thanks

jamie

_______________________________________________
tracker-list mailing list
tracker-list@gnome.org
http://mail.gnome.org/mailman/listinfo/tracker-list
