Hi Jamie,

> Just to make it clear, I believe the tracker way of iterating over the
> characters is the optimal solution as it allows us to detect encoding,
> keep track of length, perform word validation (aforementioned parse
> rules concerning starting character)  and perform word breaking without
> any further character iterations (unlike the unicode implementations)
> 

I wouldn't be worried about the extra character iteration looking for
ASCII characters, as that operation is just a comparison per byte until
a non-ASCII byte is found in the word. On the other hand, the current
glib-based parser, for example, needs to convert each UTF-8 character
found in the per-character iteration to a 32-bit gunichar
(g_utf8_get_char() calls) and back to UTF-8 (g_ucs4_to_utf8() call),
which is not needed when using libunistring. Also, if NFC normalization
needs to be applied to the whole string before running the glib/pango
parser (as it seems), while storing the offsets of all the characters
in the original non-normalized string (for the offsets() and snippet()
FTS stuff), that's an extra iteration required by the glib/pango parser
but not by libunistring or libicu, as their word-break algorithm works
fine with NFD strings.
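
Just to make the round-trip explicit, a glib-based per-character pass
has roughly this shape (a sketch of my own, NOT the actual
tracker-parser.c code):

  /* Rough sketch (my simplification, not the actual tracker parser
   * code): each character is expanded from UTF-8 to a 32-bit gunichar,
   * processed, and then the whole buffer is re-encoded to UTF-8. */
  #include <glib.h>

  static gchar *
  fold_word_glib (const gchar *word)
  {
    glong n_chars = g_utf8_strlen (word, -1);
    gunichar *buf = g_new (gunichar, n_chars + 1);
    const gchar *p = word;
    gchar *folded;
    glong i;

    for (i = 0; i < n_chars; i++, p = g_utf8_next_char (p))
      buf[i] = g_unichar_tolower (g_utf8_get_char (p)); /* UTF-8 -> gunichar */

    buf[n_chars] = 0;

    /* ...and back again to UTF-8 before the word can be stored */
    folded = g_ucs4_to_utf8 (buf, n_chars, NULL, NULL, NULL);
    g_free (buf);

    return folded;
  }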

All three algorithms have good points and bad points, which is why I
would say they should be compared as a whole.

Anyway, I agree that the fastest and most complete solution would be
one doing all the needed things in a single iteration: NFC
normalization, word-break detection, proper case-folding (not
character-per-character!)... even accent stripping and stemming could
be done if we were to develop such a function (and that really would be
a great performance improvement, BTW), but it is probably a huge amount
of work, only useful for the Tracker case, and very difficult to
maintain.

> Therefore one question I would ask if the unicode libs provide ability
> to check if a single character is a word break? 
> 

Actually, that's the wrong way of looking at it :-) Unicode specifies
"where NOT to break a word" rules, and if none of them applies to the
given case, the default is "break everywhere". It's explained in
UAX#29:
http://unicode.org/reports/tr29/#Default_Word_Boundaries

There are something like 14 rules to check before deciding that there
is a word break between two Unicode characters (so it is not a matter
of single characters being word-breakers by themselves).
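
For example, with libunistring you never ask whether a single character
is a breaker; you ask for all the boundaries of the string at once. A
minimal sketch, assuming the u8_wordbreaks() interface from
<uniwbrk.h>:

  /* Minimal sketch: u8_wordbreaks() computes all the UAX#29 word
   * boundaries of a string in one go, writing one flag per byte, set
   * where a boundary falls just before that byte. */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <uniwbrk.h>

  int
  main (void)
  {
    const char *str = "can't stop, won't stop";
    size_t len = strlen (str);
    char *breaks = malloc (len);
    size_t i;

    u8_wordbreaks ((const uint8_t *) str, len, breaks);

    for (i = 0; i < len; i++)
      if (breaks[i])
        printf ("boundary before byte %u\n", (unsigned) i);

    free (breaks);
    return 0;
  }

Note how no boundary gets reported inside "can't"; that's one of those
"don't break here" rules in action.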

> If so it could be added to the existing tracker parser with little
> modification and overhead (and only used if word is non-ascii of
> course!)
> 

That would mean re-implementing the word-break algorithm from
libunistring or libicu instead of re-using the original ones :-)

BTW, it seems Chinese strings are currently not properly word-broken
by the glib/pango parser; Pango is not actually being used for them, I
mean. Probably the ranges of CJK characters need to be reviewed.

> Also the pango way of allocation structs to store attributes for each
> character is indeed crazy and insanely slow but fortunatley we only use
> that for CJK text but in any case we could indeed optimize that area
> with some of the work you have done
> 
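
Right. For reference, that per-character attribute thing is basically
the pango_get_log_attrs() path, which needs one PangoLogAttr filled in
per character just to find word boundaries. Roughly (a hypothetical
sketch of my own, not the actual parser code):

  /* Hypothetical sketch (not the actual parser code): getting word
   * boundaries via Pango means allocating and filling one PangoLogAttr
   * struct per character of the text. */
  #include <string.h>
  #include <glib.h>
  #include <pango/pango.h>

  static void
  show_word_starts (const gchar *text)
  {
    glong n_chars = g_utf8_strlen (text, -1);
    PangoLogAttr *attrs = g_new0 (PangoLogAttr, n_chars + 1);
    glong i;

    pango_get_log_attrs (text, (int) strlen (text),
                         -1, /* unknown embedding level */
                         pango_language_get_default (),
                         attrs, (int) n_chars + 1);

    for (i = 0; i < n_chars + 1; i++)
      if (attrs[i].is_word_start)
        g_print ("word starts at character %ld\n", i);

    g_free (attrs);
  }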

I really wouldn't split between non-CJK and CJK if the performance for
ASCII is comparable using libunistring/libicu (which it seems it is).
The best thing about the libunistring/libicu based parsers is that
there is a single algorithm for any string, whatever characters it
contains, and maintaining that should be trivial compared to the
glib/pango case.

Also, splitting the algorithm between non-CJK and CJK would again be
faulty for documents mixing, for example, English and Chinese strings.
Probably not the case on my computer or yours, but very likely on a
Japanese or Chinese user's computer.
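
Just to illustrate the single-algorithm point, a minimal sketch using
ICU's C break iterator (the mixed English/Chinese sample string is only
my own example): the same UAX#29 pass covers both scripts, with no
CJK-range checks anywhere.

  /* Minimal sketch: one UBreakIterator word-break pass over a mixed
   * English/Chinese string, no per-script special cases. */
  #include <stdio.h>
  #include <unicode/ubrk.h>
  #include <unicode/ustring.h>
  #include <unicode/utypes.h>

  int
  main (void)
  {
    const char *utf8 = "hello 你好 world";
    UChar text[64];
    int32_t text_len = 0;
    UErrorCode status = U_ZERO_ERROR;
    UBreakIterator *bi;
    int32_t start, end;

    /* ICU works internally on UTF-16, so convert the UTF-8 input first */
    u_strFromUTF8 (text, 64, &text_len, utf8, -1, &status);

    bi = ubrk_open (UBRK_WORD, "en", text, text_len, &status);
    if (U_FAILURE (status))
      return 1;

    start = ubrk_first (bi);
    for (end = ubrk_next (bi); end != UBRK_DONE;
         start = end, end = ubrk_next (bi))
      printf ("segment: [%d, %d)\n", (int) start, (int) end);

    ubrk_close (bi);
    return 0;
  }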

Anyway, tomorrow I will spend some time doing additional tests for the
ASCII-only case, and will try to compare the three parsers in this
specific situation.

Cheers!

-- 
Aleksander

