Hi Jamie,

> Just to make it clear, I believe the tracker way of iterating over the
> characters is the optimal solution as it allows us to detect encoding,
> keep track of length, perform word validation (aforementioned parse
> rules concerning starting character) and perform word breaking without
> any further character iterations (unlike the unicode implementations)
I wouldn't be worried about the extra character iteration looking for
ASCII characters, as that operation is just a comparison per byte until a
non-ASCII byte is found in the word. On the other hand, the current
glib-based parser, for example, needs to convert each UTF-8 character
found in the per-character iteration to a 32-bit gunichar
(g_utf8_get_char() calls) and back to UTF-8 (g_ucs4_to_utf8() call), which
is not needed when using libunistring.

Also, if NFC normalization needs to be applied to the whole string before
using the glib/pango parser (as it seems), while storing the offsets of
all the characters in the original non-normalized string (for the
offsets() and snippet() FTS stuff), that's an extra iteration the
glib/pango parser would need; it is not needed with libunistring or
libicu, as their word-break algorithms work fine with NFD strings.

All three algorithms have good points and bad points, so I would say they
should be compared as a whole. Anyway, I agree that the fastest (and
ideal) solution would be one doing all the needed things in a single
iteration: NFC normalization, word-break detection, proper case-folding
(not character by character!)... even accent stripping and stemming could
be done if we were to develop such a function (and that would really be a
great performance improvement, btw), but that is probably a huge amount of
work only useful for the Tracker case, and very difficult to maintain.

> Therefore one question I would ask if the unicode libs provide ability
> to check if a single character is a word break?

Actually, that's the wrong way of doing it :-) Unicode states "where not
to break a word" rules, and if none of them apply to the given case, it
says "break everywhere". It's explained in UAX#29:
http://unicode.org/reports/tr29/#Default_Word_Boundaries
There are something like 14 rules to check before deciding that there is a
word break between two Unicode characters (so the characters themselves
are not considered word-breakers).

> If so it could be added to the existing tracker parser with little
> modification and overhead (and only used if word is non-ascii of
> course!)

That would mean re-implementing the word-break algorithm from libunistring
or libicu instead of re-using the originals :-)

BTW, it seems Chinese strings are currently not properly word-broken by
the glib/pango parser; Pango is not actually being used for them, I mean.
Probably the ranges of CJK characters need to be reviewed.

> Also the pango way of allocation structs to store attributes for each
> character is indeed crazy and insanely slow but fortunatley we only use
> that for CJK text but in any case we could indeed optimize that area
> with some of the work you have done

I really wouldn't split between non-CJK and CJK, if the performance for
ASCII is comparable using libunistring/libicu (and it seems it is). The
best thing about the libunistring/libicu-based parsers is really that
there is a single algorithm for any string, whatever characters it
contains, and maintaining that algorithm should be trivial compared to the
glib/pango case. Also, an algorithm split between non-CJK and CJK would
again be faulty for documents mixing, for example, English and Chinese
strings. Probably not the case on my computer or yours, but very likely on
a Japanese or Chinese user's computer.

Anyway, tomorrow I will spend some time doing additional tests for the
ASCII-only case, and will try to compare the three parsers in this
specific situation.
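Just to illustrate what I mean by reusing the original algorithm instead
of re-implementing the rules (a rough sketch only, not the actual parser
patch): with libunistring you get all the UAX#29 default word boundaries
of a UTF-8 string in a single u8_wordbreaks() call, instead of asking
character by character whether there is a break:

  /* Sketch only: split a UTF-8 string on UAX#29 default word boundaries
   * using libunistring.  Build with: gcc wordbreak.c -lunistring */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <uniwbrk.h>

  static void
  print_segments (const char *text)
  {
    size_t len = strlen (text);
    char *breaks = malloc (len);
    size_t start = 0, i;

    /* One pass over the whole string: breaks[i] != 0 means a word
     * boundary is allowed right before byte i. */
    u8_wordbreaks ((const uint8_t *) text, len, breaks);

    for (i = 1; i <= len; i++)
      {
        if (i == len || breaks[i])
          {
            /* Segments also include whitespace/punctuation runs; the
             * parser would simply skip those. */
            printf ("segment: '%.*s'\n", (int) (i - start), text + start);
            start = i;
          }
      }

    free (breaks);
  }

  int
  main (void)
  {
    /* Mixed English/Chinese input, the case where splitting the parser
     * between non-CJK and CJK would fall apart. */
    print_segments ("hello 世界 world");
    return 0;
  }

libicu gives the same thing through its word BreakIterator (ubrk_open()
with UBRK_WORD): the library applies all the rules, we just walk the
resulting boundaries.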
Cheers!

--
Aleksander