On Thu, 2010-04-22 at 18:34 +0200, Aleksander Morgado wrote:
> Hi all!
> 
> I'm currently analyzing the issue reported at GB#579756 (Unicode
> Normalization is broken in Indexer and/or Search):
> https://bugzilla.gnome.org/show_bug.cgi?id=579756
> 
> All my comments below apply to the contents of nie:plainTextContent;
> they are not directly related to the bug report itself, which may
> still be an issue in the FTS algorithm.
> 
> 
> Normalization:
> 
> Shouldn't tracker use a single Unicode normalization form for the list
> of words stored in nie:plainTextContent? For text search, a decomposed
> form such as NFD would probably be preferred. This would mean calling
> g_utf8_normalize() with the G_NORMALIZE_NFD argument for each string
> to be added to nie:plainTextContent.
> 
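(For reference, that call would look something like this with plain
GLib; normalize_for_fts() is only an illustrative name, not existing
tracker code:)

  #include <glib.h>

  /* Return a newly-allocated NFD-normalized copy of a UTF-8 string,
   * as it would be stored in nie:plainTextContent. */
  static gchar *
  normalize_for_fts (const gchar *text)
  {
    /* G_NORMALIZE_NFD performs canonical decomposition, so e.g. a
     * precomposed "é" becomes "e" plus a combining acute accent. */
    return g_utf8_normalize (text, -1, G_NORMALIZE_NFD);
  }
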
> 
> Word breaks:
> 
> When text content is extracted from several document types (msoffice,
> oasis, pdf...), a simple word-break algorithm is used, basically
> looking for letters. This algorithm is far from perfect, as it doesn't
> follow the common word-breaking rules in UAX #29:
> http://unicode.org/reports/tr29/#Word_Boundaries
> 
> As an example, take a file containing the following three strings
> (English first, Chinese second, Japanese katakana last):
> "Simple english text\n
> 本州最主流的风味,使用日本酱油、鸡肉和蔬菜。可隨個人喜好加入油辣和胡椒。
> \n
> ホモ・サピエンス"
> 
> With the current algorithm (tracker_text_normalize() in
> libtracker-extract), only 10 words are found, separated by whitespace
> in the following way:
> "Simple english text 本州最主流的风味 使用日本酱油 鸡肉和蔬菜 可隨個人喜
> 好加入油辣和胡椒  ホモ サピエンス"
> 
> With a proper word-break detection algorithm, you would instead find
> 37 correct words:
> "Simple english text 本 州 最 主 流 的 风 味 使 用 日 本 酱 油 鸡 肉 和
> 蔬 菜 可 隨 個 人 喜 好 加 入 油 辣 和 胡 椒  ホモ サピエンス"
> 
> Chinese ideographs are considered separate words, while katakana
> characters are not. This is just one example of what proper word
> detection should produce.
> 
> I already have a custom version of tracker_text_normalize() which
> properly does the word-break detection, using GNU libunistring. Now,
> if applied, should libunistring become a mandatory dependency for
> tracker? Another option would probably be to use Pango, but I doubt
> Pango is a good dependency for libtracker-extract.
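
(For context, libunistring's UAX #29 API looks roughly like this; a
minimal sketch of the idea, not the actual patch, and print_words() is
just an illustrative name:)

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <uniwbrk.h>   /* GNU libunistring: u8_wordbreaks() */

  /* Print each UAX #29 word segment of a UTF-8 string on its own
   * line.  Segments containing only whitespace or punctuation would
   * still need to be filtered out by the caller. */
  static void
  print_words (const char *text)
  {
    size_t len = strlen (text);
    char *breaks = malloc (len);
    size_t start = 0, i;

    /* breaks[i] is set when a word boundary precedes byte i */
    u8_wordbreaks ((const uint8_t *) text, len, breaks);

    for (i = 1; i <= len; i++)
      {
        if (i == len || breaks[i])
          {
            printf ("%.*s\n", (int) (i - start), text + start);
            start = i;
          }
      }
    free (breaks);
  }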


Word-break detection is done in
http://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-parser.c

This is highly optimised and checks whether the text is plain ASCII,
Latin or CJK to determine which word-breaking algorithm to use.

For CJK we always use Pango to word break, as this is believed to be
correct (although it is too slow to use for non-CJK text).
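
(Roughly, Pango exposes word boundaries like this; a sketch of the
general API, not the exact tracker-parser code, and show_word_starts()
is just an illustrative name:)

  #include <string.h>
  #include <pango/pango.h>

  /* Print the byte offset of every word start that Pango reports. */
  static void
  show_word_starts (const char *text)
  {
    glong n_chars = g_utf8_strlen (text, -1);
    PangoLogAttr *attrs = g_new0 (PangoLogAttr, n_chars + 1);
    const char *p = text;
    glong i;

    /* level -1 means the bidi embedding level is unknown */
    pango_get_log_attrs (text, strlen (text), -1,
                         pango_language_get_default (),
                         attrs, n_chars + 1);

    for (i = 0; i < n_chars; i++, p = g_utf8_next_char (p))
      {
        if (attrs[i].is_word_start)
          g_print ("word starts at byte %ld\n", (glong) (p - text));
      }
    g_free (attrs);
  }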

I don't know why tracker_text_normalize() exists or why it's used
instead of the above, but clearly, if the tracker-parser one is
correct, then it should be using that one. (The parser also does NFC
normalization.)

Of course, I can't see why normalization needs to be done prior to the
parsing; surely only UTF-8 validation is needed there (re-normalizing
just wastes CPU).
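
(That validation-only step would be something like this with plain
GLib; valid_utf8_prefix() is just an illustrative name:)

  #include <glib.h>

  /* Return the longest valid-UTF-8 prefix of the input, newly
   * allocated; much cheaper than re-normalizing the whole string. */
  static gchar *
  valid_utf8_prefix (const gchar *text)
  {
    const gchar *end = text;

    /* On failure, 'end' points just past the last valid byte. */
    g_utf8_validate (text, -1, &end);
    return g_strndup (text, end - text);
  }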

jamie
