Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

2010-04-23 Thread Jamie McCracken
On Fri, 2010-04-23 at 09:17 +0100, Martyn Russell wrote:
> Thanks Aleksander.
>
> I think it makes sense to fix this. Just to be clear, does this mean we
> don't need Pango in libtracker-fts/tracker-parser.c to determine word
> breaks for CJK?

That's not broken so would not recommend trying to

Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

2010-04-23 Thread Aleksander Morgado
Hi Martyn,

> I think it makes sense to fix this. Just to be clear, does this mean we
> don't need Pango in libtracker-fts/tracker-parser.c to determine word
> breaks for CJK?

Well, of course, I'm not sure about this. I understand the need of word-breaking in libtracker-fts, but I could also un

Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

2010-04-23 Thread Martyn Russell
On 22/04/10 17:34, Aleksander Morgado wrote:
> Hi all!

Hi,

> Word breaks:
> When text content is extracted from several doc types (msoffice, oasis,
> pdf...), a simple word break algorithm is used, basically looking for
> letters. This algorithm is far from perfect, as it doesn't follow the
> common rul

Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

2010-04-23 Thread Aleksander Morgado
Hi Jamie,

> word break detection is done in
> http://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-parser.c
>
> This is highly optimised and checks for plain ASCII/Latin/CJK
> encodings to determine which word-breaking algorithm to use.
>
> For CJK we always use Pango to wo
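
For readers following the thread, the kind of ASCII/Latin/CJK check mentioned in the quoted message could look roughly like the sketch below. This is not the code in tracker-parser.c. The names `script_class`, `classify_codepoint`, `decode_utf8` and `classify_text` are hypothetical, the code point ranges are a simplified selection of Unicode blocks, and the decoder assumes valid, NUL-terminated UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classes, ordered so that a "wider" class compares greater. */
typedef enum { SCRIPT_ASCII, SCRIPT_LATIN, SCRIPT_CJK, SCRIPT_OTHER } script_class;

/* Classify one code point using a simplified set of Unicode block ranges. */
static script_class classify_codepoint(uint32_t cp)
{
    if (cp < 0x80)
        return SCRIPT_ASCII;
    if (cp <= 0x024F)                      /* Latin-1 Supplement, Latin Extended-A/B */
        return SCRIPT_LATIN;
    if ((cp >= 0x3040 && cp <= 0x30FF) ||  /* Hiragana, Katakana */
        (cp >= 0x3400 && cp <= 0x4DBF) ||  /* CJK Unified Ideographs Extension A */
        (cp >= 0x4E00 && cp <= 0x9FFF) ||  /* CJK Unified Ideographs */
        (cp >= 0xAC00 && cp <= 0xD7AF))    /* Hangul Syllables */
        return SCRIPT_CJK;
    return SCRIPT_OTHER;
}

/* Decode one UTF-8 sequence; returns the number of bytes consumed.
 * Assumption: the input is valid UTF-8, so no error handling is done. */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) |
              (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) |
          ((uint32_t)(s[1] & 0x3F) << 12) |
          ((uint32_t)(s[2] & 0x3F) << 6) |
          (uint32_t)(s[3] & 0x3F);
    return 4;
}

/* Scan the whole text: any CJK character selects the CJK path at once;
 * otherwise the widest class seen so far wins. */
static script_class classify_text(const char *text)
{
    assert(text != NULL);
    const unsigned char *p = (const unsigned char *)text;
    script_class result = SCRIPT_ASCII;

    while (*p != '\0') {
        uint32_t cp;
        p += decode_utf8(p, &cp);
        script_class c = classify_codepoint(cp);
        if (c == SCRIPT_CJK)
            return SCRIPT_CJK;
        if (c > result)
            result = c;
    }
    return result;
}
```

A caller would then pick the cheap path for `SCRIPT_ASCII` and `SCRIPT_LATIN` text and hand `SCRIPT_CJK` text to a full word-break implementation such as Pango, which is the split the quoted message describes.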