I'm a bit confused. I thought this type of thing was already pretty well covered by the various Unicode resources? (Though I guess there's a strong chance it isn't, if you're asking the question.)

This is the way I see it:

It's for you to decide which format you internally normalise to (I'm not even sure if that's the right word): which specific *base format* you decide to adhere to. (I'm talking about things like whether you treat text in composed or decomposed form, for example.) It doesn't matter which internal base format you choose, so long as you stick to it and never try to compare two texts in different base formats. On top of that you'd also need a way to make use of character mappings, for when you get various versions of characters amounting to the same meaning. There are different levels to that, and decisions for you to make with no right or wrong answer: the extent to which you allow various characters to amount to the same one. (This obviously includes case mappings, for example.)
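Something like this, say (a minimal Python sketch - the choice of decomposed NFD here is arbitrary; composed NFC would do just as well, as long as both sides of a comparison get the same treatment):

    import unicodedata

    def to_base_format(text: str) -> str:
        # Pick one base format (decomposed, i.e. NFD, here) and stick
        # to it; case folding on top makes case variants amount to the
        # same thing.
        return unicodedata.normalize('NFD', text).casefold()

    # Precomposed 'e-acute' vs 'e' + combining acute: unequal as raw
    # strings, but equal once both are put into the same base format.
    assert '\u00e9' != 'e\u0301'
    assert to_base_format('\u00e9') == to_base_format('e\u0301')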

I don't see how language differences come into this. Take the Japanese no-spaces thing you mention: if someone types in a particular phrase in Japanese (therefore without spaces, if that is actually the case), then the search query won't use spaces, and the text being searched won't use spaces either, as it will also be in Japanese.

As for all that 'remove' and 'replace' business: surely you don't have to transform the text? You just have to set up rules (or filters) within the code that say, for example, "any run of tabs and/or spaces = 1 space". If you apply those rules *throughout* - to the text being searched, and to the strings that are entered and searched for - then everything should match up, I think.
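In code, the rule might look like this (a rough Python sketch, with the same filter applied to both sides):

    import re

    def apply_rules(text: str) -> str:
        # "any run of tabs and/or spaces = 1 space"
        return re.sub(r'[ \t]+', ' ', text)

    def matches(haystack: str, needle: str) -> bool:
        # The same filter goes over the searched text and the query,
        # so nothing is ever stored in a transformed state.
        return apply_rules(needle) in apply_rules(haystack)

    assert matches('hello \t  world', 'hello world')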

- replace all dashes with a standard ASCII hyphen-minus

Like that part: I wouldn't replace or change any text in any way. I'd just say in the code that any dash amounts to any other dash (where 'any dash' = what you mean by 'all dashes').


Basically, I wouldn't go about changing characters, just allowing each one to represent an array of characters (including nothing/no characters at all, in some cases, maybe).
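In code that might look something like this (a rough Python sketch: the mapping happens at comparison time via a translation table; the dash characters listed are just examples, and the soft hyphen shows the 'amounts to nothing' case):

    # Each entry says "this character amounts to that one" at
    # comparison time; None means it amounts to no character at all.
    EQUIVALENCES = str.maketrans({
        '\u2010': '-',   # hyphen
        '\u2013': '-',   # en dash
        '\u2014': '-',   # em dash
        '\u2212': '-',   # minus sign
        '\u00ad': None,  # soft hyphen: amounts to nothing
    })

    def comparable(text: str) -> str:
        return text.translate(EQUIVALENCES)

    # An em dash in the text matches a plain hyphen in the query.
    assert comparable('full\u2014text') == comparable('full-text')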


So it's two main basic things: convert to the base format throughout, and set up rules/filters for characters. The rules will make heavy use of data from Unicode (is it the 'properties' data? - the character groupings and mappings), plus a bit more of your own, such as saying that a variably long run of any whitespace amounts to one space - if you want things with varying amounts of space in them to match, that is.
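To illustrate the properties-data point, a sketch using Python's unicodedata module as a stand-in for the Unicode Character Database (the general categories Pd and Zs are dash punctuation and space separators):

    import sys
    import unicodedata

    def build_equivalences() -> dict:
        # Derive the groupings from Unicode property data instead of
        # hand-written lists: every dash-punctuation character (Pd)
        # amounts to '-', every space separator (Zs) to ' '.
        table = {}
        for cp in range(sys.maxunicode + 1):
            cat = unicodedata.category(chr(cp))
            if cat == 'Pd':
                table[cp] = '-'
            elif cat == 'Zs':
                table[cp] = ' '
        return table

    EQUIV = str.maketrans(build_equivalences())
    # A no-break space now amounts to an ordinary space.
    assert 'a\u00a0b'.translate(EQUIV) == 'a b'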



On Friday, June 27, 2003, at 12:46 pm, Philippe Verdy wrote:


In order to implement a plain-text search algorithm, in a language-neutral way that would still work with all scripts, I am searching for advice on how this can be done "safely" (notably for automated search engines), to allow searching for text matching some basic encoding styles.

My first approach to the problem is to try to simplify the text into an indexable form that would unify "similar" characters.
So I'd like to have comments about possible issues in modern languages if I perform the following "search canonicalization":


- Decompose the string into NFKD (this will remove font-related information and isolate combining marks)
- Remove all combining characters (with combining class > 0), including Hebrew and Arabic cantillation.
(are there significant combining vowel signs that should be kept?)
- apply case folding using the Unicode standard (to lowercase preferably)
- possibly perform a closure of the above three transforms
- remove all controls, except TAB, CR, LF, VT, FF
- replace all dashes with a standard ASCII hyphen-minus
- replace all spacing characters with an ASCII space
- replace all other punctuation with spaces.
- canonicalize the remaining spaces (no leading or trailing spaces, and all other sequences replaced with a single space).
- (maybe) recompose Korean Hangul syllables?
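(For concreteness, those steps taken literally might come out something like this rough Python sketch - unicodedata standing in for the Unicode data files; an illustration only, not a tested implementation:)

    import re
    import unicodedata

    KEPT_CONTROLS = set('\t\r\n\v\f')  # TAB, CR, LF, VT, FF

    def canonicalize_for_search(text: str) -> str:
        # Decompose to NFKD: strips compatibility (font) distinctions
        # and isolates the combining marks.
        text = unicodedata.normalize('NFKD', text)
        # Remove all combining characters (combining class > 0).
        text = ''.join(c for c in text if unicodedata.combining(c) == 0)
        # Unicode case folding.
        text = text.casefold()
        out = []
        for c in text:
            cat = unicodedata.category(c)
            if cat == 'Cc' and c not in KEPT_CONTROLS:
                continue             # drop controls except TAB/CR/LF/VT/FF
            elif cat == 'Pd':
                out.append('-')      # any dash -> ASCII hyphen-minus
            elif cat.startswith('Z'):
                out.append(' ')      # any spacing character -> ASCII space
            elif cat.startswith('P'):
                out.append(' ')      # other punctuation -> space
            else:
                out.append(c)
        # Canonicalize the remaining spaces: trim the ends, collapse the rest.
        text = re.sub(r'\s+', ' ', ''.join(out)).strip()
        # NFC recomposes Hangul jamo back into syllables, among other things.
        return unicodedata.normalize('NFC', text)

    assert canonicalize_for_search('Ｆｕｌｌ\u2014Ｔｅｘｔ  Search!') == 'full-text search'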


What are the possible caveats, notably for Japanese, Korean and Chinese, which traditionally do not use spaces?

How can we improve the algorithm for searches in Thai without using a dictionary, so that word breaks can be more easily detected (and marked by inserting an ASCII space)?

Should I insert a space when there's a change of script type (for example in Japanese, between Hiragana, Katakana, Latin and Kanji ideographs)?

Is there an existing and documented conversion table used in plain-text search engines?

Is Unicode working on such a search-canonicalization algorithm?

Thanks for the comments.

-- Philippe.



