RE: Folding algorithm and canonical equivalence

Asmus Freytag Sun, 18 Jul 2004 23:26:27 -0700

At 07:53 PM 7/18/2004, Jony Rosenne wrote:

By this logic, I cannot see why you lump Latin/Greek/Cyrillic together.

Latin/Greek/Cyrillic share the fact that for searches you may want to remove accents, but, except for very unusual circumstances, it's not a good idea to transform text permanently.

If I understand the situation for Hebrew correctly, unpointed Hebrew is quite valid on its own, and the situations where someone might want to use that as a transform are more widespread. If that is true, breaking it out into separate files allows one to take mixed French / Hebrew text and transform the Hebrew while not affecting the French.

The other reason is that, again as far as I can understand this, generic diacritics are not used with Hebrew (except perhaps for some highly technical texts). Therefore it would be easier to specify it as the removal of any marks with the Hebrew script code:

HebrewAccentFolding ; sc = Hebrew & gc=Mn; <null>
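As a rough illustration of what such a rule would do to mixed French / Hebrew text, here is a minimal Python sketch. It is only an approximation under stated assumptions: Python's standard unicodedata module exposes the general category but not the script property, so the check "character name starts with HEBREW" stands in for sc=Hebrew, and the function name is made up for this example.

```python
import unicodedata

def hebrew_accent_folding(text: str) -> str:
    """Map every Hebrew nonspacing mark (gc=Mn) to nothing (<null>).

    Assumption: the character name prefix "HEBREW " is used as a
    stand-in for sc=Hebrew, because unicodedata has no script property.
    """
    return "".join(
        ch for ch in text
        if not (unicodedata.category(ch) == "Mn"
                and unicodedata.name(ch, "").startswith("HEBREW "))
    )

# French "café" (precomposed é) followed by pointed Hebrew "shalom".
mixed = "caf\u00e9 \u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
folded = hebrew_accent_folding(mixed)
```

Here `folded` keeps the é of the French word and contains only the four unpointed Hebrew letters, since the rule never touches marks outside Hebrew.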

I think there should be a single diacritics removal folding, which should be tailorable.


The generic diacritic folding would then be built up as follows:

DiacriticRemoval = AccentFolding + OtherDiacriticFolding + HebrewAccentFolding + ArabicSyriacFolding....

where 'HebrewAccentFolding' is as defined above, OtherDiacriticFolding would be the set remaining in the current DiacriticFolding.txt after canonical decompositions are removed, and ArabicSyriacFolding is defined along the same lines as HebrewAccentFolding.
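To make the layering concrete, here is a hedged Python sketch of how the pieces might be chained. Everything in it is an assumption for illustration: the function names, the use of character-name prefixes in place of the script property, the restriction of the accent step to Latin, Greek and Cyrillic bases, and above all the two-entry table, which is a placeholder and is not taken from DiacriticFolding.txt. Chaining functions is also only an approximation of combining the mapping sets with "+".

```python
import unicodedata

def strip_script_marks(text, prefixes):
    # Drop gc=Mn characters whose name starts with one of the prefixes
    # (name prefix is an assumed stand-in for the script property).
    return "".join(
        ch for ch in text
        if not (unicodedata.category(ch) == "Mn"
                and unicodedata.name(ch, "").startswith(prefixes))
    )

def accent_folding(text):
    # Canonically decompose, then drop marks that follow a
    # Latin/Greek/Cyrillic base; recompose whatever is left.
    out, base_is_lgc = [], False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            if base_is_lgc:
                continue
        else:
            base_is_lgc = unicodedata.name(ch, "").startswith(
                ("LATIN ", "GREEK ", "CYRILLIC "))
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))

# Placeholder entries only; the real set would come from the data file.
OTHER_DIACRITICS = {"\u00f8": "o", "\u0142": "l"}

def other_diacritic_folding(text):
    return "".join(OTHER_DIACRITICS.get(ch, ch) for ch in text)

def hebrew_accent_folding(text):
    return strip_script_marks(text, ("HEBREW ",))

def arabic_syriac_folding(text):
    return strip_script_marks(text, ("ARABIC ", "SYRIAC "))

def diacritic_removal(text):
    # The generic label: apply every piece in turn.
    for fold in (accent_folding, other_diacritic_folding,
                 hebrew_accent_folding, arabic_syriac_folding):
        text = fold(text)
    return text
```

A caller who only wants unpointed Hebrew would invoke hebrew_accent_folding directly, while diacritic_removal gives the single generic entry point.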

Voilà, you have your generic label, DiacriticRemoval, to invoke, but the pieces are still accessible in reasonable chunks.

A./
