At 01:02 AM 7/10/2004, Marcin 'Qrczak' Kowalczyk wrote:
But there are cases when I would prefer to fold Polish diacritics in
searches.

It's basically every case when you are not sure that all stored data is
using diacritics,

Or when you are unsure how something is spelled, for example when looking up a personal or geographic name you are not familiar with.


The discussion started around the case where searching is not localized (tailored) to the language, which, by definition, means that users will not be familiar with the spelling of the items they are trying to retrieve.

If one wants to find data containing a word, rather than collect
statistics about usage of a word with and without diacritics, it's very
rare that folding does some harm.

Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
I will be happy to find occurrences of JEZYK too (non-existing word,
must have had diacritics stripped), but it makes no sense to return
JEŻYK (another existing word). It's not just making the letters
equivalent.
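
The asymmetry in that example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something proposed in the thread: the helper names `strip_marks` and `asymmetric_match` are made up, "stripping" is modelled as removing combining marks after canonical decomposition, and both strings are assumed to use precomposed letters. A stored letter is accepted if it equals the query letter or the query letter with its diacritic removed, so a stored letter carrying a different diacritic is rejected.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def asymmetric_match(query: str, stored: str) -> bool:
    """Hypothetical helper: True if `stored` is `query`, possibly with
    some of the query's diacritics stripped, but with none added."""
    q = unicodedata.normalize("NFC", query)
    s = unicodedata.normalize("NFC", stored)
    if len(q) != len(s):
        return False
    # Each stored letter must be the query letter itself or its stripped form.
    return all(sc == qc or sc == strip_marks(qc) for qc, sc in zip(q, s))


print(asymmetric_match("JĘZYK", "JEZYK"))  # stored form lost its diacritic
print(asymmetric_match("JĘZYK", "JEŻYK"))  # stored form is a different word
```

Under this rule a query typed without diacritics matches only text stored without them; whether such a query should also fold is a separate policy choice. Letters such as Ł, which have no canonical decomposition, would need an explicit mapping.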

There are other types of searches than 'google'. One example is searching for station names on services such as http://www.bahn.de. Unlike on air-travel sites, the number of destinations (all across Europe, by the way) is huge, as the site also includes commuter train services.


They've changed their search algorithm a number of times over the years, but at one time, you could enter a destination without diacritics and it would attempt to match that to the list of known station names. In case of multiple hits it would give you a list to pick from. They also supported alternative non-native names (such as Cologne). I haven't used it in a while, so I don't know what they support today, but when I did, I found it very useful in looking up destinations.
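
A lookup of that kind might be sketched as follows. This is a guess at the general idea, not a description of how the site actually works: the station list and alias table are a tiny illustrative sample, the names `fold` and `lookup` are invented, and matching is reduced to a prefix test on folded names.

```python
import unicodedata


def fold(text: str) -> str:
    """Case-fold and remove combining marks, giving a diacritic-free key."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Illustrative sample only, not data taken from the site.
STATIONS = [
    "Köln Hbf",
    "Köln Süd",
    "München Hbf",
    "Munster (Örtze)",
    "Münster (Westf) Hbf",
]
# Alternative non-native names mapped to a station name (assumed table).
ALIASES = {"Cologne": "Köln Hbf", "Munich": "München Hbf"}


def lookup(query: str) -> list[str]:
    """Return every station whose folded name, or folded alias, starts
    with the folded query; the caller shows the list when it has several hits."""
    key = fold(query)
    hits = [name for name in STATIONS if fold(name).startswith(key)]
    for alias, target in ALIASES.items():
        if fold(alias).startswith(key) and target not in hits:
            hits.append(target)
    return hits


print(lookup("koln"))     # two stations in the same city
print(lookup("munster"))  # two different places that fold to the same key
print(lookup("Cologne"))  # resolved through the alias table
```

The "munster" query shows why a pick list is needed: once diacritics are folded, distinct names collide, and only the user can say which one was meant.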

I have a certain sympathy for the idea of designing UCA so that the untailored *default* works for this kind of multilingual usage. However, the other use of the DUCET is to be the most convenient base for applying all tailorings. I also have some sympathy for the position that there are important, but perhaps specialized or not economically powerful, classes of users who are unlikely to have access to a tailored UCA for their language or writing system.

If that is really the case, i.e. if appreciable numbers of smaller languages would be able to survive without tailoring, then the alternative to fixing the DUCET could be a separate publication of a common base tailoring for multilingual data access. (A base tailoring would be applied before further tailoring for a specific language.)
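
The layering can be pictured with a toy model. Everything in it is an assumption made for illustration: the weights are invented and are not DUCET values, the table names are hypothetical, and a real collation table is far richer than a character-to-weights dictionary. The only point is the order of application: default table first, then the common base tailoring, then an optional language tailoring on top.

```python
from collections import ChainMap

# Toy default table: character -> (primary, secondary) weight. Invented values.
DEFAULT = {"l": (10, 0), "ł": (11, 0), "m": (12, 0)}

# Hypothetical base tailoring for multilingual access:
# ł shares a primary weight with l and differs only at the secondary level.
BASE_MULTILINGUAL = {"ł": (10, 1)}

# Hypothetical language tailoring applied on top of the base tailoring:
# ł becomes a letter of its own again.
POLISH = {"ł": (11, 0)}


def build_table(*tailorings):
    """Layer tailorings over DEFAULT; later tailorings override earlier ones."""
    # ChainMap looks in its first mapping first, so reverse the order.
    return ChainMap(*reversed(tailorings), DEFAULT)


def primary_key(text, table):
    """Primary-level key: compares equal when only lower levels differ."""
    return tuple(table[ch][0] for ch in text)


multilingual = build_table(BASE_MULTILINGUAL)
polish = build_table(BASE_MULTILINGUAL, POLISH)

print(primary_key("ł", multilingual) == primary_key("l", multilingual))
print(primary_key("ł", polish) == primary_key("l", polish))
```

In this model a user without a language tailoring still gets the multilingual behaviour, while a language tailoring only has to state where it differs from the base.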

A./






