Re: Hanzi trad-simp folding and z-variants
On Sun, Jun 9, 2013 at 1:26 PM, Stephan Stiller stephan.stil...@gmail.com wrote:

>> Though there is some confusion as to what other questions are being discussed here.
>
> I think I misused the expression folding at some point. But the original query explicitly asked about do[ing] traditional to simplified folding for indexing and query processing (*when the mapping is unambiguous*) (emph added) so I wasn't really sure where parts of the discussion were going :-)

No problem.

>> Japanese has well-established traditions for simplifying CJK ideographs which are not identical to the Chinese ones; if one were to use a folding approach to deal with simplifications, then there should be differences between Chinese and Japanese.
>
> I think the kyūjitai-shinjitai mappings are not in Unihan. (Compare the entries of 廣 (U+5EE3) and the characteristically Japanese character 広 (U+5E83).) I know that certain contexts retain older forms (KenL talks about this somewhere too). Btw if you know about other mappings or good resources, I'll be curious to know.

No, but of course I am also interested to know what is available.

>> quite well documented is a relative term
>
> I highly respect the work in Cheung-Bauer, but it makes no attempt to tell us how easily understood the characters are. Many of them are ad-hoc coinages that are not understood by any of my informants; sometimes for say 6 ways of writing a syllable-morpheme, I can make my informants tell me that perhaps *one* of them is passable. This problem isn't easily solved, but then the source isn't helpful in knowing which out of the approx 1000 characters are actually used nowadays. I won't give you a number, as I'd have to check more carefully to be quotable. The number of morphemes for which there truly seems to be no written representation is *very* low, but often the characters in existence aren't exactly comprehensible to many native speakers either, and not all of them are unambiguous. This will give you an idea.

It documents 1,095 different Cantonese characters.

Familiarity with a writing system makes the non-obvious parts comprehensible, as can context. Some Cantonese characters, as with Sawndip, tend by their construction to be ambiguous: the construction often means 'something which sounds like this known character', and therefore the meaning must be learned.

>> Zhuang Sawndip
>
> Sounds exciting.

Yes, no shortage of new material to get one's teeth into.

>> By best choice do you mean (a) the person producing the electronic form was unable to use the character they wished because it is not yet in Unicode, (b) even though it is in Unicode, the person did not know how to type it and so typed another character instead, or (c) a less than perfect, or ambiguous, 'spelling'? All of these are found both for Sinitic languages and non-Sinitic languages when written in CJK ideographs, be it printed publications, web pages or text messages between native speakers.
>
> Nearly all of Cantonese is in Unicode and therefore typeable in theory (though some people will not be used to such writing, but I'm sure you know this), so it's not (a). I would say it's largely (c) (people will often make up their own plausible thing), even though (b) is a reason too.

Many smart phones, whilst having the infrastructure, lack either the IME or the font for Cantonese characters in the SIP. For Zhuang Sawndip, Unicode support is very lacking at present: on average over 10% of the text on a page uses characters not yet in Unicode (a), and about 2% of the text comes from the SIP, so typing is often a challenge for many (b).

>> Not standardized does not mean totally beyond analysis or processing, or even necessarily that confusing to a native speaker; they are not random, though admittedly more complex than a standardized locale.
>
> Yes. And we both agree that standardization is desirable.

Yes.

John

> Stephan
Re: Hanzi trad-simp folding and z-variants
> Familiarity with a writing system makes the non-obvious parts comprehensible, as can context.

The work is a thorough listing of usage instances that the authors could encounter in the wild. My informants can't recall ever having seen many of these characters. They wouldn't use them, and the fact that they can recognize them with sufficient context doesn't by itself mean the characters should be regarded as normative in any way.

> Some Cantonese characters, as with Sawndip, tend by their construction to be ambiguous: the construction often means 'something which sounds like this known character', and therefore the meaning must be learned.

Many characters that can be and are used for Cantonese, including both those that are used for Mandarin and those that aren't, have more than one pronunciation. Many of those in the latter category, and even those with only a single pronunciation in some sort of vague prescriptive sense, are used approximately, for their phonetic value. For those that aren't standardized, it's unclear to what extent there is 'knowledge' to learn, as this knowledge hasn't yet stabilized.

> Many smart phones, whilst having the infrastructure, lack either the IME or the font for Cantonese characters in the SIP.

Most of the Cantonese that's commonly used and recognized is typeable with Cangjie or handwriting (pen-stroke) recognition. A huge part of HKSCS isn't actually known by the general public. Present-day usage is also defined by what's typeable. So it's a two-way interaction. I don't know about CN-based smartphones, though.

Stephan
Re: Hanzi trad-simp folding and z-variants
On Sun, Jun 9, 2013 at 4:18 PM, Stephan Stiller stephan.stil...@gmail.com wrote:

>> Some Cantonese characters, as with Sawndip, tend by their construction to be ambiguous: the construction often means 'something which sounds like this known character', and therefore the meaning must be learned.
>
> Many characters that can be and are used for Cantonese, including both those that are used for Mandarin and those that aren't, have more than one pronunciation. Many of those in the latter category, and even those with only a single pronunciation in some sort of vague prescriptive sense, are used approximately, for their phonetic value. For those that aren't standardized, it's unclear to what extent there is 'knowledge' to learn, as this knowledge hasn't yet stabilized.

For me 'non-standardized' means there is not one recognized standard. This does not mean that things are completely unstable, nor that there are no traditions, passed down for many generations, of which character is used for which word.

>> Many smart phones, whilst having the infrastructure, lack either the IME or the font for Cantonese characters in the SIP.
>
> Most of the Cantonese that's commonly used and recognized is typeable with Cangjie or handwriting (pen-stroke) recognition. A huge part of HKSCS isn't actually known by the general public. Present-day usage is also defined by what's typeable. So it's a two-way interaction. I don't know about CN-based smartphones, though.

In terms of both the range of characters in installed fonts and the IMEs, many smart phones are quite a long way behind computers at present. Mandarin has quite good support; however, IICore, which includes some SIP Cantonese characters, does not seem to be the criterion for many smart phones, whose Chinese fonts tend to cover just the BMP.

Regards
John

> Stephan
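To put a number on how much of a text a BMP-only font would fail to display, one could count code points by plane. The following is only a rough sketch: the function name `plane_shares` is made up for this illustration, whitespace is ignored, and "SIP" is taken to mean plane 2 (U+20000 to U+2FFFF).

```python
def plane_shares(text):
    """Return the share of non-whitespace characters per plane group.

    Keys: "bmp" (U+0000..U+FFFF), "sip" (plane 2), "other" (any other plane).
    """
    chars = [c for c in text if not c.isspace()]
    counts = {"bmp": 0, "sip": 0, "other": 0}
    if not chars:
        return {key: 0.0 for key in counts}
    for c in chars:
        cp = ord(c)
        if cp <= 0xFFFF:
            counts["bmp"] += 1
        elif 0x20000 <= cp <= 0x2FFFF:
            counts["sip"] += 1
        else:
            counts["other"] += 1
    return {key: n / len(chars) for key, n in counts.items()}


if __name__ == "__main__":
    # Two BMP ideographs followed by two Extension B (plane 2) code points.
    sample = "\u4e2d\u6587\U00020000\U0002A6D6"
    print(plane_shares(sample))
```

A share like this says nothing about characters that are not in Unicode at all, which is case (a) above; those would have to be counted from the source material by hand.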
Re: Hanzi trad-simp folding and z-variants
> For me 'non-standardized' means there is not one recognized standard. This does not mean that things are completely unstable, nor that there are no traditions, passed down for many generations, of which character is used for which word.

/As I stated/, for a decent number of syllable-morphemes (probably the /majority/ of Cheung-Bauer entries shouldn't be considered active or passive knowledge), native speakers will have no clue how to write them, and the array of characters to choose from (if CB is used for a forced-choice task), or often a good portion of the array, either appears unsatisfactory to them or is seen as okay but previously unknown. Native speakers have no problem approximating these syllables otherwise if pressed, but, yes, things for those syllables are not that stable, and if there are stable traditions, they might not be well-known except for a low percentage of CB entries – definitely less than half, but I don't want to commit to a specific number.

Nonetheless, both type and token frequency of such syllable-morphemes are low.

Stephan
Re: Hanzi trad-simp folding and z-variants
On Sun, Jun 9, 2013 at 5:56 PM, Stephan Stiller stephan.stil...@gmail.com wrote:

>> For me 'non-standardized' means there is not one recognized standard. This does not mean that things are completely unstable, nor that there are no traditions, passed down for many generations, of which character is used for which word.
>
> *As I stated*, for a decent number of syllable-morphemes (probably the *majority* of Cheung-Bauer entries shouldn't be considered active or passive knowledge), native speakers will have no clue how to write them, and the array of characters to choose from (if CB is used for a forced-choice task), or often a good portion of the array, either appears unsatisfactory to them or is seen as okay but previously unknown. Native speakers have no problem approximating these syllables otherwise if pressed, but, yes, things for those syllables are not that stable, and if there are stable traditions, they might not be well-known except for a low percentage of CB entries – definitely less than half, but I don't want to commit to a specific number.

Yes. Given the way the Cheung-Bauer list was compiled, it is certainly hard to see how most of the characters would be widely known. With Zhuang Sawndip I have been examining texts from different locations and eras, and find that there is evidence of transmission from generation to generation, of progression, and also of instability.

Regards
John

> Nonetheless, both type and token frequency of such syllable-morphemes are low.
>
> Stephan
Re: Hanzi trad-simp folding and z-variants
> Given the way the Cheung-Bauer list was compiled, it is certainly hard to see how most of the characters would be widely known.

I'd need to look at CB again for accurate numbers, but to some extent it's simply because some syllable-morphemes are listed with many different attested possibilities. So one really wouldn't expect to need all ≈1000 characters in there. There is a tricky aspect to this, though: the left-addition of o (or a mouth radical) leaves the exact number a bit open and allows for a larger count. Do you write some Cantonese-only syllable-morpheme as X or ⿰口X/oX? (Most of the latter combinations are in fact in CB, but, anyways, it's hard to give a precise answer to the "how many Cantonese characters" question.)

Here is an example: 嚿 vs 舊 for the measure word gau6 (lump). Depending on whom you ask, you might even find a strong opinion. Most people will probably say that 嚿 is better, but the fact that you find 舊 (because it's more straightforward to type) means that in a way it's descriptively correct. There are cases where the variant without a mouth would be regarded as more common or natural, because the version with a mouth radical is typographically rare.

> With Zhuang Sawndip I have been examining texts from different locations and eras, and find that there is evidence of transmission from generation to generation, of progression, and also of instability.

Just curious: what is a rough character count?

Stephan
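For retrieval purposes, the 嚿 vs 舊 situation could be handled by treating the two spellings as one search class. What follows is a minimal sketch only: the table `MOUTH_VARIANTS` and both function names are hypothetical, and the single entry is the example from this message rather than data taken from Cheung-Bauer or Unihan.

```python
# Hypothetical hand-tuned table: form with mouth radical -> form without it.
MOUTH_VARIANTS = {"嚿": "舊"}


def variant_set(ch):
    """Return the set of spellings treated as equivalent to `ch`."""
    group = {ch}
    for with_mouth, bare in MOUTH_VARIANTS.items():
        if ch in (with_mouth, bare):
            group.update((with_mouth, bare))
    return group


def matches(query_char, text):
    """True if `text` contains the query character or one of its variants."""
    return any(variant in text for variant in variant_set(query_char))


if __name__ == "__main__":
    # A search for the "proper" spelling also finds the easier-to-type one.
    print(matches("嚿", "一舊石"))
```

The obvious cost is precision: 舊 is also the ordinary character for "old", so a search for 嚿 expanded this way will pull in unrelated text unless the surrounding words are taken into account.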
Re: Hanzi trad-simp folding and z-variants
On Sun, Jun 9, 2013 at 7:29 PM, Stephan Stiller stephan.stil...@gmail.com wrote:

>> Given the way the Cheung-Bauer list was compiled, it is certainly hard to see how most of the characters would be widely known.
>
> I'd need to look at CB again for accurate numbers, but to some extent it's simply because some syllable-morphemes are listed with many different attested possibilities. So one really wouldn't expect to need all ≈1000 characters in there. There is a tricky aspect to this, though: the left-addition of o (or a mouth radical) leaves the exact number a bit open and allows for a larger count. Do you write some Cantonese-only syllable-morpheme as X or ⿰口X/oX? (Most of the latter combinations are in fact in CB, but, anyways, it's hard to give a precise answer to the "how many Cantonese characters" question.)
>
> Here is an example: 嚿 vs 舊 for the measure word gau6 (lump). Depending on whom you ask, you might even find a strong opinion. Most people will probably say that 嚿 is better, but the fact that you find 舊 (because it's more straightforward to type) means that in a way it's descriptively correct. There are cases where the variant without a mouth would be regarded as more common or natural, because the version with a mouth radical is typographically rare.

>> With Zhuang Sawndip I have been examining texts from different locations and eras, and find that there is evidence of transmission from generation to generation, of progression, and also of instability.
>
> Just curious: what is a rough character count?

There are a number of dialects, which pushes the numbers up a little. The only published dictionary has just over ten thousand characters, of which just over half are not in Unicode yet. The count of Sawndip I have from different texts and research published in China is currently around twenty thousand, with ten thousand not in Unicode. However, the currently published material represents only a fraction of the whole.

My best estimate is that the total number of Sawndip currently in circulation is 50 to 100 thousand, of which 20 to 30 thousand are presently in Unicode.

John

> Stephan
Re: Hanzi trad-simp folding and z-variants
On Sat, Jun 8, 2013 at 11:55 AM, Stephan Stiller stephan.stil...@gmail.com wrote:

>> simplified [is] better thought of as abbreviated
>
> Part of this is a terminological argument. The historical situation is indeed more complicated than many people know, but the truth is also that irrespective of eg people's past or present usage in handwriting there have (in the past and esp in the present) been printing traditions which you can pinpoint by political region and time, occasionally by publisher. Regardless of what exactly happened during the pre-simplification era, there are fairly stable traditions now.

I was merely offering an alternative translation of 简体. As you say, the historical situation is complex; however, Simplified, as in the standard used in mainland China, is well defined. The situation also tends to be complex once one steps outside of Putonghua.

>> [quote approximate and adapted:] a []fully simplified[] passage of text will contain[] both simplified characters and those which have not been simplified [...] and therefore [be] tagged as traditional.
>
> This depends on the algorithm used for tagging. And note that tagging doesn't in fact have to be a *binary* classifier.†

Tautological; however, the original email was referring to using such a binary tagging system.

>> working at character level is not the best way to go for your purposes; larger units such as words or phrases produce much more meaningful results, as this mimics the way a person reads Chinese: they do not process one character at a time, but rather word by word.
>
> I don't think JohnB was suggesting character-based retrieval. (I mean, who in his right mind would want to do letter-based (and post–case folding) retrieval for English documents? :-) Okay – just a joke, this analogy isn't any good.) But of course you're right to point out that simplification or the reverse operation (what's the term for that? T-conversion maybe?) is word- and context-dependent on the edges.

My point here was that folding based on a character-by-character traditional-to-simplified model would make accurate word-based retrieval from the resulting text not easier but harder.

> A different point: I'm not suggesting imprecision, but people are partly used to this in text they've seen converted by those horrible tools you can find online for that purpose, and for some characters, people won't actually notice.

>> Whilst the kZVariant field does mean that characters can be, and frequently are, transposed
>
> What do you mean by transposed? Could you give an example?

By transposed I mean that characters can sometimes be changed when going between different traditions and locales; it is not a one-way street.

>> it does not tell you when. Also, as said above, the probability is that you have ordinary Chinese text written in the mainland style; folding based on the kZVariant field would either leave things unchanged or, if it changed things, would misspell words. That is, the sounds, or in some cases the appearance, would probably be similar, or homophones, but the result would not match any dictionaries.
>
> But if all occurrences of everything you process are folded (folding to lower-case is often done in NLP), this isn't a problem. Again, I'm not recommending this as best practice, I'm just pointing it out.

>> There are Chinese compatibility characters in Unicode which, if present, it would probably be good to fold in, but these are not in the scope of UniHan.

My earlier statement about UniHan and compatibility variants was not correct: UniHan does have a kCompatibilityVariant field.

> And you remind me that z-variation is locale-dependent (see also † above). Anyways, I think it's hard to find examples of meaning-divergent z-variant words in modern Mandarin (MSM). I'm sure you or someone else will be able to quickly dig out examples, but really the question is what set of algorithms and data structures is best to address the general situation. Have locale-dependent folding tables? Allow a search term prefix that specifies "don't normalize or fold the following term"? Have secondary filters in your search that use a stricter model of character identity?

http://www.unicode.org/reports/tr38/ gives a good summary of the possibilities. Trying to fold from one locale to another, which is what folding from traditional to simplified would be, is not a good idea; best practice is to bear in mind the locale being used, and to do information retrieval on a locale-by-locale basis.

Regards
John Knightley

> Stephan
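Two of the options raised in the quoted questions, locale-dependent folding tables and a prefix that switches folding off for one term, might look roughly as follows. This is a sketch under stated assumptions: the table contents are a tiny illustration (廣 folds to 广 for simplified Chinese and to 広 for Japanese), and the `=` prefix, the locale keys and the function names are all invented for this example.

```python
# Hypothetical per-locale folding tables; real ones would be far larger.
FOLD_TABLES = {
    "zh-Hans": {"廣": "广", "東": "东"},
    "ja": {"廣": "広"},
}

# Hypothetical marker meaning "do not fold the following term".
NO_FOLD_PREFIX = "="


def fold(text, locale):
    """Fold `text` character by character using the table for `locale`."""
    table = FOLD_TABLES.get(locale, {})
    return "".join(table.get(ch, ch) for ch in text)


def normalize_query_term(term, locale):
    """Fold a query term unless it carries the no-fold prefix."""
    if term.startswith(NO_FOLD_PREFIX):
        return term[len(NO_FOLD_PREFIX):]
    return fold(term, locale)


if __name__ == "__main__":
    print(fold("廣東", "zh-Hans"))
    print(fold("廣東", "ja"))
    print(normalize_query_term("=廣東", "zh-Hans"))
```

Note that this is exactly the character-by-character approach criticized above, so it is only reasonable where every table entry is unambiguous within its locale; word-level or context-dependent conversion needs a different structure.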
Re: Hanzi trad-simp folding and z-variants
> http://www.unicode.org/reports/tr38/ gives a good summary of the possibilities.

Which and where?

> Trying to fold from one locale to another, which is what folding from traditional to simplified would be, is not a good idea; best practice is to bear in mind the locale being used, and to do information retrieval on a locale-by-locale basis.

What do you mean? Put simply: Either you don't let someone search a TW database with simplified characters or you convert either the search terms or the searched documents internally for the duration of your search – or some combination of these options. It is not at all obvious to me what the fastest way in a big data context is. There's gotta be research about this.

Stephan
Re: Hanzi trad-simp folding and z-variants
> The situation also tends to be complex once one steps outside of Putonghua.

Given that the situation there is one of a lack of standardization (and a lack of tables laying out variant spellings), I don't think anything other than radical, hand-tuned folding to cover all possibilities is sensible to query a dialectal database.

Stephan
Re: Hanzi trad-simp folding and z-variants
As far as general folding is concerned, performing conversion (whether it's word-based or not and even if it's locale-tailored) and then a strict search will let you miss out on the z-variation you find in the wild (because of true variation or of misspellings), and a more generous inclusion of z-variation is in fact unlikely to give you false matches (normally different words don't merely differ on the z-axis, though I seem to remember having seen an example involving the name of a historical term somewhere).

You are right about this point

> My point here was that folding based on a character-by-character traditional-to-simplified model would make accurate word-based retrieval from the resulting text not easier but harder.

and the note on transposition. But I also don't think this is the end of the story: If you strictly convert on a word level, you will miss (note that this point is different from what's in my first paragraph above) those search results where your contextual conversion heuristic was wrong. Perhaps a Classical Chinese character collocation agrees with a modern Chinese term in simplified spelling but should be converted directly instead of transposed when going from CN to TW. So for that you'd need some sort of n-way expansion of a search query. I don't have an example off the top of my head, but I don't think this scenario is unrealistic at all.

Stephan
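The n-way expansion of a search query mentioned above can be pictured as a Cartesian product over per-character alternatives. The sketch below is illustrative only: `CANDIDATES` is a made-up two-entry table (the simplified character 发 corresponds to both 發 and 髮 in traditional text), and `expand_query` is a hypothetical helper rather than part of any search engine.

```python
from itertools import product

# Hypothetical table: simplified character -> candidate traditional forms.
CANDIDATES = {
    "头": ["頭"],
    "发": ["發", "髮"],
}


def expand_query(query):
    """Return every spelling of `query` obtained by choosing, for each
    character, either the character itself or one of its candidates."""
    per_char = []
    for ch in query:
        options = [ch] + CANDIDATES.get(ch, [])
        # dict.fromkeys removes duplicates while preserving order.
        per_char.append(list(dict.fromkeys(options)))
    return ["".join(combo) for combo in product(*per_char)]


if __name__ == "__main__":
    # The expanded terms would then be OR-ed together in the actual search.
    for term in expand_query("头发"):
        print(term)
```

The number of expansions grows multiplicatively with query length, so a practical system would probably prune combinations that never occur in the index or expand at word level instead of character level.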
Re: Hanzi trad-simp folding and z-variants
On Sat, Jun 8, 2013 at 4:02 PM, Stephan Stiller stephan.stil...@gmail.com wrote:

>> http://www.unicode.org/reports/tr38/ gives a good summary of the possibilities.
>
> Which and where?

Section 3.7.1, Simplified and Traditional Chinese Variants, talks about converting between Simplified and Traditional Chinese.

>> Trying to fold from one locale to another, which is what folding from traditional to simplified would be, is not a good idea; best practice is to bear in mind the locale being used, and to do information retrieval on a locale-by-locale basis.
>
> What do you mean? Put simply: Either you don't let someone search a TW database with simplified characters or you convert either the search terms or the searched documents internally for the duration of your search – or some combination of these options. It is not at all obvious to me what the fastest way in a big data context is. There's gotta be research about this.
>
> Stephan
Re: Hanzi trad-simp folding and z-variants
On Sat, Jun 8, 2013 at 4:05 PM, Stephan Stiller stephan.stil...@gmail.com wrote:

>> The situation also tends to be complex once one steps outside of Putonghua.
>
> Given that the situation there is one of a lack of standardization (and a lack of tables laying out variant spellings), I don't think anything other than radical, hand-tuned folding to cover all possibilities is sensible to query a dialectal database.

Some dialects such as Cantonese are quite well documented; simplification is also found in, for example, Japanese CJK ideographs, which is documented. There has been an increased interest in such things in recent years. One person's 'hand-tuned' of today can become the basis of a standard of tomorrow.

John

> Stephan
Re: Hanzi trad-simp folding and z-variants
I.

>> Which and where?
>
> Section 3.7.1, Simplified and Traditional Chinese Variants, talks about converting between Simplified and Traditional Chinese.

You wrote this

> http://www.unicode.org/reports/tr38/ gives a good summary of the possibilities.

in response to my inquiry about examples of meaning-divergent z-variant words in modern Mandarin and appropriate algorithms and data structures. Also, the Unihan database doesn't provide collocational data for T/S conversion.

II.

> simplification is also found in, for example, Japanese CJK ideographs, which is documented

Contextual conversion (and shifting/transposition) is essentially not an issue in this context, even though you have an odd case of deviation here and there.

> Some dialects such as Cantonese are quite well documented [and] There has been an increased interest in such things in recent years. One person's 'hand-tuned' of today can become the basis of a standard of tomorrow.

1a. I'd say I have a decent grasp of the topic of lexical variation for written Cantonese, based on a decent amount of fieldwork. (While we're at it, I also know at least one researcher with an interest in standardization of Cantonese spelling.) I'm certain that lexical variation in Cantonese is not well-documented, though there are a bunch of sources from which you can scrape your own thing together.

1b. Keep in mind that most materials in electronic form (originally written in this form or digitized) don't use the best character choices – needless to say it's gotta be even truer for other Sinitic languages.

2. This is entirely unrelated to the question of whether one can or should describe simplified characters as abbreviated. There is a connection to your statement about things being on a sliding scale (you used the word relative), but for Cantonese it's more like this translates into a lot of inconsistency between using genuine C spelling, an M substitute, a C-based phonetic transcription, ad-hoc usage using the mouth radical or a prefixed roman o, an English-based informal transcription using Latin letters, and avoidance. Whether this is electronically manageable in principle depends on whether you include entirely romanized blogs (which I wouldn't recommend), but – in any case – anything other than liberal QE (query expansion) will /not/ work. (I might previously have misused the word folding to mean conversion.)

3. Other Sinitic languages are essentially not at all standardized (we're talking Chinese characters here, not romanizations). Last time I checked it seemed like Taiwanese is a total mess, and Shanghainese has a (mainland-CN) researcher who is (still) writing a dictionary to actually find or document written representations of all syllable-morphemes to capture all of SHnese. The best SHnese textbook was published a couple of years ago in HK and uses traditional characters (!) to represent modern SHnese.

Stephan
Re: Hanzi trad-simp folding and z-variants
Better word choice: "lexical variation" → "orthographic variation" (in my previous email).
Re: Hanzi trad-simp folding and z-variants
On Sat, Jun 8, 2013 at 9:00 PM, Stephan Stiller stephan.stil...@gmail.com wrote:

> I.
>
>>> Which and where?
>>
>> Section 3.7.1, Simplified and Traditional Chinese Variants, talks about converting between Simplified and Traditional Chinese.
>
> You wrote this
>
>> http://www.unicode.org/reports/tr38/ gives a good summary of the possibilities.
>
> in response to my inquiry about examples of meaning-divergent z-variant words in modern Mandarin and appropriate algorithms and data structures. Also, the Unihan database doesn't provide collocational data for T/S conversion.

So we both agree that Unihan is not designed to tell people how to convert between traditional and simplified characters. Though there is some confusion as to what other questions are being discussed here.

> II.
>
>> simplification is also found in, for example, Japanese CJK ideographs, which is documented
>
> Contextual conversion (and shifting/transposition) is essentially not an issue in this context, even though you have an odd case of deviation here and there.

Japanese has well-established traditions for simplifying CJK ideographs which are not identical to the Chinese ones; if one were to use a folding approach to deal with simplifications, then there should be differences between Chinese and Japanese.

>> Some dialects such as Cantonese are quite well documented [and] There has been an increased interest in such things in recent years. One person's 'hand-tuned' of today can become the basis of a standard of tomorrow.
>
> 1a. I'd say I have a decent grasp of the topic of lexical variation for written Cantonese, based on a decent amount of fieldwork. (While we're at it, I also know at least one researcher with an interest in standardization of Cantonese spelling.) I'm certain that lexical variation in Cantonese is not well-documented, though there are a bunch of sources from which you can scrape your own thing together.

'Quite well documented' is a relative term: after Mandarin, Cantonese is one of the better documented of the Chinese dialects, and better documented than the use of CJK ideographs for other languages such as, say, Zhuang Sawndip, my primary area of research. That is not to say there is not more work to be done in this area for Cantonese.

> 1b. Keep in mind that most materials in electronic form (originally written in this form or digitized) don't use the best character choices – needless to say it's gotta be even truer for other Sinitic languages.

By best choice do you mean (a) the person producing the electronic form was unable to use the character they wished because it is not yet in Unicode, (b) even though it is in Unicode, the person did not know how to type it and so typed another character instead, or (c) a less than perfect, or ambiguous, 'spelling'? All of these are found both for Sinitic languages and non-Sinitic languages when written in CJK ideographs, be it printed publications, web pages or text messages between native speakers.

> 2. This is entirely unrelated to the question of whether one can or should describe simplified characters as abbreviated. There is a connection to your statement about things being on a sliding scale (you used the word relative), but for Cantonese it's more like this translates into a lot of inconsistency between using genuine C spelling, an M substitute, a C-based phonetic transcription, ad-hoc usage using the mouth radical or a prefixed roman o, an English-based informal transcription using Latin letters, and avoidance. Whether this is electronically manageable in principle depends on whether you include entirely romanized blogs (which I wouldn't recommend), but – in any case – anything other than liberal QE (query expansion) will *not* work. (I might previously have misused the word folding to mean conversion.)

The 'this' here is not too clear to me. However, the features you describe for Cantonese are also found in Zhuang texts; these were, however, not what I meant by 'abbreviated'. As to variants in general, yes, the scale is wide, and to a degree dependent upon the locale. Perhaps my email was not clear either; however, I think we were using folding in the same way, namely as a step to be taken before searching based on a word list or dictionary, conversion to a romanized script, or text to speech.

> 3. Other Sinitic languages are essentially not at all standardized (we're talking Chinese characters here, not romanizations). Last time I checked it seemed like Taiwanese is a total mess, and Shanghainese has a (mainland-CN) researcher who is (still) writing a dictionary to actually find or document written representations of all syllable-morphemes to capture all of SHnese. The best SHnese textbook was published a couple of years ago in HK and uses traditional characters (!) to represent modern SHnese.

Not standardized does not mean totally beyond analysis or processing, or even necessarily that confusing to a native speaker; they are not random, though admittedly more complex than a standardized locale.
Re: Hanzi trad-simp folding and z-variants
So we both agree that Unihan is not designed to tell people how to covert between traditional and simplified characters. Yep. Though some confusion as what other questions are being discussed here. I think I misused the expression folding at some point. But the original query explicitly asked about do[ing] traditional to simplified folding for indexing and query processing (/when the mapping is unambiguous/) (emph added) so I wasn't really sure where parts of the discussion were going :-) Japanese has well established traditions for simplifying CJK ideographs which are not identical to Chinese if one was to use a folding approach to deal with simplifications then there should be differences for Chinese and Japanese. I think the kyūjitai-shinjitai mappings are not in Unihan. (Compare the entries of 廣 (U+5EE3) and the characteristically Japanese character 広 (U+5E83).) I know that certain contexts retain older forms (KenL talks about this somewhere too). Btw if you know about other mappings or good resources, I'll be curious to know. quite well documented is a relative term I highly respect the work in Cheung Bauer, but it makes no attempt to tell us how easily understood the characters are. Many of them are ad-hoc coinages that are not understood by any of my informants; sometimes for say 6 ways of writing a syllable-morpheme, I can make my informants tell me that perhaps /one/ of them is passable. This problem isn't easily solved, but then the source isn't helpful in knowing which out of the approx 1000 characters are actually used nowadays. I won't give you a number, as I'd have to check more carefully to be quotable. The number of morphemes for which there truly seems to be no written representation is /very/ low, but often the characters in existence aren't exactly comprehensible to many native speakers either, and not all of them are unambiguous. This will give you an idea. Zhuang Sawndip Sounds exciting. 
By best choice do you mean (a) the person producing the electronic form was unable to use the character they wished because it is not yet in Unicode, (b) even though it is in Unicode, the person did not know how to type it and so typed another character instead, or (c) a less than perfect, or ambiguous, 'spelling'? All of these are found both for Sinitic languages and non-Sinitic languages when written in CJK ideographs, be it in printed publications, web pages or text messages between native speakers. Nearly all of Cantonese is in Unicode and therefore typeable in theory (though some people will not be used to such writing, but I'm sure you know this), so it's not (a). I would say it's largely (c) (people will often make up their own plausible thing), even though (b) is a reason too. Not standardized does not mean totally beyond analysis or processing, or even necessarily that confusing to a native speaker; they are not random, though admittedly more complex than a standardized locale. Yes. And we both agree that standardization is desirable. Stephan
Re: Hanzi trad-simp folding and z-variants
Hi John, This is one of those questions that I've been wondering about as well ... my guess would be yes that should work (and dealing with z-variants is something you'll likely need to do anyways), but there *must* be some published algorithm out there that specifically addresses the issue of differentiable and recoverable folding for indexing. This comes up in NLP all the time for case folding. My impression is that the folks there just fold everything into lowercase and later apply a so-called truecasing algorithm (aka truecaser). To someone like me this just seems like totally the wrong approach, but I'll be open to be convinced otherwise with the right empirical arguments. If you find some information on data structures and algorithms tailored to this problem in the area of indexing/querying, let me know. Stephan On 6/6/2013 12:54 PM, John D. Burger wrote: Hi there - I'm working on an information retrieval application for a collection of Chinese documents, which appear to use a mix of traditional and simplified characters. My intuition is that it makes sense to do traditional to simplified folding for indexing and query processing (when the mapping is unambiguous), but I'd be interested in opinions about this. Second, I just noticed the kZVariant field in the Unihan.zip file. It seems to me that it makes sense to fold these together as well, correct? Thanks for any information you care to provide. - John Burger MITRE
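For readers following the thread, the "fold only when the mapping is unambiguous" idea in the original query might be sketched roughly as below. This is only an illustration, not anyone's actual implementation: the function names are hypothetical, the rows are hand-written in the tab-separated layout of Unihan_Variants.txt (inside Unihan.zip) and should be checked against the real file, and "unambiguous" is assumed here to mean "the field lists exactly one target code point".

```python
import re

# Illustrative rows in the Unihan_Variants.txt layout (code point, field, value).
# The values are assumptions for this sketch; verify them against the real data.
SAMPLE_ROWS = (
    "U+5EE3\tkSimplifiedVariant\tU+5E7F\n"
    "U+767C\tkSimplifiedVariant\tU+53D1\n"
    "U+9AEE\tkSimplifiedVariant\tU+53D1\n"
    "U+4E7E\tkSimplifiedVariant\tU+4E7E U+5E72\n"
)

def build_fold_table(lines, field="kSimplifiedVariant"):
    """Map source code point -> target code point for single-target rows only."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, name, value = line.split("\t", 2)
        if name != field:
            continue
        # Values may carry source tags (e.g. "U+XXXX<kSomeSource"); keep code points only.
        targets = re.findall(r"U\+([0-9A-F]{4,6})", value)
        if len(targets) == 1:  # more than one target: treat as ambiguous, do not fold
            table[int(source[2:], 16)] = int(targets[0], 16)
    return table

def fold(text, table):
    """Fold every character that has an entry in the table; leave the rest alone."""
    return text.translate(table)

table = build_fold_table(SAMPLE_ROWS.splitlines())
print(fold("廣發", table))
```

The same builder could be pointed at field="kZVariant" to experiment with z-variant folding, though whether that is advisable is exactly what the rest of the thread debates.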
Re: Hanzi trad-simp folding and z-variants
Resending email: originally sent by mistake just to the sender and not to the list. Dear John, Without looking at your texts I cannot say for certain; however, it should be noted that simplified, perhaps better thought of as abbreviated, is a relative term. A fully simplified passage of text will therefore contain both simplified characters and those which have not been simplified, that is abbreviated, and are therefore tagged as traditional. The situation regarding Chinese documents is somewhat more complicated: working at character level is not the best way to go for your purposes. Larger units such as words or phrases produce much more meaningful results, as this mimics the way a person reads Chinese; they do not read one character at a time but rather word by word. Whilst the kZVariant field does mean that characters can be, and frequently are, transposed, it does not tell you when. Also, as said above, the probability is that you have ordinary Chinese text written in the mainland style; folding based on the kZVariant field would either leave things unchanged or, if it changed things, would misspell words. That is, the sounds, or in some cases appearance, would probably be similar, or homophones, but would not match any dictionaries. For information retrieval from Chinese documents you require a list of words or phrases that you are looking for as a minimum, and in simple terms the longer the phrase the more likely the match is to be correct. How long is hard to say; it really depends on what information you are looking for. A list of words such as 现代汉语常用词表 has over 50 thousand words in it; a list with phrases would be longer. In short, such a folding algorithm based on kZVariant would not be a good idea. There are Chinese compatibility characters in Unicode which, if present, it would probably be good to fold in, but these are not in the scope of Unihan. 
Regards, John Knightley On Sat, Jun 8, 2013 at 4:00 AM, Stephan Stiller stephan.stil...@gmail.com wrote: Hi John, This is one of those questions that I've been wondering about as well ... my guess would be yes that should work (and dealing with z-variants is something you'll likely need to do anyways), but there *must* be some published algorithm out there that specifically addresses the issue of differentiable and recoverable folding for indexing. This comes up in NLP all the time for case folding. My impression is that the folks there just fold everything into lowercase and later apply a so-called truecasing algorithm (aka truecaser). To someone like me this just seems like totally the wrong approach, but I'll be open to be convinced otherwise with the right empirical arguments. If you find some information on data structures and algorithms tailored to this problem in the area of indexing/querying, let me know. Stephan On 6/6/2013 12:54 PM, John D. Burger wrote: Hi there - I'm working on an information retrieval application for a collection of Chinese documents, which appear to use a mix of traditional and simplified characters. My intuition is that it makes sense to do traditional to simplified folding for indexing and query processing (when the mapping is unambiguous), but I'd be interested in opinions about this. Second, I just noticed the kZVariant field in the Unihan.zip file. It seems to me that it makes sense to fold these together as well, correct? Thanks for any information you care to provide. - John Burger MITRE
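On the compatibility characters mentioned in John Knightley's message: one possible way to fold them, sketched here under the assumption that Unicode normalization is acceptable for the collection, is to run text through NFC before indexing. CJK compatibility ideographs carry canonical singleton decompositions, so normalization replaces them with the corresponding unified ideograph. The helper name fold_compat is hypothetical, and note that NFC also affects other canonically equivalent sequences, which may or may not be wanted.

```python
import unicodedata

def fold_compat(text: str) -> str:
    """Replace CJK compatibility ideographs with their unified equivalents via NFC."""
    return unicodedata.normalize("NFC", text)

sample = "\uF900"  # CJK COMPATIBILITY IDEOGRAPH-F900
folded = fold_compat(sample)
print(f"U+{ord(sample):04X} -> U+{ord(folded):04X}")  # prints "U+F900 -> U+8C48"
```

A small number of code points in the compatibility block (U+FA0E, for instance) are in fact unified ideographs with no decomposition, so they pass through unchanged.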
Re: Hanzi trad-simp folding and z-variants
simplified [is] better thought of as abbreviated Part of this is a terminological argument. The historical situation is indeed more complicated than many people know, but the truth is also that irrespective of e.g. people's past or present usage in handwriting there have (in the past and esp in the present) been printing traditions which you can pinpoint by political region and time, occasionally by publisher. Regardless of what exactly happened during the pre-simplification era, there are fairly stable traditions now. [quote approximate and adapted:] a []fully simplified[] passage of text will contain[] both simplified characters and those which have not been simplified [...] and therefore [be] tagged as traditional. This depends on the algorithm used for tagging. And note that tagging doesn't in fact have to be a /binary/ classifier.† working at character level is not the best way to go for your purposes, larger units such as words or phrases produce much more meaningful results as this mimics the way a person reads Chinese, they do not read one character at a time but rather word by word. I don't think JohnB was suggesting character-based retrieval. (I mean, who in his right mind would want to do letter-based (and post–case folding) retrieval for English documents? :-) Okay – just a joke, this analogy isn't any good.) But of course you're right to point out that simplification or the reverse operation (what's the term for that? T-conversion maybe?) is word- and context-dependent on the edges. A different point: I'm not suggesting imprecision, but people are partly used to this in text they've seen converted by those horrible tools you can find online for that purpose, and for some characters, people won't actually notice. Whilst the kZVariant field does mean that characters can be, and frequently are, transposed What do you mean by transposed? Could you give an example? 
it does not tell you when, also as said above the probability is that you have ordinary Chinese text written in the mainland style, folding based on the kZVariant field would either leave things unchanged or, if it changed things, would misspell words, that is the sounds, or in some cases appearance, would probably be similar, or homophones, but would not match any dictionaries. But if all occurrences of everything you process are folded (folding to lower-case is often done in NLP), this isn't a problem. Again, I'm not recommending this as best practice, I'm just pointing it out. There are Chinese compatibility characters in Unicode which, if present, it would probably be good to fold in, but these are not in the scope of Unihan. And you remind me that z-variation is locale-dependent (see also † above). Anyways, I think it's hard to find examples of meaning-divergent z-variant words in modern Mandarin (MSM). I'm sure you or someone else will be able to quickly dig out examples, but really the question is what set of algorithms and data structures is best to address the general situation. Have locale-dependent folding tables? Allow a search term prefix that specifies don't normalize or fold the following term? Have secondary filters in your search that use a stricter model of character identity? Stephan
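The three closing questions (locale-dependent folding tables, a "don't fold" search prefix, and a stricter secondary filter) could be combined along the following lines. This is a rough sketch and nothing more: the class name, the "=" prefix syntax, and the tiny hand-made tables are all hypothetical, and a real system would index segmented words from an actual tokenizer rather than hand-fed terms. The index keys on the folded form but keeps the original surface forms per document, so the folding stays recoverable.

```python
from collections import defaultdict

# Hypothetical per-locale folding tables; real ones would be far larger.
FOLD_TABLES = {
    "zh": str.maketrans({"廣": "广", "發": "发", "髮": "发", "頭": "头"}),
    "ja": str.maketrans({"廣": "広"}),
}

EXACT_PREFIX = "="  # hypothetical query syntax: "=term" means do not fold this term

class FoldedIndex:
    def __init__(self, locale):
        self.table = FOLD_TABLES[locale]
        # folded term -> {doc id -> set of original surface forms}
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, terms):
        for term in terms:
            self.postings[term.translate(self.table)][doc_id].add(term)

    def search(self, query):
        exact = query.startswith(EXACT_PREFIX)
        term = query[len(EXACT_PREFIX):] if exact else query
        hits = self.postings.get(term.translate(self.table), {})
        if exact:
            # Secondary filter: keep only documents containing this exact surface form.
            return sorted(doc for doc, forms in hits.items() if term in forms)
        return sorted(hits)

index = FoldedIndex("zh")
index.add(1, ["頭髮"])
index.add(2, ["头发"])
print(index.search("頭髮"))   # folded lookup matches both documents
print(index.search("=頭髮"))  # exact lookup matches only document 1
```

Storing surface forms costs extra space, but it avoids having to reconstruct the original spelling after the fact, which is the truecasing-style approach questioned earlier in the thread.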