Re: Hanzi trad-simp folding and z-variants

2013-06-09 Thread john knightley
On Sun, Jun 9, 2013 at 1:26 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:


   Though some confusion as what other questions are being discussed here.

 I think I misused the expression folding at some point. But the original
 query explicitly asked about do[ing] traditional to simplified folding for
 indexing and query processing (*when the mapping is unambiguous*) (emph
 added) so I wasn't really sure where parts of the discussion were going :-)


No problem.




   Japanese has well established traditions for simplifying CJK ideographs
 which are not identical to Chinese if one was to use a folding approach to
 deal with simplifications then there should be differences for Chinese and
 Japanese.

 I think the kyūjitai-shinjitai mappings are not in Unihan. (Compare the
 entries of 廣 (U+5EE3) and the characteristically Japanese character 広
 (U+5E83).) I know that certain contexts retain older forms (KenL talks
 about this somewhere too). Btw if you know about other mappings or good
 resources, I'll be curious to know.


No, but of course I am also interested to know what is available.



   quite well documented is a relative term

I highly respect the work in Cheung & Bauer, but it makes no attempt to
 tell us how easily understood the characters are. Many of them are ad-hoc
 coinages that are not understood by any of my informants; sometimes for say
 6 ways of writing a syllable-morpheme, I can make my informants tell me
 that perhaps *one* of them is passable. This problem isn't easily solved,
 but then the source isn't helpful in knowing which out of the approx 1000
 characters are actually used nowadays. I won't give you a number, as I'd
 have to check more carefully to be quotable. The number of morphemes for
 which there truly seems to be no written representation is *very* low,
 but often the characters in existence aren't exactly comprehensible to many
 native speakers either, and not all of them are unambiguous. This will give
 you an idea.


   It documents 1,095 different Cantonese characters. Familiarity with a
writing system makes the non-obvious parts comprehensible, as can
context. Some Cantonese characters, as with Sawndip, tend by their
construction to be ambiguous: a character often means 'something which
sounds like this known character', and therefore the meaning must be learned.


   Zhuang Sawndip

 Sounds exciting.


Yes, no shortage of new material to get one's teeth into.



   By best choice do you mean (a) the  person producing the electronic
 form was unable to use the character they wished
   because either it is not yet in Unicode  (b) even though in Unicode the
 person was did not know how to type it so type another character instead
 (c) a less than perfect, or ambiguous, 'spelling'  .  All of which are
 found both for Sinitic languages and non-Sinitic languages when written in
 CJK ideographs, be it printed publications, web-pages or text messages
 between native speakers.

 Nearly all of Cantonese is in Unicode and therefore typeable in theory
 (though some people will not be used to such writing, but I'm sure you know
 this), so it's not (a). I would say it's largely (c) (people will often
 make up their own plausible thing), even though (b) is a reason too.


   Many smart phones, whilst having the infrastructure, lack either the IME
or the font for Cantonese characters in the SIP.

For Zhuang Sawndip, Unicode support is very lacking at present: on
average over 10% of the text on a page uses characters not yet in Unicode
(a), and about 2% of the text comes from the SIP, so typing is often a
challenge for many (b).



   Not standardize does not mean totally beyond analysis or  processing,
 or even necessarily that confusing to a native speaker, they are not
 random, though admittedly more complex than a standardized locale.

 Yes. And we both agree that standardization is desirable.


Yes.

John


 Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-09 Thread Stephan Stiller


Familiarity with a writing system makes the non-obvious parts 
comprehensible, as can context.
The work is a thorough listing of usage instances that the authors could 
encounter in the wild. My informants can't recall ever having seen many 
of these characters. They wouldn't use them, and that they can recognize 
them with sufficient context alone doesn't mean they should be regarded 
as normative in any way.


Some Cantonese characters, as for Sawndip by their construction tend 
to be ambiguous which often means 'something which sounds like this 
known character, and therefore the meaning must be learned.
Many characters that can be and are used for Cantonese, including both 
those that are used for Mandarin as well as those that aren't have more 
than one pronunciation. Many of those in the latter category and even 
those with only a single pronunciation in some sort of vague 
prescriptive sense are used approximately, for their phonetic value. For 
those that aren't standardized, it's unclear to what extent there is 
'knowledge' to learn, as this knowledge hasn't yet stabilized.


Many smart phones whilst having the infrastructure lack either the IME 
or font for Cantonese characters in the SIP.
Most of the Cantonese that's commonly used and recognized is typeable 
with Cangjie or handwriting (pen-stroke) recognition. A huge part of 
HKSCS isn't actually known by the general public. Present-day usage is 
also defined by what's typeable. So it's a two-way interaction. I don't 
know about CN-based smartphones, though.


Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-09 Thread john knightley
On Sun, Jun 9, 2013 at 4:18 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:



  Some Cantonese characters, as for Sawndip by their construction tend to
 be ambiguous which often means 'something which sounds like this known
 character, and therefore the meaning must be learned.

 Many characters that can be and are used for Cantonese, including both
 those that are used for Mandarin as well as those that aren't have more
 than one pronunciation. Many of those in the latter category and even those
 with only a single pronunciation in some sort of vague prescriptive sense
 are used approximately, for their phonetic value. For those that aren't
 standardized, it's unclear to what extent there is 'knowledge' to learn, as
 this knowledge hasn't yet stabilized.


For me 'non-standardized' means there is not one recognized standard; this
does not mean that things are completely unstable, nor that there are no
traditions of what character is used for what word that have been passed
down for many generations.






  Many smart phones whilst having the infrastructure lack either the IME or
 font for Cantonese characters in the SIP.

 Most of the Cantonese that's commonly used and recognized is typeable
 with Cangjie or handwriting (pen-stroke) recognition. A huge part of HKSCS
 isn't actually known by the general public. Present-day usage is also
 defined by what's typeable. So it's a two-way interaction. I don't know
 about CN-based smartphones, though.


  In terms of both the range of characters in installed fonts and the IMEs,
many smart phones are quite a long way behind computers at present.
Mandarin has quite good support; however, IICore, which includes some SIP
Cantonese characters, does not seem to be the criterion for many smart
phones, whose Chinese fonts tend to cover just the BMP.
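
A rough way to check how much of a text depends on support beyond the BMP
is to look at the code point values. The following is only a minimal
sketch; the function names are made up for illustration, and the only
fact relied on is that the SIP is plane 2 of the Unicode code space
(U+20000 to U+2FFFF).

```python
def plane(ch: str) -> int:
    """Return the Unicode plane of a single character (0 = BMP, 2 = SIP)."""
    return ord(ch) >> 16


def sip_share(text: str) -> float:
    """Fraction of non-whitespace characters that lie in plane 2 (the SIP)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if plane(c) == 2) / len(chars)


# U+5EE3 is a BMP ideograph; U+20000 is the first code point of the SIP.
sample = "\u5ee3\U00020000"
print(plane(sample[0]), plane(sample[1]), sip_share(sample))
```

A font or IME that covers only the BMP cannot display or enter any
character for which plane() returns a non-zero value.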

Regards
John




 Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-09 Thread Stephan Stiller


For me non-standardized' means there is not one recognized standard, 
this does not mean that things are completely unstable, nor that there 
are no traditions of what character is used for what word that have 
been passed down for many generations.


/As I stated/, for a decent number of syllable-morphemes (probably the 
/majority/ of Cheung-Bauer entries shouldn't be considered active or 
passive knowledge), native speakers will have no clue how to write them, 
and the array of characters to choose from (if CB is used for a 
forced-choice task), or often a good portion of the array, either 
appears unsatisfactory to them or is seen as okay but previously 
unknown. Native speakers have no problem approximating these syllables 
otherwise if pressed, but, yes, things for those syllables are not that 
stable and if there are stable traditions, they might not be well-known 
except for a low percentage of CB entries – definitely less than half, 
but I don't want to commit to a specific number.


Nonetheless, both type and token frequency of such syllable-morphemes 
are low.


Stephan



Re: Hanzi trad-simp folding and z-variants

2013-06-09 Thread john knightley
On Sun, Jun 9, 2013 at 5:56 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:


   For me non-standardized' means there is not one recognized standard,
 this does not mean that things are completely unstable, nor that there are
 no traditions of what character is used for what word that have been passed
 down for many generations.


 *As I stated*, for a decent number of syllable-morphemes (probably the *
 majority* of Cheung-Bauer entries shouldn't be considered active or
 passive knowledge), native speakers will have no clue how to write them,
 and the array of characters to chose from (if CB is used for a
 forced-choice task), or often a good portion of the array, either appears
 unsatisfactory to them or is seen as okay but previously unknown. Native
 speakers have no problem approximating these syllables otherwise if
 pressed, but, yes, things for those syllables are not that stable and if
 there are stable traditions, they might not be well-known except for a low
 percentage of CB entries – definitely less than half, but I don't want to
 commit to a specific number.


   Yes. Given the way the Cheung-Bauer list was compiled, it is certainly
hard to see how most of the characters would be widely known.

   With Zhuang Sawndip I have been examining texts from different locations
and eras, and there exists evidence of transmission from generation to
generation, of progression, and also of instability.

Regards
John


 Nonetheless, both type and token frequency of such syllable-morphemes are
 low.









 Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-09 Thread Stephan Stiller


The way the Cheung-Bauer list was compiled certainly hard to see how 
most of the characters would be in widely known.


I'd need to look at CB again for accurate numbers, but to some extent 
it's simply because some syllable-morphemes are listed with many 
different attested possibilities. So one really wouldn't expect to need 
all ≈1000 characters in there.


There is a tricky aspect to this, though: the left-addition of o (or a 
mouth radical) leaves the exact number a bit open and allows for a 
larger count. Do you write some Cantonese-only syllable-morpheme as X 
or ⿰口X/oX? (Most of the latter combinations are in fact in CB, 
but, anyways, it's hard to give a precise answer to the how many 
Cantonese characters question.) Here is an example: 嚿 vs 舊 for the 
measure word gau6 (lump). Depending on whom you ask, you might even 
find a strong opinion. Most people will probably say that 嚿 is 
better, but the fact that you find 舊 (because it's more 
straightforward to type) means that in a way it's descriptively correct. 
There are cases where the variant without a mouth would be regarded as 
more common or natural, because the version with a mouth radical is 
typographically rare.
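
For retrieval purposes, one way to cope with this kind of alternation is
to treat the variants as one equivalence class when matching. The sketch
below is hypothetical: the table is hand-made, only the 嚿/舊 pair is
taken from this message, and the function names are invented for
illustration.

```python
# Hand-made, illustrative variant groups. Only the pair below comes from the
# discussion; real data would have to be collected separately.
VARIANT_GROUPS = [
    {"嚿", "舊"},  # measure word gau6 'lump', with and without the mouth radical
]


def canonical(ch: str) -> str:
    """Map a character to a fixed representative of its variant group."""
    for group in VARIANT_GROUPS:
        if ch in group:
            return min(group)  # any stable choice of representative will do
    return ch


def matches(query: str, text: str) -> bool:
    """True if text contains query once grouped characters are treated as equal."""
    def fold(s: str) -> str:
        return "".join(canonical(c) for c in s)
    return fold(query) in fold(text)


print(matches("一嚿", "一舊石頭"))
```

The cost of this generosity is that 舊 in its ordinary sense 'old' is
matched as well, so a stricter secondary filter may still be wanted.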


With Zhuang Sawndip I have examining texts from different locations 
and eras, that there exists both evidence of transmission from 
generation to generation, of progression and also of unstability.

Just curious: what is a rough character count?

Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-09 Thread john knightley
On Sun, Jun 9, 2013 at 7:29 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:


  The way the Cheung-Bauer list was compiled certainly hard to see how most
 of the characters would be in widely known.


 I'd need to look at CB again for accurate numbers, but to some extent
 it's simply because some syllable-morphemes are listed with many different
 attested possibilities. So one really wouldn't expect to need all ≈1000
 characters in there.

 There is a tricky aspect to this, though: the left-addition of o (or a
 mouth radical) leaves the exact number a bit open and allows for a larger
 count. Do you write some Cantonese-only syllable-morpheme as X or
 ⿰口X/oX? (Most of the latter combinations are in fact in CB, but,
 anyways, it's hard to give a precise answer to the how many Cantonese
 characters question.) Here is an example: 嚿 vs 舊 for the measure word gau6
 (lump). Depending on whom you ask, you might even find a strong opinion.
 Most people will probably say that 嚿 is better, but the fact that you
 find 舊 (because it's more straightforward to type) means that in a way it's
 descriptively correct. There are cases where the variant without a mouth
 would be regarded as more common or natural, because the version with a
 mouth radical is typographically rare.


  With Zhuang Sawndip I have examining texts from different locations and
 eras, that there exists both evidence of transmission from generation to
 generation, of progression and also of unstability.

 Just curious: what is a rough character count?


   There are a number of dialects, which pushes the numbers up a little. The
only published dictionary has just over ten thousand characters, of which
just over half are not in Unicode yet. The count of Sawndip from different
texts and research published in China is currently around twenty thousand,
with ten thousand not in Unicode.

However, the currently published material represents only a fraction of
the whole. My best estimate is that the total number of Sawndip currently in
circulation is 50 to 100 thousand, of which 20 to 30 thousand are presently
in Unicode.

John



 Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-08 Thread john knightley
On Sat, Jun 8, 2013 at 11:55 AM, Stephan Stiller
stephan.stil...@gmail.com wrote:


   simplified [is] better thought of as abbreviated

 Part of this is a terminological argument. The historical situation is
 indeed more complicated than many people know, but the truth is also that
 irrespective of eg people's past or present usage in handwriting there have
 (in the past and esp in the present) been printing traditions which you can
 pinpoint by political region and time, occasionally by publisher.
 Regardless of what exactly happened during the pre-simplification era,
 there are fairly stable traditions now.



Merely offering an alternative translation of 简体. As you say, the historical
situation is complex; however, Simplified, as in the standard used in
mainland China, is well defined. The situation also tends to be complex once
one steps outside of Putonghua.


 [quote approximate and adapted:]

  a []fully simplified[] passage of text will contain[] both simplified
 characters and those which have not been simplified [...] and therefore
 [be] tagged as traditional.

 This depends on the algorithm used for tagging. And note that tagging
 doesn't in fact have to be a *binary* classifier.†


Tautological; however, the original email was referring to using such a
binary tagging system.
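
To illustrate the non-binary alternative mentioned in the quoted text, a
tagger can return a score instead of one of two labels. This is a minimal
sketch only: the two character sets are tiny hand-picked examples, and the
function name is hypothetical. Characters shared by both orthographies are
ignored, which avoids labelling a fully simplified passage as traditional
merely because it contains characters that were never simplified.

```python
# Tiny illustrative sets of characters that distinguish the two orthographies.
# Real sets would be derived from a full mapping table.
SIMPLIFIED_ONLY = set("们学汉简")
TRADITIONAL_ONLY = set("們學漢簡")


def script_score(text: str) -> float:
    """Score in [-1, 1]: -1 if every distinguishing character is traditional,
    +1 if every one is simplified, 0.0 if there are none or they balance."""
    s = sum(1 for c in text if c in SIMPLIFIED_ONLY)
    t = sum(1 for c in text if c in TRADITIONAL_ONLY)
    if s + t == 0:
        return 0.0
    return (s - t) / (s + t)


print(script_score("我们学习汉字"), script_score("我們"), script_score("我的"))
```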




  working at character level is not the best way to go for your purposes,
 a larger units such as words or phrases produce much more meaningful
 results as this mimics the way a person reads Chinese, they do read process
 one character at a time rather word by word.

 I don't think JohnB was suggesting character-based retrieval. (I mean, who
 in his right mind would want to do letter-based (and post–case folding)
 retrieval for English documents? :-) Okay – just a joke, this analogy isn't
 any good.) But of course you're right to point out that simplification or
 the reverse operation (what's the term for that? T-conversion maybe?) is
 word- and context-dependent on the edges.



My point here was that folding based on a character-by-character
traditional-to-simplified model would make accurate word-based
retrieval from the resulting text not easier but harder.



 A different point: I'm not suggesting imprecision, but people are partly
 used to this in text they've seen converted by those horrible tools you can
 find online for that purpose, and for some characters, people won't
 actually notice.

   Whilst the kZVariant field does mean that characters can, are
 frequently are transposed

 What do you mean by transposed? Could you give an example?


By transposed I mean that characters can sometimes be changed when going
between different traditions and locales; it is not a one-way street.




   it does not tell you when, also as said above the probability is that
 you have ordinary Chinese text written in the mainland style, folding based
 on the the kZVariant field, would either leave things unchanged or if it
 changed things would misspell words, that is the sounds, or in some cases
 appearance, would probably be similar, or homophones, but would not match
 any dictionaries.

 But if all occurrences of everything you process are folded (folding to
 lower-case is often done in NLP), this isn't a problem. Again, I'm not
 recommending this as best practice, I'm just pointing it out.

  There are Chinese compatibility characters in Unicode which if present
 which it probably would good to fold in but these are not in the scope of
 UniHan.


 My earlier statement about UniHan and compatibility variants was not
correct: UniHan does have a kCompatibilityVariant field.

   And you remind me that z-variation is locale-dependent (see also †
 above). Anyways, I think it's hard to find examples of meaning-divergent
 z-variant words in modern Mandarin (MSM). I'm sure you or someone else will
 be able to quickly dig out examples, but really the question is what set of
 algorithms and data structures is best to address the general situation.
 Have locale-dependent folding tables? Allow a search term prefix that
 specifies don't normalize or fold the following term? Have secondary
 filters in your search that use a stricter model of character identity?
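
 One of the options raised in the quoted paragraph, a search term prefix
 that switches folding off, can be sketched as follows. Everything here is
 an assumption made for illustration: the "=" prefix, the function names
 and the two-entry table are invented and do not describe any existing
 search system.

```python
# Illustrative traditional-to-simplified table with two entries only.
T2S = {"廣": "广", "學": "学"}

NO_FOLD_PREFIX = "="  # hypothetical query syntax: "=term" means "do not fold"


def fold(term: str) -> str:
    """Fold a term character by character using the illustrative table."""
    return "".join(T2S.get(c, c) for c in term)


def prepare_query(query: str) -> list[tuple[str, bool]]:
    """Split a query on whitespace and return (term, was_folded) pairs."""
    prepared = []
    for term in query.split():
        if term.startswith(NO_FOLD_PREFIX):
            # Strict term: keep the characters exactly as typed.
            prepared.append((term[len(NO_FOLD_PREFIX):], False))
        else:
            prepared.append((fold(term), True))
    return prepared


print(prepare_query("廣州 =學"))
```

 Terms flagged as not folded would then be matched against an unfolded
 copy of the text, which is one form of the stricter secondary filter.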


http://www.unicode.org/reports/tr38/ gives a good summary of the
possibilities. Trying to fold from one locale to another, which is what
folding from traditional to simplified would be, is not a good idea; best
practice is to bear in mind the locale being used, and do information
retrieval on a locale-by-locale basis.

Regards
John Knightley



 Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-08 Thread Stephan Stiller


http://www.unicode.org/reports/tr38/ does a good summary of the 
possibilities.

Which and where?

Trying to fold from one locale to another, which is what folding 
from traditional to simplified would be is not a good idea, best 
practice is not bear in mind the locale being used, and do information 
retrieval on a locale by locale basis.

What do you mean?

Put simply: Either you don't let someone search a TW database with 
simplified characters or you convert either the search terms or the 
searched documents internally for the duration of your search – or some 
combination of these options. It is not at all obvious to me what the 
fastest way in a big data context is. There's gotta be research about this.
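
One of these combinations is to fold the documents once when they are
indexed and to fold each search term the same way at query time. The
sketch below shows the idea on a toy character-bigram index; the mapping
table, the function names and the choice of bigrams are all assumptions
made for illustration, and it says nothing about which option is fastest
at scale.

```python
from collections import defaultdict

# Tiny illustrative traditional-to-simplified table.
T2S = {"學": "学", "漢": "汉", "們": "们", "習": "习"}


def fold(text: str) -> str:
    return "".join(T2S.get(c, c) for c in text)


def bigrams(text: str) -> list[str]:
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each folded character bigram to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(fold(text)):
            index[gram].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Ids of documents containing every bigram of the folded query."""
    grams = bigrams(fold(query))
    if not grams:
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))


docs = {1: "學習漢字", 2: "学习汉字", 3: "我們"}
index = build_index(docs)
print(search(index, "汉字"), search(index, "漢字"))
```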


Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-08 Thread Stephan Stiller


The situation also sends to be complex once one steps putside of 
Putonghua.
Given that the situation there is a lack of standardization (and a lack 
of tables laying out variant spellings), I don't think anything other 
than radical, hand-tuned folding to cover all possibilities is sensible 
to query a dialectal database.


Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-08 Thread Stephan Stiller
As far as general folding is concerned, performing conversion (whether 
it's word-based or not and even if it's locale-tailored) and then a 
strict search will let you miss out on the z-variation you find in the 
wild (because of true variation or of misspellings), and a more generous 
inclusion of z-variation is in fact unlikely to give you false matches 
(normally different words don't merely differ on the z-axis, though I 
seem to remember having seen an example involving the name of a 
historical term somewhere).


You are right about this point
My point here was folding based on a character by character approach 
of traditional to simplified model would not make accurate word based 
retrieval from the  resulting text easier but harder.
and the note on transposition. But I also don't think this is the end 
of the story: If you strictly convert on a word level, you will miss 
(note that this point is different from what's in my first paragraph 
above) those search results where your contextual conversion heuristics 
were wrong. Perhaps a Classical Chinese character collocation agrees with 
a modern Chinese term in simplified spelling but should be converted 
directly instead of transposed when going from CN to TW. So for that 
you'd need some sort of n-way expansion of a search query. I don't have 
an example off the top of my head, but I don't think this scenario is 
unrealistic at all.
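
A minimal sketch of what such an n-way expansion could look like: every
character with more than one possible conversion contributes all of its
candidates, and the query is expanded into the Cartesian product. The
table is a small hand-made example (the one-to-many pairs 发 → 發/髮 and
干 → 乾/幹/干 are standard ones), and the function name is hypothetical.

```python
from itertools import product

# Illustrative one-to-many simplified-to-traditional table.
S2T = {
    "发": ["發", "髮"],
    "干": ["乾", "幹", "干"],
    "头": ["頭"],
}


def expand(term: str) -> list[str]:
    """All candidate spellings of term, one per combination of alternatives."""
    choices = [S2T.get(c, [c]) for c in term]
    return ["".join(combo) for combo in product(*choices)]


print(expand("头发"))
```

The number of candidates is the product of the number of alternatives
per character, so in practice the list would probably have to be pruned
against a word list before it is sent to the index.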


Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-08 Thread john knightley
On Sat, Jun 8, 2013 at 4:02 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:


  
 http://www.unicode.org/reports/tr38/ does
  a good summary of the possibilities.

 Which and where?



Section 3.7.1 Simplified and Traditional Chinese Variants talks about
converting between Simplified and Traditional Chinese.




  Trying to fold from one locale to another, which is what folding from
 traditional to simplified would be is not a good idea, best practice is not
 bear in mind the locale being used, and do information retrieval on a
 locale by locale basis.

 What do you mean?

 Put simply: Either you don't let someone search a TW database with
 simplified characters or you convert either the search terms or the
 searched documents internally for the duration of your search – or some
 combination of these options. It is not at all obvious to me what the
 fastest way in a big data context is. There's gotta be research about this.

 Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-08 Thread john knightley
On Sat, Jun 8, 2013 at 4:05 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:


  The situation also tends to be complex once one steps putside of
 Putonghua.

 Given that the situation there is a lack of standardization (and a lack of
 tables laying out variant spellings), I don't think anything other than
 radical, hand-tuned folding to cover all possibilities is sensible to query
 a dialectal database.


Some dialects such as Cantonese are quite well documented; simplification
is also found in, for example, Japanese CJK ideographs, and is documented.
There has been increased interest in such things in recent years. One person's
'hand-tuned' folding of today can become the basis of a standard of tomorrow.

John



 Stephan




Re: Hanzi trad-simp folding and z-variants

2013-06-08 Thread Stephan Stiller

I.


Which and where?

Section 3.7.1 Simplified and Traditional Chinese Variants talks about 
converting between Simplified and Traditional Chinese.

You wrote this


http://www.unicode.org/reports/tr38/ does a good summary of
the possibilities.

in response to my inquiry about examples of meaning-divergent z-variant 
words in modern Mandarin and appropriate algorithms and data 
structures. Also, the Unihan database doesn't provide collocational 
data for T/S conversion.



II.

simplification is also found in for example Japanese CJK ideographs 
which is documented
Contextual conversion (and shifting/transposition) is essentially not 
an issue in this context, even though you have an odd case of deviation 
here and there.



Some dialects such as Cantonese are quite well documented

[and]
There is an increased interest in such things in recent years.  One 
persons 'hand-tuned' of today can become the basis  of a standard of 
tomorrow.


1a. I'd say I have a decent grasp of the topic of lexical variation for 
written Cantonese, based on a decent amount of fieldwork. (While we're 
at it, I also know at least one researcher with an interest in 
standardization of Cantonese spelling.) I'm certain that lexical 
variation in Cantonese is not well-documented, though there are a bunch 
of sources from which you can scrape your own thing together.
1b. Keep in mind that most materials in electronic form (originally 
written in this form or digitized) don't use the best character 
choices – needless to say it's gotta be even truer for other Sinitic 
languages.
2. This is entirely unrelated to the question of whether one can or 
should describe simplified characters as abbreviated. There is a 
connection to your statement about things being on a sliding scale (you 
used the word relative), but for Cantonese it's more like this 
translates into a lot of inconsistency between using genuine C spelling, 
a M substitute, a C-based phonetic transcription, ad-hoc usage using the 
mouth radical or a prefixed roman o, an English-based informal 
transcription using Latin letters, and avoidance. Whether this is 
electronically manageable in principle depends on whether you include 
entirely romanized blogs (which I wouldn't recommend), but – in any case 
– anything other than liberal QE (query expansion) will /not/ work. (I 
might previously have misused the word folding to mean conversion.)
3. Other Sinitic languages are essentially not at all standardized 
(we're talking Chinese characters here, not romanizations). Last time I 
checked it seemed like Taiwanese is a total mess, and Shanghainese has a 
(mainland-CN) researcher who is (still) writing a dictionary to actually 
find or document written representations of all syllable-morphemes to 
capture all of SHnese. The best SHnese textbook was published a couple 
of years ago in HK and uses traditional characters (!) to represent 
modern SHnese.


Stephan



Re: Hanzi trad-simp folding and z-variants

2013-06-08 Thread Stephan Stiller

better word choice:
lexical variation → orthographic variation (in my prev email)




Re: Hanzi trad-simp folding and z-variants

2013-06-08 Thread john knightley
On Sat, Jun 8, 2013 at 9:00 PM, Stephan Stiller
stephan.stil...@gmail.com wrote:

  I.

 Which and where?

 Section 3.7.1 Simplified and Traditional Chinese Variants talks about
 converting between Simplified and Traditional Chinese.

 You wrote this

http://www.unicode.org/reports/tr38/ does a good summary of the
 possibilities.

in response to my inquiry about examples of meaning-divergent
 z-variant words in modern Mandarin and appropriate algorithms and data
 structures. Also, the Unihan database doesn't provide collocational data
 for T/S conversion.


So we both agree that Unihan is not designed to tell people how to convert
between traditional and simplified characters. Though there is some
confusion as to what other questions are being discussed here.




 II.


   simplification is also found in for example Japanese CJK ideographs
 which is documented

 Contextual conversion (and shifting/transposition) is essentially not an
 issue in this context, even though you have an odd case of deviation here
 and there.


   Japanese has well-established traditions for simplifying CJK ideographs
which are not identical to the Chinese ones; if one were to use a folding
approach to deal with simplifications, then the folding should differ for
Chinese and Japanese.
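
As a small illustration of that difference, a folding table keyed by
locale sends the same traditional character to different targets. This is
a sketch under stated assumptions: the locale tags, the table and the
function name are made up for the example, and the single entry uses the
pair 廣 (U+5EE3) and 広 (U+5E83) mentioned in this thread together with
the mainland simplified form 广 (U+5E7F).

```python
# Illustrative locale-dependent folding tables with one entry each.
FOLD_BY_LOCALE = {
    "zh": {"廣": "广"},  # U+5EE3 -> U+5E7F, mainland Chinese simplified form
    "ja": {"廣": "広"},  # U+5EE3 -> U+5E83, Japanese shinjitai form
}


def fold(text: str, locale: str) -> str:
    """Fold text with the table for locale; unknown locales leave it unchanged."""
    table = FOLD_BY_LOCALE.get(locale, {})
    return "".join(table.get(c, c) for c in text)


print(fold("廣告", "zh"), fold("廣告", "ja"))
```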



   Some dialects such as Cantonese are quite well documented

 [and]

   There is an increased interest in such things in recent years.  One
 persons 'hand-tuned' of today can become the basis  of a standard of
 tomorrow.


 1a. I'd say I have a decent grasp of the topic of lexical variation for
 written Cantonese, based on a decent amount of fieldwork. (While we're at
 it, I also know at least one researcher with an interest in standardization
 of Cantonese spelling.) I'm certain that lexical variation in Cantonese is
 not well-documented, though there are a bunch of sources from which you can
 scrap your own thing together.


"Quite well documented" is a relative term: after Mandarin, Cantonese is
one of the better documented of the Chinese dialects, and better documented
than the use of CJK ideographs for other languages such as, say, Zhuang
Sawndip, my primary area of research. That is not to say there is not more
work to be done in this area for Cantonese.



 1b. Keep in mind that most materials in electronic form (originally
 written in this form or digitized) don't use the best character choices –
 needless to say it's gotta be even truer for other Sinitic languages.


   By "best choice", do you mean that (a) the person producing the electronic
form was unable to use the character they wished because it is not yet in
Unicode, (b) the character is in Unicode but the person did not know how to
type it and so typed another character instead, or (c) the text uses a less
than perfect, or ambiguous, 'spelling'? All of these are found both for
Sinitic languages and non-Sinitic languages when written in CJK ideographs,
be it in printed publications, web pages or text messages between native
speakers.



 2. This is entirely unrelated to the question of whether one can or should
 describe simplified characters as abbreviated. There is a connection to
 your statement about things being on a sliding scale (you used the word
 relative), but for Cantonese it's more like this translates into a lot of
 inconsistency between using genuine C spelling, a M substitute, a C-based
 phonetic transcription, ad-hoc usage using the mouth radical or a prefixed
 roman o, an English-based informal transcription using Latin letters, and
 avoidance. Whether this is electronically manageable in principle depends
 on whether you include entirely romanized blogs (which I wouldn't
 recommend), but – in any case – anything other than liberal QE (query
 expansion) will *not* work. (I might previously have misused the word
 folding to mean conversion.)


The "this" here is not too clear to me. However, the features you describe
for Cantonese are also found in Zhuang texts; these were, however, not what
I meant by "abbreviated". As to variants in general, yes, the scale is wide,
and to a degree dependent upon the locale. Perhaps my email was not clear
either; however, I think we were using "folding" in the same way, namely as
a step to be taken before searching based on a word list or dictionary,
conversion to a romanized script, or text to speech.



  3. Other Sinitic languages are essentially not at all standardized (we're
 talking Chinese characters here, not romanizations). Last time I checked it
 seemed like Taiwanese is a total mess, and Shanghainese has a (mainland-CN)
 researcher who is (still) writing a dictionary to actually find or document
 written representations of all syllable-morphemes to capture all of
 SHnese. The best SHnese textbook was published a couple of years ago in HK
 and uses traditional characters (!) to represent modern SHnese.


"Not standardized" does not mean totally beyond analysis or processing,
or even necessarily that confusing to a native speaker, they are not
random, though 

Re: Hanzi trad-simp folding and z-variants

2013-06-08 Thread Stephan Stiller


So we both agree that Unihan is not designed to tell people how to 
covert between traditional and simplified characters.

Yep.


Though there is some confusion as to what other questions are being discussed here.
I think I misused the expression "folding" at some point. But the 
original query explicitly asked about "do[ing] traditional to simplified 
folding for indexing and query processing (/when the mapping is 
unambiguous/)" (emph added) so I wasn't really sure where parts of the 
discussion were going :-)


Japanese has well-established traditions for simplifying CJK 
ideographs which are not identical to the Chinese ones; if one were to use a 
folding approach to deal with simplifications, then there should be 
differences between Chinese and Japanese.
I think the kyūjitai-shinjitai mappings are not in Unihan. (Compare the 
entries of 廣 (U+5EE3) and the characteristically Japanese character 広 
(U+5E83).) I know that certain contexts retain older forms (KenL talks 
about this somewhere too). Btw if you know about other mappings or good 
resources, I'll be curious to know.
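
To make the Chinese/Japanese difference concrete, here is a minimal sketch of locale-keyed folding tables. The table contents and the names FOLD_TABLES and fold are assumptions made up for illustration; a real table would have to come from a vetted source, and (as said above) the kyūjitai-shinjitai pairs may not be available from Unihan.

```python
# Hypothetical, hand-written tables: a handful of pairs for illustration only.
FOLD_TABLES = {
    # Chinese: traditional form -> mainland simplified form
    "zh": str.maketrans({"廣": "广", "氣": "气", "國": "国"}),
    # Japanese: kyūjitai -> shinjitai
    "ja": str.maketrans({"廣": "広", "氣": "気", "國": "国"}),
}


def fold(text, locale):
    """Fold text with the table for the given locale; unknown locales are left as-is."""
    table = FOLD_TABLES.get(locale)
    return text.translate(table) if table else text


print(fold("廣", "zh"))  # Chinese folding target
print(fold("廣", "ja"))  # Japanese folding target
```

The same input character folds to a different target per locale, so an index that mixes both languages would need to know the locale of each document (or keep one folded key per locale).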



"quite well documented" is a relative term
I highly respect the work in Cheung & Bauer, but it makes no attempt to 
tell us how easily understood the characters are. Many of them are 
ad-hoc coinages that are not understood by any of my informants; 
sometimes for say 6 ways of writing a syllable-morpheme, I can make my 
informants tell me that perhaps /one/ of them is passable. This problem 
isn't easily solved, but then the source isn't helpful in knowing which 
out of the approx 1000 characters are actually used nowadays. I won't 
give you a number, as I'd have to check more carefully to be quotable. 
The number of morphemes for which there truly seems to be no written 
representation is /very/ low, but often the characters in existence 
aren't exactly comprehensible to many native speakers either, and not 
all of them are unambiguous. This will give you an idea.



Zhuang Sawndip

Sounds exciting.

By "best choice" do you mean (a) the person producing the electronic 
form was unable to use the character they wished 
because it is not yet in Unicode, (b) even though it is in Unicode, 
the person did not know how to type it and so typed another character 
instead, or (c) a less than perfect, or ambiguous, 'spelling'? All of 
these are found both for Sinitic languages and non-Sinitic languages 
when written in CJK ideographs, be it in printed publications, web pages 
or text messages between native speakers.
Nearly all of Cantonese is in Unicode and therefore typeable in theory 
(though some people will not be used to such writing, but I'm sure you 
know this), so it's not (a). I would say it's largely (c) (people will 
often make up their own plausible thing), even though (b) is a reason too.


"Not standardized" does not mean totally beyond analysis or processing, 
or even necessarily that confusing to a native speaker; they are not 
random, though admittedly more complex than a standardized locale.

Yes. And we both agree that standardization is desirable.

Stephan



Re: Hanzi trad-simp folding and z-variants

2013-06-07 Thread Stephan Stiller

Hi John,

This is one of those questions that I've been wondering about as well 
... my guess would be yes that should work (and dealing with z-variants 
is something you'll likely need to do anyways), but there *must* be 
some published algorithm out there that specifically addresses the issue 
of differentiable and recoverable folding for indexing.


This comes up in NLP all the time for case folding. My impression is 
that the folks there just fold everything into lowercase and later apply 
a so-called truecasing algorithm (aka truecaser). To someone like me 
this just seems like totally the wrong approach, but I'll be open to be 
convinced otherwise with the right empirical arguments.
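
For readers who haven't seen the fold-then-truecase pattern, a minimal sketch follows. This is a deliberately naive frequency-based truecaser, not any particular published algorithm; the class name Truecaser and its methods are made up for illustration.

```python
from collections import Counter, defaultdict


class Truecaser:
    """Naive truecaser: remembers which surface form of each folded token is most frequent."""

    def __init__(self):
        # folded token -> counts of the surface forms it was observed with
        self.forms = defaultdict(Counter)

    def observe(self, tokens):
        """Record surface forms from training text."""
        for tok in tokens:
            self.forms[tok.casefold()][tok] += 1

    def restore(self, folded):
        """Return the most frequent surface form, or the folded token if it was never seen."""
        seen = self.forms.get(folded)
        return seen.most_common(1)[0][0] if seen else folded


tc = Truecaser()
tc.observe(["Unicode", "unicode", "Unicode", "NLP"])
print(tc.restore("unicode"))
print(tc.restore("nlp"))
```

The weakness is visible right in the sketch: folding throws the distinction away, and the restore step can only guess it back from statistics.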


If you find some information on data structures and algorithms tailored 
to this problem in the area of indexing/querying, let me know.


Stephan


On 6/6/2013 12:54 PM, John D. Burger wrote:

Hi there -

I'm working on an information retrieval application for a collection of Chinese 
documents, which appear to use a mix of traditional and simplified characters. 
My intuition is that it makes sense to do traditional to simplified folding for 
indexing and query processing (when the mapping is unambiguous), but I'd be 
interested in opinions about this.

Second, I just noticed the kZVariant field in the Unihan.zip file. It seems to 
me that it makes sense to fold these together as well, correct?

Thanks for any information you care to provide.

- John Burger
  MITRE
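
A minimal sketch of the "fold only when the mapping is unambiguous" idea, assuming the tab-separated layout of the Unihan variants data (code point, field name, value). The sample lines are illustrative and should be checked against the actual file in Unihan.zip; the function names build_fold_table and fold are made up here.

```python
import re

# Illustrative lines in the Unihan layout: code point <TAB> field <TAB> value(s).
SAMPLE = (
    "U+5EE3\tkSimplifiedVariant\tU+5E7F\n"
    "U+767C\tkSimplifiedVariant\tU+53D1\n"
    "U+9AEE\tkSimplifiedVariant\tU+53D1\n"
    "U+4E7E\tkSimplifiedVariant\tU+4E7E U+5E72\n"
)

CODE_POINT = re.compile(r"U\+([0-9A-F]{4,6})")


def build_fold_table(lines, field="kSimplifiedVariant"):
    """Build a str.translate table that keeps only characters with exactly one target."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != field:
            continue  # not the field we are folding on
        source = chr(int(parts[0][2:], 16))
        # The regex ignores any annotation that follows a code point in the value.
        targets = {chr(int(cp, 16)) for cp in CODE_POINT.findall(parts[2])}
        if len(targets) == 1:
            (target,) = targets
            if target != source:
                table[ord(source)] = target
    return table


def fold(text, table):
    """Apply the folding table character by character."""
    return text.translate(table)


table = build_fold_table(SAMPLE.splitlines())
print(fold("廣東頭髮乾", table))
```

Characters with more than one listed target (乾 in the sample) are left untouched, which is the conservative reading of "when the mapping is unambiguous". Passing field="kZVariant" would build the analogous z-variant table.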





Re: Hanzi trad-simp folding and z-variants

2013-06-07 Thread john knightley
Resending email: Originally sent by mistake just to sender and not to list.

Dear John,

   Without looking at your texts I cannot say for certain; however, it
should be noted that "simplified", perhaps better thought of as "abbreviated",
is a relative term. A fully simplified passage of text will therefore
contain both simplified characters and those which have not been
simplified (that is, abbreviated) and are therefore tagged as traditional.

   The situation regarding Chinese documents is somewhat more complicated.
Working at character level is not the best way to go for your purposes;
larger units such as words or phrases produce much more meaningful results,
as this mimics the way a person reads Chinese: they do not process one
character at a time but rather word by word. Whilst the kZVariant field does
mean that characters can be, and frequently are, transposed, it does not tell
you when. Also, as said above, the probability is that you have ordinary
Chinese text written in the mainland style. Folding based on the
kZVariant field would either leave things unchanged or, if it changed
things, would misspell words; that is, the sounds, or in some cases the
appearance, would probably be similar, or the results would be homophones,
but they would not match any dictionaries.

   For information retrieval from Chinese documents you require, as a minimum,
a list of the words or phrases that you are looking for, and in simple terms
the longer the phrase, the more likely the match is to be correct. How
long is hard to say; it really depends on what information you are looking
for. A list of words such as 现代汉语常用词表 has over 50 thousand words in it; a
list with phrases would be longer.
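
As a rough illustration of matching against a word list rather than single characters, here is a sketch of forward maximum matching, one simple and well-known technique (not necessarily the one meant above). The tiny word list is an assumption for the example; a real one would be a list such as the one just mentioned.

```python
def segment(text, words):
    """Forward maximum matching: at each position take the longest listed word."""
    max_len = max((len(w) for w in words), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in words:
                result.append(candidate)
                i += size
                break
    return result


# Hypothetical miniature word list.
WORDS = {"现代", "汉语", "现代汉语", "常用", "词表"}
print(segment("现代汉语常用词表", WORDS))
```

Preferring the longest match reflects the point that longer units are more likely to be correct matches; characters not covered by the list fall through as single-character tokens.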

  In short, such a folding algorithm based on kZVariant would not be a good
idea. There are Chinese compatibility characters in Unicode which, if
present, it would probably be good to fold in, but these are not in the
scope of Unihan.

Regards
John Knightley


On Sat, Jun 8, 2013 at 4:00 AM, Stephan Stiller
stephan.stil...@gmail.com wrote:

  Hi John,

 This is one of those questions that I've been wondering about as well ...
 my guess would be yes that should work (and dealing with z-variants is
 something you'll likely need to do anyways), but there *must* be some
 published algorithm out there that specifically addresses the issue of
 differentiable and recoverable folding for indexing.

 This comes up in NLP all the time for case folding. My impression is that
 the folks there just fold everything into lowercase and later apply a
 so-called truecasing algorithm (aka truecaser). To someone like me this
 just seems like totally the wrong approach, but I'll be open to be
 convinced otherwise with the right empirical arguments.

 If you find some information on data structures and algorithms tailored to
 this problem in the area of indexing/querying, let me know.

 Stephan



 On 6/6/2013 12:54 PM, John D. Burger wrote:

 Hi there -

 I'm working on an information retrieval application for a collection of 
 Chinese documents, which appear to use a mix of traditional and simplified 
 characters. My intuition is that it makes sense to do traditional to 
 simplified folding for indexing and query processing (when the mapping is 
 unambiguous), but I'd be interested in opinions about this.

 Second, I just noticed the kZVariant field in the Unihan.zip file. It seems 
 to me that it makes sense to fold these together as well, correct?

 Thanks for any information you care to provide.

 - John Burger
  MITRE






Re: Hanzi trad-simp folding and z-variants

2013-06-07 Thread Stephan Stiller



simplified [is] better thought of as abbreviated
Part of this is a terminological argument. The historical situation is 
indeed more complicated than many people know, but the truth is also 
that, irrespective of e.g. people's past or present usage in handwriting, 
there have (in the past and esp. in the present) been printing traditions 
which you can pinpoint by political region and time, occasionally by 
publisher. Regardless of what exactly happened during the 
pre-simplification era, there are fairly stable traditions now.


[quote approximate and adapted:]
a []fully simplified[] passage of text will contain[] both 
simplified characters and those which have not been simplified [...] 
and therefore [be] tagged as traditional.
This depends on the algorithm used for tagging. And note that tagging 
doesn't in fact have to be a /binary/ classifier.†
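
To show what a non-binary tagger could look like, here is a minimal sketch that scores a passage instead of labelling it. The two character sets are tiny hypothetical stand-ins; real ones would be derived from variant data, and characters shared by both conventions are simply ignored.

```python
# Hypothetical sets of characters that occur in only one of the two conventions.
TRAD_ONLY = set("廣發髮頭國")
SIMP_ONLY = set("广发头国")


def script_scores(text):
    """Return the share of traditional-only vs simplified-only characters among marked ones."""
    trad = sum(ch in TRAD_ONLY for ch in text)
    simp = sum(ch in SIMP_ONLY for ch in text)
    marked = trad + simp
    if marked == 0:
        # Nothing distinguishes the two conventions in this text.
        return {"traditional": 0.0, "simplified": 0.0, "marked": 0}
    return {
        "traditional": trad / marked,
        "simplified": simp / marked,
        "marked": marked,
    }


print(script_scores("头发很长"))
print(script_scores("頭发"))
```

A fully simplified passage then scores as simplified even though most of its characters were never simplified, because only the characters that actually differ are counted.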


working at character level is not the best way to go for your 
purposes; larger units such as words or phrases produce much more 
meaningful results, as this mimics the way a person reads Chinese: they 
do not process one character at a time but rather word by word.
I don't think JohnB was suggesting character-based retrieval. (I mean, 
who in his right mind would want to do letter-based (and post–case 
folding) retrieval for English documents? :-) Okay – just a joke, this 
analogy isn't any good.) But of course you're right to point out that 
simplification or the reverse operation (what's the term for that? 
T-conversion maybe?) is word- and context-dependent on the edges.


A different point: I'm not suggesting imprecision, but people are partly 
used to this in text they've seen converted by those horrible tools you 
can find online for that purpose, and for some characters, people won't 
actually notice.


Whilst the kZVariant field does mean that characters can be, and 
frequently are, transposed

What do you mean by "transposed"? Could you give an example?

it does not tell you when. Also, as said above, the probability is that 
you have ordinary Chinese text written in the mainland style. Folding 
based on the kZVariant field would either leave things unchanged 
or, if it changed things, would misspell words; that is, the sounds, or 
in some cases the appearance, would probably be similar, or the results 
would be homophones, but they would not match any dictionaries.
But if all occurrences of everything you process are folded (folding to 
lower-case is often done in NLP), this isn't a problem. Again, I'm not 
recommending this as best practice, I'm just pointing it out.


There are Chinese compatibility characters in Unicode which, if present, 
it would probably be good to fold in, but these are not in the scope 
of Unihan.
And you remind me that z-variation is locale-dependent (see also † 
above). Anyways, I think it's hard to find examples of meaning-divergent 
z-variant words in modern Mandarin (MSM). I'm sure you or someone else 
will be able to quickly dig out examples, but really the question is 
what set of algorithms and data structures is best to address the 
general situation. Have locale-dependent folding tables? Allow a search 
term prefix that specifies "don't normalize or fold the following term"? 
Have secondary filters in your search that use a stricter model of 
character identity?
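
Two of these options (a "don't fold" prefix and a stricter secondary filter) can be sketched together. This is only an illustration under assumptions: the "=" prefix, the FoldingIndex class and the four-entry folding table are all made up, and a real system would plug in proper folding tables and tokenization.

```python
from collections import defaultdict

# Hypothetical folding table, far too small for real use.
FOLD = str.maketrans({"廣": "广", "頭": "头", "髮": "发", "發": "发"})


class FoldingIndex:
    """Index keyed on folded terms, with the original term kept in each posting."""

    def __init__(self):
        # folded term -> set of (doc_id, original term)
        self.postings = defaultdict(set)

    def add(self, doc_id, terms):
        for term in terms:
            self.postings[term.translate(FOLD)].add((doc_id, term))

    def search(self, query):
        """A leading '=' means: do not fold, match the term exactly as written."""
        exact = query.startswith("=")
        term = query[1:] if exact else query
        hits = self.postings.get(term.translate(FOLD), set())
        if exact:
            # Secondary filter with a stricter model of character identity.
            hits = {(doc, original) for doc, original in hits if original == term}
        return sorted({doc for doc, _ in hits})


index = FoldingIndex()
index.add(1, ["頭髮"])
index.add(2, ["头发"])
print(index.search("头发"))
print(index.search("=頭髮"))
```

Because every posting keeps the original term, the folding is recoverable: the folded key gives recall, and the stored original lets a stricter filter restore precision when asked for.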


Stephan