Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-21 Thread Kino
On Sunday, Jun 22, 2003, at 09:06 Asia/Tokyo, Philippe Verdy wrote:

> For the case of a prolonged sound mark after a Latin letter, I don't
> know how to classify this usage, but my translator insisted that it
> was correct, and refused to insert a space before it (and he was
> probably right if it's effectively interpreted as an extender of the
> last vowel, even if it's a Latin vowel).

Not everyone is familiar with character sets and character codes, so
native speakers quite often make mistakes of this kind.

Well, U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK, Shift_JIS 213C) 
*should* be used only after a hiragana or katakana letter. As a 
separator(?), we *should* use two consecutive U+2014 (EM DASH, 
Shift_JIS 213D).
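
A minimal sketch of how the rule above could be checked mechanically. It assumes Python 3.9 or later, and it treats "kana" as any code point in the Hiragana or Katakana blocks, which is a simplification of my own (those blocks also contain a few non-letters such as U+30FB, and halfwidth katakana is ignored). The function names are hypothetical.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # KATAKANA-HIRAGANA PROLONGED SOUND MARK


def is_kana(ch: str) -> bool:
    """True for code points in the Hiragana (U+3040..U+309F) or
    Katakana (U+30A0..U+30FF) blocks; U+30FC itself is in the latter."""
    return "\u3040" <= ch <= "\u30ff"


def suspicious_marks(text: str) -> list[int]:
    """Return the indexes of every U+30FC not directly preceded by kana."""
    hits = []
    for i, ch in enumerate(text):
        if ch == PROLONGED_SOUND_MARK and (i == 0 or not is_kana(text[i - 1])):
            hits.append(i)
    return hits


print(suspicious_marks("\u30b3\u30fc\u30d2\u30fc"))  # kana word, nothing flagged
print(suspicious_marks("Sony\u30fc"))                # mark after a Latin letter
```

A check like this only flags candidates; whether a flagged mark should really have been a dash is still a human decision.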

However, some people use a single U+2014 as a prolonged sound mark, 
often unknowingly, but sometimes knowingly because they prefer the 
shape of U+2014 to that of U+30FC.

Using U+30FC in place of two U+2014 is simply wrong. Many Japanese 
people make this mistake, presumably because they do not know that 
U+2014 (Shift_JIS 213D) is different from U+30FC (Shift_JIS 213C), 
and/or because U+30FC is easier to enter than U+2014 with a Japanese 
input method. However, you do not need to correct your translator. 
Japanese publishers seem to be well aware of common mistakes of this 
kind ;-)

Kino





Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-21 Thread Allen Haaheim
From: "Philippe Verdy" <[EMAIL PROTECTED]>
> It's difficult to imagine that this sound mark
> can be considered as an extender of a Latin letter, to which it does not
> really apply.

Yes, it doesn't apply.

> For the case of a prolonged sound mark after a Latin letter, I don't know
> how to classify this usage, but my translator insisted that it was
> correct, and refused to insert a space before it (and he was probably
> right if it's effectively interpreted as an extender of the last vowel,
> even if it's a Latin vowel).

We don't agree that U+30FC is ever grammatical after a Latin letter. Your
translator could have meant an em-dash. If he really meant U+30FC, we can't
agree.

Some people typing Japanese (on a Mac) don't bother to switch back to
English to enter a dash, and instead get U+30FC. The same key is used for
both. So there are plenty floating around out there. Also, its increasing
use with hiragana seems to be becoming accepted. This was not its original
function as an extender of katakana, so it appears to be in a state of flux.

Best wishes,

Allen




Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-21 Thread Philippe Verdy
From: "Allen Haaheim" <[EMAIL PROTECTED]>
> Philippe,
> Sorry to reopen a (closed?) case. The below look like loose ends to me.

I thought it was closed too. Well I can reply, but I will just give my opinion
after reading translations to Japanese performed by other people, and
hearing their comments.

> >For Japanese people, they consider this sign as a separate vowel whose
> >phonetic value depends on the phonetic value of the previous character
> >(which may have a point or double-point diacritic, for the voice mark used
> >to alter the consonant value of the base character). This is probably why
> >the transliteration of this character to Latin generally doubles the
> >previous Latin vowel.
> 
> "Separate" doesn't seem right. In my understanding it's an "extender" (as
> Andrew notes) of the final vowel sound of the previous kana (so mentioning
> diacritics, which affect only the initial consonant, is irrelevant). To be
> more exact, it doubles the length of the vowel final.

The term "separate" comes from the fact that it can be used in some cases
after some non-Hiragana and non-Katakana characters, for example after
imported Latin-written words. It's difficult to imagine that this sound mark
can be considered as an extender of a Latin letter, to which it does not
apply really.

> >However, this character is not strictly a diacritic, as there are some uses
> >of the character (according to grammatical rules) after a punctuation sign
> >used to separate it from an imported foreign word (most often a proper
> >name), sometimes written with another script.

I have no sample to give you immediately, but I saw it in Japanese
translations that I gave to a native Japanese speaker, who used the sound
mark after imported names (that were not transliterated to Hiragana or
Katakana).

When I asked whether there should be a space between the imported
name and the rest of the Japanese text, the translator explained to me
that this was a common use for imported names that were best written
without being transliterated, such as trademarks or company names.

Well, I must admit that I am sometimes surprised by the way some
languages alter the ending of a trademark or a person's name
according to common grammatical rules that are probably valid for
names used in the corresponding countries, but that look ugly for
imported names, as this sometimes creates conflicts with distinct
foreign trademarks or foreign people.

I can't verify whether they are correctly interpreting a national
grammatical rule. Each time this happens, I try to suggest a less literal
translation that would be grammatically correct but that would respect,
if possible, the original name (which should be given at least once in
its original, unique and normally invariable spelling).

For the case of a prolonged sound mark after a Latin letter, I don't know
how to classify this usage, but my translator insisted that it was
correct, and refused to insert a space before it (and he was probably
right if it's effectively interpreted as an extender of the last vowel, even
if it's a Latin vowel).

My knowledge of Japanese is limited to performing some dictionary
checks to verify the content of a translation, checking its encoding,
or exchanging messages with translators. But I cannot read it "in the text"...

If you have a better knowledge of Japanese than I do, I won't try to convince
you of anything, as my interpretation may simply use terms that are
inaccurate from your point of view. But if you are not a native Japanese
speaker, your academic study of the Japanese language may have missed some
local usages that native Japanese writers (or translators) accept and use
quite commonly.

Only a native Japanese speaker could explain whether that usage is simply
abusive and considered incorrect, or whether it's common.



Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-21 Thread Allen Haaheim
Philippe,

Sorry to reopen a (closed?) case. The below look like loose ends to me.

>For Japanese people, they consider this sign as a separate vowel whose
>phonetic value depends on the phonetic value of the previous character
>(which may have a point or double-point diacritic, for the voice mark used
>to alter the consonant value of the base character). This is probably why
>the transliteration of this character to Latin generally doubles the
>previous Latin vowel.

"Separate" doesn't seem right. In my understanding it's an "extender" (as
Andrew notes) of the final vowel sound of the previous kana (so mentioning
diacritics, which affect only the initial consonant, is irrelevant). To be
more exact, it doubles the length of the vowel final.
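
A small sketch of the "extender" reading, assuming a doubled-vowel romanization. The four-entry kana table is hypothetical and only covers the two example words. The voiced kana (U+30C7, "de") shows that the voicing diacritic changes only the consonant, while the mark repeats whatever vowel ended the previous syllable.

```python
# Hypothetical table, just large enough for the two example words below.
ROMAJI = {
    "\u30b3": "ko",  # KATAKANA LETTER KO
    "\u30d2": "hi",  # KATAKANA LETTER HI
    "\u30c7": "de",  # KATAKANA LETTER DE (voiced form of TE)
    "\u30bf": "ta",  # KATAKANA LETTER TA
}
PROLONGED_SOUND_MARK = "\u30fc"


def romanize(kana: str) -> str:
    """Romanize kana, writing U+30FC as a repeat of the preceding vowel."""
    out = []
    for ch in kana:
        if ch == PROLONGED_SOUND_MARK:
            if not out:
                raise ValueError("prolonged sound mark with nothing before it")
            out.append(out[-1][-1])  # last letter of the previous syllable
        else:
            out.append(ROMAJI[ch])
    return "".join(out)


print(romanize("\u30b3\u30fc\u30d2\u30fc"))  # the katakana word for "coffee"
print(romanize("\u30c7\u30fc\u30bf"))        # the katakana word for "data"
```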

>However, this character is not strictly a diacritic, as there are some uses
>of the character (according to grammatical rules) after a punctuation sign
>used to separate it from an imported foreign word (most often a proper
>name), sometimes written with another script.

We can't think of any instances of such a use here. Can you give an example?

Allen




RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-06 Thread Mount, Rob (Robert F)

Mark,

Thanks again for your response.

I understand what you say about word formation and combining marks, and
that the Alphabetic classification should not be limited to "L"s.  But
30FC is of General Category "Lm" (which should be included) and, since
version 3.1, is classified explicitly as Alphabetic in
DerivedCoreProperties.txt.  (It appears that formal expression of the
Alphabetic property was moved from PropList.txt to
DerivedCoreProperties.txt in 3.1.)  I don't understand why its exclusion
from the Alphabetic category in 3.0.1 was not an oversight.  But if not,
then either the consortium consensus on the classification of this
character has changed, or the current classification is in error.

Here's a little more background regarding my motivation.  The problem
occurs in a procedure that evaluates whether a user-supplied name can be
used as an identifier, for which identification of alphabetic characters
is important.  One implementation of isalpha(), purportedly based on
Unicode 2.1, indicates that 30FC is an alpha character.  The current
implementation from the same vendor, based on 3.0.1, classifies it as
non-alpha.  Presumably the next one will be based on 3.1 or later and
will reclassify it, again, as alpha.

I have since discovered section 5.16 of the spec, which describes the
Unicode standard for identifier formation, and frankly, our validation
algorithm is a bit naive and will require some work.  But our use of
isalpha() is not, I think, fundamentally flawed; the changes will require
only that we include some additional characters that are not currently
considered valid.  Certainly if the behavior of isalpha() did not change,
the existing algorithm would at least be stable across different
platforms, warts and all.  If we can't depend on uniform behavior of
isalpha() we will have to eliminate its use from our validation function.
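
For illustration only, here is a sketch of a validator that does not depend on isalpha() at all. It assumes a rule based on General Category: letters and letter numbers may start an identifier, and combining marks, decimal digits and connector punctuation may follow. That rule set is my paraphrase of the usual Unicode identifier syntax, not a quotation of section 5.16, and the function name is hypothetical.

```python
import unicodedata

# Categories allowed in the first position, and the extra ones allowed later.
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}


def is_valid_identifier(name: str) -> bool:
    """Check a candidate identifier one code point at a time."""
    if not name:
        return False
    if unicodedata.category(name[0]) not in START:
        return False
    return all(unicodedata.category(ch) in CONTINUE for ch in name[1:])


for candidate in ["\u30c7\u30fc\u30bf", "a1", "1abc", "can't"]:
    print(repr(candidate), is_valid_identifier(candidate))
```

The result still depends on which version of the character database the runtime carries, so this moves the stability question rather than removing it.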

So I am trying to discover why the behavior of isalpha() has changed.
Here are the possibilities:
1) the previous implementation was incorrect and the current one is
   fixed;
2) the current implementation is flawed because it does not conform to
   the documented standard;
3) the current implementation is flawed because it's based on incorrect
   documentation of the standard;
4) both implementations are correct but are based on different,
   incompatible standards;
5) something else I don't yet understand.

The overriding assumption for this entire discussion is that the behavior
of isalpha() should be governed by the Unicode Alphabetic property.  That
seems reasonable to me and is, in fact, the vendor's claim.  If not (or
even if so), perhaps someone can recommend a better (or more stable) API
for discovery of Unicode character metrics upon which we might base our
identifier validation and other character processing logic.
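
One way to make the version dependence visible, sketched with Python's standard unicodedata module rather than the C library discussed here (that substitution is mine). The module reports which UCD version it was built from, and CPython also ships a frozen 3.2.0 snapshot, so the same question can be put to two database versions side by side. The helper name is hypothetical.

```python
import unicodedata


def category_by_database(ch: str) -> dict[str, str]:
    """Map each available UCD version to the General Category it gives ch."""
    current = unicodedata
    frozen = unicodedata.ucd_3_2_0  # frozen 3.2.0 snapshot shipped with CPython
    return {
        current.unidata_version: current.category(ch),
        frozen.unidata_version: frozen.category(ch),
    }


print(category_by_database("\u30fc"))
```

Logging the database version next to each classification result makes it easier to tell a data change from an implementation bug.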

Comments anyone?

Rob








Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-06 Thread Mark Davis
Ah, I see why you didn't find the Alphabetic property. It was added in
Unicode 3.1.0 (March 2001), precisely to capture characters that are
not L yet are still alphabetic. If you look at the derivation in
C:\DATA\UCD\3.1.0-Update\DerivedCoreProperties-3.1.0.txt, you will
see:

# Derived Property: Alphabetic
#  Generated from: Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic

So Alphabetic includes all L's, but also other characters. And, as I
said, it alone is not sufficient for word breaks.
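
The derivation above can be approximated from the General Category alone. This is a sketch under a stated limitation: Python's unicodedata module does not expose Other_Alphabetic, so characters that are Alphabetic only through that list come out as False here. The function name is hypothetical.

```python
import unicodedata

# The General Category part of the derivation: Lu+Ll+Lt+Lm+Lo+Nl.
ALPHABETIC_GC = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def alphabetic_by_gc(ch: str) -> bool:
    """Approximate Alphabetic, ignoring the Other_Alphabetic contribution."""
    return unicodedata.category(ch) in ALPHABETIC_GC


for ch in ["A", "\u30fc", "\u2160", "'", "\u094d"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), alphabetic_by_gc(ch))
```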

> Is the omission of 30FC from the Alphabetic category of PropList.txt an
> error?

This is not an oversight. As I said, many characters are not
Alphabetic and are still part of words. Examples include that
character and many others. As a simple case, "can't" is a word in
English, although the apostrophe is not alphabetic. There are many,
many examples using combining marks, such as a virama (halant) in
Hindi, which is not Alphabetic:

http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?ch=094D

So if you want reasonable word-breaks, you need to use more than the L
category; you need to look at
http://www.unicode.org/reports/tr14/
http://www.unicode.org/reports/tr29/
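
To make the point concrete, here is a deliberately naive tokenizer that treats every maximal run of isalpha() characters as a word. It uses Python's str.isalpha(), which is documented as true for the letter categories only, as a stand-in for the C function; the function name is hypothetical. Both examples from this message get split in the wrong place.

```python
from itertools import groupby


def naive_words(text: str) -> list[str]:
    """Split text into maximal runs of characters for which isalpha() is true."""
    return [
        "".join(run)
        for is_alpha, run in groupby(text, key=str.isalpha)
        if is_alpha
    ]


print(naive_words("can't"))  # the apostrophe breaks the word in two

# A Hindi word spelled HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II:
# every combining sign ends the current run, leaving bare consonants.
print(naive_words("\u0939\u093f\u0928\u094d\u0926\u0940"))
```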

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄


RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-06 Thread Mount, Rob (Robert F)

Thanks to all who responded.  The insight you provided is invaluable.  And I
appreciate your patience with a UNICODE beginner.

Mark's reference to UCD.html, and by inference to DerivedCoreProperties.txt,
seems definitive.  However, these are part of the 4.0 spec.  The suspect
implementation of isalpha is based, according to the vendor, on 3.0.1.

The vendor relies, instead, on
http://www.unicode.org/Public/3.0-Update1/PropList-3.0.1.txt
which classifies 30FC as Diacritic, Extender, Bidi: Left-to-Right, and
Identifier Part, but not as Alphabetic.  Is this an error in the
specification?  I could find no reference to the Alphabetic property in
the 3.0.1 documentation except in, and with reference to,
PropList-3.0.1.txt.  However, it would seem, from the 4.0 documentation,
that all characters having a General Category beginning with "L" should
be considered as letters, and hence, implicitly, as Alphabetic.

Is this, indeed, the intent of the General Category classifications
beginning with "L"?

Is the omission of 30FC from the Alphabetic category of PropList.txt an
error?

Rob




Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-06 Thread Mark Davis
The UCD has a property explicitly called "Alphabetic". So that should
be used when determining whether a character is, well, alphabetic.
See http://www.unicode.org/Public/UNIDATA/UCD.html
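
A sketch of reading that property from a UCD-style property file instead of relying on a library function. The parser assumes the usual layout of "code point or range ; property # comment". The sample lines are hand-written in that layout for the example and are not copied from any release of the data files; the function names are hypothetical.

```python
def parse_property_file(lines, wanted="Alphabetic"):
    """Collect inclusive (first, last) code point ranges that carry `wanted`."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        codes, prop = (field.strip() for field in data.split(";")[:2])
        if prop != wanted:
            continue
        first, _, last = codes.partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def has_property(cp, ranges):
    """True when code point cp falls inside any collected range."""
    return any(first <= cp <= last for first, last in ranges)


# Hand-written sample in the file's layout (an assumption, see above).
SAMPLE = [
    "# Derived Property: Alphabetic",
    "0041..005A    ; Alphabetic # LATIN CAPITAL LETTER A..Z",
    "30FC          ; Alphabetic # KATAKANA-HIRAGANA PROLONGED SOUND MARK",
    "0030..0039    ; Hex_Digit  # DIGIT ZERO..DIGIT NINE",
]

ranges = parse_property_file(SAMPLE)
print(has_property(0x30FC, ranges), has_property(ord("'"), ranges))
```

Reading the data file directly ties the answer to one known version of the database, at the cost of shipping and updating that file yourself.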

However, in the past many people have misused functions like isAlpha()
for doing more complicated processing like determining text boundaries
(line and word breaks, for example). The function isAlpha() does not
discriminate finely enough to be very accurate for processing like
that. For more information, see
http://www.unicode.org/reports/tr14/
http://www.unicode.org/reports/tr29/

Also see the proposed update to Unicode Regular Expressions, for
discussion of the use of Unicode properties in connection with alpha,
punct, etc. (in the context of regular expressions, at least).
http://www.unicode.org/reports/tr18/tr18-7.html#Compatibility_Properties

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message - 
From: "Mount, Rob (Robert F)" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, June 04, 2003 16:11
Subject: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND
MARK


> All,
> I am investigating differing behavior in various environments of the
> wide-character version of the C function isAlpha with respect to
> character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK. Some
> implementations indicate that it is alphabetic, some don't. I
> suspect that other characters might be subject to the same
> confusion.
>
> The UNICODE documents seem ambiguous on this point: the General
> Category is "Lm" which, although informative instead of normative,
> would seem to imply that it is alphabetic; likewise
> DerivedCoreProperties-4.0.0.txt indicates that it is alphabetic; but
> PropList-4.0.0.txt contains two records - one indicating that it is
> a diacritic, one that indicates it is an extender.
>
> On to my questions:
>
> Q1: Can a character be both alphabetic and diacritic?
>
> Q2: Is there a definitive answer as to whether this is an alphabetic
> character?
>
> Thanks in advance for answers to these questions and/or any
> additional insight you can provide.
>
> Regards,
> Rob Mount




RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-05 Thread Marco Cimarosti
Rob Mount
> Q1: Can a character be both alphabetic and diacritic?

I would say yes. My understanding of the "Lm" general category is: "a
diacritic letter".

> Q2: Is there a definitive answer as to whether this is an alphabetic 
> character?

Strictly speaking, as katakana and hiragana are not alphabets, their letters
cannot be called "alphabetic".

But I guess that you mean "alphabetic" in the sense that isalpha() should
return TRUE for it, i.e. in the sense that it is a character used to write
*words* in the orthography of some language. In this sense, yes, IMHO:
isalpha() should return true for all the characters having general category
"L...".

_ Marco



Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-05 Thread Philippe Verdy
My opinion is that it can be viewed, depending on its application, as a letter (for 
some transliteration purpose), or as a diacritic (for some other transliterations). 
But in reality it is mostly a letter modifier. For UCA, it sorts mostly like the base 
letter that it modifies, and UCA gives the most appropriate linguistic value of this 
character.

This is not the only character of this type in Unicode. You'll find similar sound 
marks (length marks, repeat marks) in other scripts, including abjads, and in IPA (the 
IPA colon-like sign, for example).

Japanese people consider this sign a separate vowel whose phonetic value depends on 
the phonetic value of the previous character (which may have a point or double-point 
diacritic, the voicing mark used to alter the consonant value of the base character). 
This is probably why the transliteration of this character to Latin generally doubles 
the previous Latin vowel.

However, this character is not strictly a diacritic, as there are some uses of the 
character (according to grammatical rules) after a punctuation sign used to separate 
it from an imported foreign word (most often a proper name), sometimes written with 
another script. So the sign has its own lexical and grammatical semantics, and does 
not really combine like other diacritics.

You had better handle it as alphabetic (and this is reflected by its general 
category, which indicates it is a letter). For your application, the isalpha() C 
function is generally used to create word tokens. Word tokenization often requires 
grouping letters and diacritics at least, without creating a break between a previous 
character and the prolonged sound mark. Because the character is not combining (it can 
be used after a punctuation mark, separator or symbol to prolong the sound before this 
punctuation), it needs to be handled as alphabetic.

Another case to consider is line-breaking: a line break can occur before that 
character, something that would not be permitted if it was handled as a combining 
character.

If your isAlpha() function doesn't do that, it would require you to handle this 
character as an exception in almost all cases to respect its linguistic value. Do you 
need this complication in your application code?

-- Philippe.



Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

2003-06-05 Thread Andrew C. West
On Wed, 4 Jun 2003 18:11:48 -0500 , "Mount, Rob (Robert F)" wrote:

> I am investigating differing behavior in various environments of the 
> wide-character version of the C function isAlpha with respect to 
> character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK.
> 
> The UNICODE documents seem ambiguous on this point: the General 
> Category is "Lm" which, although informative instead of normative, 
> would seem to imply that it is alphabetic; likewise 
> DerivedCoreProperties-4.0.0.txt indicates that it is alphabetic; but 
> PropList-4.0.0.txt contains two records - one indicating that it is 
> a diacritic, one that indicates it is an extender.

U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) is, I would say, identical in
function to U+02D0 (MODIFIER LETTER TRIANGULAR COLON) that is used to indicate a
long vowel in IPA. Both U+30FC and U+02D0 are signs that are appended to a
character representing a vowel to indicate that it is a long vowel sound.

Both U+30FC and U+02D0 have a General Category of "Lm" (Modifier_Letter), and in
PropList.txt are included under the Extender property. However only U+30FC is
also included under the Diacritic property. Likewise, U+1843 (MONGOLIAN LETTER
TODO LONG VOWEL SIGN), which has a similar function to U+30FC, is classified as
an Extender but not as a Diacritic.

The definition of "Extender" in UCD.html is :

"Characters whose principal function is to extend the value or shape of a
preceding alphabetic character. Typical of these are length and iteration marks."

U+30FC, U+02D0 and U+1843 are indeed all "length marks", and are rightly
classified as Extenders.
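
The General Category of the three characters can be checked with a few lines of Python. This is only a sketch using the standard unicodedata module; it reports the name and General Category, since the Extender and Diacritic properties from PropList.txt are not exposed by that module. The helper name is hypothetical.

```python
import unicodedata

LENGTH_MARKS = ["\u30fc", "\u02d0", "\u1843"]


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point label, General Category, character name)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


for ch in LENGTH_MARKS:
    print(*describe(ch))
```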

But why then is U+30FC alone also classified as a Diacritic (according to
UCD.html "Characters that linguistically modify the meaning of another character
to which they apply") ? As far as I am aware U+30FC does not "linguistically
modify the meaning of another character" other than lengthen a preceding vowel.

Andrew