Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

Mark Davis Fri, 06 Jun 2003 06:58:28 -0700

Ah, I see why you didn't find the Alphabetic property. It was added in
Unicode 3.1.0 (March 2001), precisely to capture characters that are
not L yet are still alphabetic. If you look at the derivation in
C:\DATA\UCD\3.1.0-Update\DerivedCoreProperties-3.1.0.txt, you will
see:


# Derived Property: Alphabetic
#  Generated from: Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic

So Alphabetic includes all L's, but also other characters. And, as I
said, it alone is not sufficient for word breaks.
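
To make the derivation above concrete, here is a minimal Python sketch.
It is not the UCD generator. The name is_alphabetic_sketch and the
one-entry OTHER_ALPHABETIC_SAMPLE set are assumptions made for
illustration; the real Other_Alphabetic list is in PropList.txt and is
far longer.

```python
import unicodedata

# General categories named in the derivation: Lu+Ll+Lt+Lm+Lo+Nl
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Hypothetical, hand-picked stand-in for Other_Alphabetic.
# The real list lives in PropList.txt and has many more entries.
OTHER_ALPHABETIC_SAMPLE = {0x0345}  # COMBINING GREEK YPOGEGRAMMENI


def is_alphabetic_sketch(ch):
    """Approximate the derived Alphabetic property for one character."""
    return (
        unicodedata.category(ch) in ALPHABETIC_CATEGORIES
        or ord(ch) in OTHER_ALPHABETIC_SAMPLE
    )


# U+30FC has General Category Lm, so the derivation picks it up.
print(unicodedata.category("\u30fc"), is_alphabetic_sketch("\u30fc"))
# U+094D (virama) is Mn and not in the sample set.
print(unicodedata.category("\u094d"), is_alphabetic_sketch("\u094d"))
```

With a full Other_Alphabetic list the same union would cover the
remaining non-L characters as well.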

> Is the omission of 30FC from the Alphabetic category of PropList.txt
> an error?

This is not an oversight. As I said, many characters are not
Alphabetic and are still part of words; U+30FC is one example among
many. As a simple case, "can't" is a word in English, although the
apostrophe is not alphabetic. There are many, many examples using
combining marks, such as a virama (halant) in Hindi, which is not
Alphabetic:

http://oss.software.ibm.com/cgi-bin/icu/ub/utf-8/?ch=094D

So if you want reasonable word breaks, you need to use more than the L
category. You need to look at:
http://www.unicode.org/reports/tr14/
http://www.unicode.org/reports/tr29/
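
As a rough illustration of why a letters-only test gives poor word
breaks, the Python sketch below groups text into runs of "letters".
The function name naive_words is hypothetical. It relies on Python's
str.isalpha(), which tests the L general categories only (not the UCD
Alphabetic property), and it does not implement TR14 or TR29.

```python
from itertools import groupby


def naive_words(text):
    """Split text into maximal runs of characters where str.isalpha() is true."""
    return [
        "".join(run)
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]


# The apostrophe is not a letter, so the English word is cut in two.
print(naive_words("can't"))

# "Hindi" in Devanagari: the vowel signs (Mc) and the virama (Mn) are not
# in any L category, so every consonant ends up as its own "word".
print(naive_words("\u0939\u093f\u0928\u094d\u0926\u0940"))
```

Both inputs are single words to a reader, which is the kind of case the
two reports above address.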

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message ----- 
From: "Mount, Rob (Robert F)" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, June 05, 2003 11:57
Subject: RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED
SOUND MARK


>
> Thanks to all who responded.  The insight you provided is invaluable.
> And I appreciate your patience with a UNICODE beginner.
>
> Mark's reference to UCD.html, and by inference to
> DerivedCoreProperties.txt, seems definitive.  However, these are part
> of the 4.0 spec.  The suspect implementation of isalpha is based,
> according to the vendor, on 3.0.1.
>
> The vendor relies, instead, on
> http://www.unicode.org/Public/3.0-Update1/PropList-3.0.1.txt
> which classifies 30FC as Diacritic, Extender, Bidi: Left-to-Right, and
> Identifier Part, but not as Alphabetic.  Is this an error in the
> specification?  I could find no reference to the Alphabetic property
> in the 3.0.1 documentation except in, and with reference to,
> PropList-3.0.1.txt.
> However, it would seem, from the 4.0 documentation, that all
> characters having a General Category beginning with "L" should be
> considered as letters, and hence, implicitly, as Alphabetic.
>
> Is this, indeed, the intent of the General Category classifications
> beginning with "L"?
>
> Is the omission of 30FC from the Alphabetic category of PropList.txt
> an error?
>
> Rob
>
> -----Original Message-----
> From: Mark Davis [mailto:[EMAIL PROTECTED]
> Sent: Thursday, June 05, 2003 9:28 AM
> To: Mount, Rob (Robert F); [EMAIL PROTECTED]
> Subject: Re: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED
SOUND
> MARK
>
>
> The UCD has a property explicitly called "Alphabetic", so that should
> be used when determining whether a character is, well, alphabetic.
> See http://www.unicode.org/Public/UNIDATA/UCD.html
>
> However, in the past many people have misused functions like
> isAlpha() for doing more complicated processing like determining text
> boundaries (line and word breaks, for example). The function isAlpha()
> does not discriminate finely enough to be very accurate for processing
> like that. For more information, see
> http://www.unicode.org/reports/tr14/
> http://www.unicode.org/reports/tr29/
>
> Also see the proposed update to Unicode Regular Expressions, for
> discussion of the use of Unicode properties in connection with alpha,
> punct, etc. (in the context of regular expressions, at least).
>
> http://www.unicode.org/reports/tr18/tr18-7.html#Compatibility_Properties
>
> Mark
> __________________________________
> http://www.macchiato.com
> ►  “Eppur si muove” ◄
>
> ----- Original Message ----- 
> From: "Mount, Rob (Robert F)" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Wednesday, June 04, 2003 16:11
> Subject: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND
> MARK
>
>
> > All,
> > I am investigating differing behavior in various environments of
> > the wide-character version of the C function isAlpha with respect to
> > character U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK. Some
> > implementations indicate that it is alphabetic, some don't. I
> > suspect that other characters might be subject to the same
> > confusion.
> >
> > The UNICODE documents seem ambiguous on this point: the General
> > Category is "Lm" which, although informative instead of normative,
> > would seem to imply that it is alphabetic; likewise
> > DerivedCoreProperties-4.0.0.txt indicates that it is alphabetic; but
> > PropList-4.0.0.txt contains two records - one indicating that it is
> > a diacritic, one that indicates it is an extender.
> >
> > On to my questions:
> >
> > Q1: Can a character be both alphabetic and diacritic?
> >
> > Q2: Is there a definitive answer as to whether this is an alphabetic
> > character?
> >
> > Thanks in advance for answers to these questions and/or any
> > additional insight you can provide.
> >
> > Regards,
> > Rob Mount
> >
