Re: full-width Latin missing from confusables data

2013-10-29 Thread Mark Davis ☕
FYI, I just submitted a doc to the UTC for the upcoming meeting:

#36 & #39 Recommendations

http://goo.gl/NKeRVB

If there is any feedback you'd like me to incorporate in a revision before
the meeting, please let me know.

Mark




On Tue, Oct 15, 2013 at 8:53 PM, Mark Davis ☕  wrote:

> > but as Michel mentioned the data
> > does not seem consistent in that case.
>
> You might add that to your report...
>
>
>
> Mark 
>
>
> On Tue, Oct 15, 2013 at 7:23 PM, Chris Weber  wrote:
>
>> On 10/14/2013 12:40 AM, Mark Davis ☕ wrote:
>> > For the confusables, the presumption is that implementations have
>> > already either normalized the input to NFKC or have rejected input that
>> > is not NFKC.
>>
>> Thanks for the explanation Mark.  It makes sense for implementations
>> which want to detect confusability, but as Michel mentioned the data
>> does not seem consistent in that case.  Another case could be
>> implementations which want to generate confusable strings for testing -
>> do you think those could be improved by having this extra data?  For
>> example:
>>
>> http://unicode.org/cldr/utility/confusables.jsp?a=m&r=None
>>
>> > It would probably be worth clarifying this in the text of
>> > http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an
>> > upcoming UTC meeting at the start of Nov., so if you want to suggest
>> > that or any other improvements, you should use the
>> > http://www.unicode.org/reporting.html.
>>
>> Thank you, I'll file a report.
>>
>> --
>> Best regards,
>> Chris Weber - ch...@lookout.net - http://www.lookout.net
>> PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7
>>
>
>


Re: full-width Latin missing from confusables data

2013-10-15 Thread Mark Davis ☕
> but as Michel mentioned the data
> does not seem consistent in that case.

You might add that to your report...



Mark 


On Tue, Oct 15, 2013 at 7:23 PM, Chris Weber  wrote:

> On 10/14/2013 12:40 AM, Mark Davis ☕ wrote:
> > For the confusables, the presumption is that implementations have
> > already either normalized the input to NFKC or have rejected input that
> > is not NFKC.
>
> Thanks for the explanation Mark.  It makes sense for implementations
> which want to detect confusability, but as Michel mentioned the data
> does not seem consistent in that case.  Another case could be
> implementations which want to generate confusable strings for testing -
> do you think those could be improved by having this extra data?  For
> example:
>
> http://unicode.org/cldr/utility/confusables.jsp?a=m&r=None
>
> > It would probably be worth clarifying this in the text of
> > http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an
> > upcoming UTC meeting at the start of Nov., so if you want to suggest
> > that or any other improvements, you should use the
> > http://www.unicode.org/reporting.html.
>
> Thank you, I'll file a report.
>
> --
> Best regards,
> Chris Weber - ch...@lookout.net - http://www.lookout.net
> PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7
>


Re: full-width Latin missing from confusables data

2013-10-15 Thread Chris Weber
On 10/14/2013 12:40 AM, Mark Davis ☕ wrote:
> For the confusables, the presumption is that implementations have
> already either normalized the input to NFKC or have rejected input that
> is not NFKC. 

Thanks for the explanation Mark.  It makes sense for implementations
which want to detect confusability, but as Michel mentioned the data
does not seem consistent in that case.  Another case could be
implementations which want to generate confusable strings for testing -
do you think those could be improved by having this extra data?  For
example:

http://unicode.org/cldr/utility/confusables.jsp?a=m&r=None

> It would probably be worth clarifying this in the text of
> http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an
> upcoming UTC meeting at the start of Nov., so if you want to suggest
> that or any other improvements, you should use the
> http://www.unicode.org/reporting.html.

Thank you, I'll file a report.

-- 
Best regards,
Chris Weber - ch...@lookout.net - http://www.lookout.net
PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7



RE: full-width Latin missing from confusables data

2013-10-14 Thread Michel Suignard
> For the confusables, the presumption is that implementations have already 
> either normalized the input to NFKC or have rejected input that is not NFKC.

Agreed. However, the data is inconsistent: it contains some of these 
fullwidth Latin characters but not all of them. We should have either none 
or all of them. I would be tempted to remove the ones currently in the set. 
Anyone concerned about confusability ought to apply NFKC first (or make sure 
that the target repertoire is stable under NFKC).

Michel
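
To illustrate the "stable under NFKC" check mentioned above, here is a minimal Python sketch. It is not taken from UTS #39 or from any tool discussed in this thread; the helper name unstable_under_nfkc and the two sample repertoires are assumptions made for the example. It only relies on the standard unicodedata module.

```python
import unicodedata


def unstable_under_nfkc(repertoire):
    """Return the characters of `repertoire` that NFKC changes.

    Hypothetical helper: an empty result means the repertoire is
    stable under NFKC.
    """
    return [ch for ch in repertoire if unicodedata.normalize("NFKC", ch) != ch]


# Fullwidth Latin letters: U+FF21..U+FF3A (capital) and U+FF41..U+FF5A (small).
fullwidth_latin = [chr(cp) for cp in range(0xFF21, 0xFF3B)]
fullwidth_latin += [chr(cp) for cp in range(0xFF41, 0xFF5B)]

# The corresponding ASCII letters, in the same order.
ascii_latin = [chr(cp) for cp in range(0x41, 0x5B)]
ascii_latin += [chr(cp) for cp in range(0x61, 0x7B)]

# Every fullwidth letter is changed by NFKC; no ASCII letter is.
changed_fullwidth = unstable_under_nfkc(fullwidth_latin)
changed_ascii = unstable_under_nfkc(ascii_latin)

# NFKC folds the whole fullwidth block onto the ASCII letters.
folded = unicodedata.normalize("NFKC", "".join(fullwidth_latin))
```

Under this check, none of the fullwidth Latin letters survives NFKC, which is why a repertoire that is already NFKC-stable never needs them as separate confusable entries.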



Re: full-width Latin missing from confusables data

2013-10-14 Thread Mark Davis ☕
For the confusables, the presumption is that implementations have already
either normalized the input to NFKC or have rejected input that is not
NFKC.
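
As a rough sketch of that presumption (normalize to NFKC, or reject input that is not already NFKC), something like the following Python could run before any confusables lookup. The function name normalize_or_reject and its reject flag are hypothetical, chosen only for this illustration; unicodedata.normalize is the standard-library call.

```python
import unicodedata


def normalize_or_reject(text, reject=False):
    """Return `text` in NFKC form.

    Hypothetical helper: with reject=True, raise ValueError instead of
    silently normalizing input that is not already in NFKC form.
    """
    nfkc = unicodedata.normalize("NFKC", text)
    if reject and nfkc != text:
        raise ValueError("input is not in NFKC form")
    return nfkc


# U+FF4D FULLWIDTH LATIN SMALL LETTER M folds to U+006D under NFKC,
# so a later confusables check only ever sees the ASCII letter.
folded = normalize_or_reject("\uff4dark")
```

With either policy, U+FF4D never reaches the confusables data as a distinct character: it is either folded to U+006D or the input is refused.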

More broadly, in gathering data the main emphasis is on characters that fit
the profile in http://www.unicode.org/reports/tr39/#Identifier_Characters,
including scripts like Cyrillic (
http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts). So while
we do add characters outside of that, there has been no concerted effort to
do so.

In particular, you should not allow scripts like Buginese
(http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Exclusion_from_Identifiers)
or Lisu (http://www.unicode.org/reports/tr31/#Table_Limited_Use_Scripts)
in your identifiers without recognizing that the confusable data will be
sketchy for those.

It would probably be worth clarifying this in the text of
http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an
upcoming UTC meeting at the start of Nov., so if you want to suggest that
or any other improvements, you should use the reporting form at
http://www.unicode.org/reporting.html.


Mark 


On Sun, Oct 13, 2013 at 7:36 PM, Chris Weber  wrote:

> While looking closer at the current confusables data, I've noticed that
> several of the fullwidth code points seem to be missing from the
> confusables data. For example, U+FF4D FULLWIDTH LATIN SMALL LETTER M
> does not exist as a confusable for U+006D LATIN SMALL LETTER M, as well
> as several others I've noticed.
>
> Was this intentional?
>
> Also, I'm not clear on the difference between the confusables.txt and
> confusablesSummary.txt - are these meant to provide the same data in
> different formats?
>
> --
> Best regards,
> Chris Weber - ch...@lookout.net - http://www.lookout.net
> PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7
>
>