Re: Need: list of Unicode characters that have canonical decompositions.

Karl Williamson Mon, 27 Jun 2011 13:02:14 -0700

On 06/27/2011 08:26 AM, BobH wrote:

A project I'm working on needs to build a list of all Unicode characters
that have canonical decompositions. The most efficient ways I can think
of to get such a list are from unicore/Decomposition.pl or by scanning
unicore/UnicodeData.txt. However:


Re unicore/Decomposition.pl, the header of this says:

# !!!!!!! INTERNAL PERL USE ONLY !!!!!!!
# This file is for internal use by the Perl program only. The format and even
# the name or existence of this file are subject to change without notice.
# Don't use it directly.


Re unicore/UnicodeData.txt, I've recently posted a version of my module
that uses unicore/UnicodeData.txt to CPAN, and from Perl 5.14 testers
I've received only failure notices which indicate that the file cannot
be found :-(

Unicode::UCD can tell me if a specific character has a decomposition,
but can't give me a list of characters that have decompositions.

Any suggestions would be appreciated.

Bob

I'm presuming you need this not for a one-time only thing, but to be able to run this program over and over. You can always download UnicodeData.txt from the Unicode web site. In a regular expression, \p{Dt= can} (Decomposition_Type=Canonical) will match all characters that you want. I'm thinking that 5.16 will have the stringification of that regex include the list you want, but not in 5.14, and stringification is not necessarily fixed either.
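
As a rough sketch of the regular-expression approach, the list can be built by testing every code point against the property. The subroutine name canonical_decomposables is made up for illustration, and the loop is brute force over the whole code space, so treat it as a starting point rather than an efficient or definitive solution:

```perl
use strict;
use warnings;
no warnings 'utf8';    # assumption: quiet any noncharacter warnings on older perls

# Hypothetical helper: return every code point whose
# Decomposition_Type is Canonical, by matching each one in turn.
sub canonical_decomposables {
    my @cps;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        push @cps, $cp if chr($cp) =~ /\p{Dt=Can}/;
    }
    return @cps;
}

my @list = canonical_decomposables();
printf "%d code points have canonical decompositions\n", scalar @list;
```

The result depends on the Unicode version bundled with the perl that runs it, so the count will differ between releases.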

I could easily write a new function for UCD that returns a list of all code points that have a given property.
