On 06/27/2011 08:26 AM, BobH wrote:
A project I'm working on needs to build a list of all Unicode characters
that have canonical decompositions. The most efficient ways I can think
of to get such a list are from unicore/Decomposition.pl or by scanning
unicore/UnicodeData.txt. However:

Re unicore/Decomposition.pl, the header of this says:

# !!!!!!! INTERNAL PERL USE ONLY !!!!!!!
# This file is for internal use by the Perl program only. The format and even
# the name or existence of this file are subject to change without notice.
# Don't use it directly.

Re unicore/UnicodeData.txt, I've recently posted a version of my module
that uses unicore/UnicodeData.txt to CPAN, and from Perl 5.14 testers
I've received only failure notices which indicate that the file cannot
be found :-(

Unicode::UCD can tell me if a specific character has a decomposition,
but can't give me a list of characters that have decompositions.

Any suggestions would be appreciated.

Bob


I'm presuming you need this not as a one-time thing, but so that you can run this program over and over. You can always download UnicodeData.txt from the Unicode web site. Alternatively, in a regular expression, \p{Dt=Can} (Decomposition_Type=Canonical) will match exactly the characters you want. I expect that in 5.16 the stringification of that regex will include the list you want, but it does not in 5.14, and the stringification format is not guaranteed to stay fixed either.
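
To make the regex suggestion concrete, here is a minimal sketch that builds the list by brute force: it tests every code point against \p{Dt=Canonical} and keeps the ones that match. The sub name canonical_decomposables is hypothetical (it is not part of Unicode::UCD or any other module), and the result reflects whatever Unicode version the running perl was built with, so treat it as an illustration rather than a definitive implementation.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Unicode::UCD: returns every code point
# whose Decomposition_Type is Canonical, according to the Unicode tables
# shipped with the perl that runs this script.
sub canonical_decomposables {
    my @found;
    for my $cp (0 .. 0x10FFFF) {
        # Surrogates are not characters, so skip them.
        next if $cp >= 0xD800 && $cp <= 0xDFFF;
        push @found, $cp if chr($cp) =~ /\p{Dt=Canonical}/;
    }
    return @found;
}

my @cps = canonical_decomposables();
printf "%d code points have canonical decompositions\n", scalar @cps;
printf "U+%04X\n", $_ for @cps[0 .. 4];    # show the first few
```

The scan covers about 1.1 million code points, so if the program runs often it is probably worth computing the list once and caching it.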

I could easily write a new function for Unicode::UCD that returns a list of all code points that have a given property.
