Re: Need: list of Unicode characters that have canonical decompositions.

BobH Mon, 27 Jun 2011 19:04:25 -0700

Karl Williamson wrote:

> I'm presuming you need this not for a one-time only thing, but to be
>  able to run this program over and over.

Yes -- this is for a module that will be usable in a number of situations. See http://search.cpan.org/~bhallissy/Text-Unicode-Equivalents-0.05/.

The current implementation cheats by accessing unicore/Decomposition.pl exactly the same way Unicode::UCD does.


> You can always download UnicodeData.txt from the Unicode web site.

Yes I can -- and certainly have done for my personal use. But including that file (or some derivative) in a general purpose module would mean that it wouldn't necessarily have the same Unicode version as the Perl installation into which my module might be installed. And besides, the information I need is already in the Perl core -- though supposedly not usable.
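
To illustrate the version concern: each perl ships its own copy of the Unicode data, and Unicode::UCD reports which version that is. A minimal sketch, where the printed message is only an example:

```perl
use strict;
use warnings;
use Unicode::UCD ();

# Ask the installed perl which Unicode version its core data files carry.
# A UnicodeData.txt bundled with a module could easily be a different version.
my $version = Unicode::UCD::UnicodeVersion();

print "This perl ($^V) ships Unicode data version $version\n";
```

A bundled data file would have to be checked against this value, which is the maintenance burden described above.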


> In a regular expression,
> \p{Dt= can} (Decomposition_Type=Canonical) will match all characters
>  that you want.

Yes, I understand that I can test a character to see if it has a particular decomposition, but I'm not sure I understand how to use a regex to generate a complete list of characters with decompositions.
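
One brute-force way to turn the per-character test into a complete list is to try every code point in turn. This is only a sketch, not necessarily what Karl has in mind: `canonical_decomposables` is a hypothetical helper name, and the loop assumes a perl whose regex engine accepts `\p{Decomposition_Type=Canonical}`.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the whole Unicode code space and keep every
# code point whose character matches the canonical-decomposition property.
sub canonical_decomposables {
    my @found;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        push @found, $cp
            if chr($cp) =~ /\p{Decomposition_Type=Canonical}/;
    }
    return @found;
}

my @list = canonical_decomposables();
printf "%d code points have a canonical decomposition\n", scalar @list;
```

This tests over a million code points on every run, so a module would probably want to build the list once and cache it. The result follows whatever Unicode version the running perl carries.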


> I'm thinking that 5.16 will have the stringification
> of that regex include the list you want, but not in 5.14, and
> stringification is not necessarily fixed either.
>
> I could easily write a new function for UCD that returns a list of
> all code points that have a given property.

That is an interesting offer, and I think this should be given serious consideration. I'm sure my little module isn't the only one that, as we go into the future, would benefit from such a function.


Thanks for your reply, Karl.

Bob
