On Fri, Oct 15, 2010 at 3:19 PM, Tim Greenwood <timo...@greenwood.name> wrote:
> Is there any regular expression - in perl, or elsewhere, that enables
> searching on the derived age? I want to find all characters in a file added
> since Unicode 4.1.
> I could write it all by processing against the derived age file, but it
> would be nice if it is ready to go.
>

You could use an ICU UnicodeSet or an ICU regular expression.

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:^Cn:]%26[:^age%3D4.1:]]&abb=on&g=
http://userguide.icu-project.org/strings/unicodeset
http://userguide.icu-project.org/strings/regexp

A (frozen) UnicodeSet with its span() or spanUTF8() method might suffice,
depending on what you need.

We also have dedicated API (UCharacter.java/uchar.h) for the non-Unihan
properties.

Note what UTS #18 <http://www.unicode.org/reports/tr18/> says about [:age:]
or \p{age} (which ICU implements):

*Age*
*Caution:* The DerivedAge <http://www.unicode.org/Public/UNIDATA/DerivedAge.txt>
data file in the UCD provides the deltas between versions, for compactness.
However, when using the property all characters included in that version are
included. Thus \p{age=3.0} includes the letter *a*, which was included in
Unicode 1.0. To get characters that are new in a particular version, subtract
off the previous version as described in 1.3 Subtraction and Intersection
<http://www.unicode.org/reports/tr18/#Subtraction_and_Intersection>.
For example: [\p{age=3.1} -- \p{age=3.0}]

Best regards,
markus
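
For the do-it-yourself route from the question (processing the derived age file directly), here is a minimal sketch in plain Python, not ICU. It relies on the point in the caution above: DerivedAge.txt lists each range under the version in which it was first assigned, so "added since 4.1" means keeping the ranges whose version is greater than 4.1. The function names and the inline sample lines are illustrative assumptions; real use would read the actual DerivedAge.txt from the UCD.

```python
import re

# Matches DerivedAge.txt-style data lines such as
#   "0061..007A    ; 1.1 #  [26] ..."
#   "20B9          ; 6.0 #       ..."
_LINE = re.compile(
    r"^\s*([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\d+)\.(\d+)"
)


def ranges_newer_than(lines, major, minor):
    """Return sorted (first, last) code point ranges whose age is > major.minor."""
    found = []
    for line in lines:
        m = _LINE.match(line)
        if not m:
            continue  # comment, blank line, or anything unrecognized
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        age = (int(m.group(3)), int(m.group(4)))
        if age > (major, minor):
            found.append((first, last))
    return sorted(found)


def chars_newer_than(text, ranges):
    """Return the characters of text that fall inside any of the ranges."""
    return [ch for ch in text if any(lo <= ord(ch) <= hi for lo, hi in ranges)]


# Hypothetical sample in DerivedAge.txt format (illustrative, abbreviated);
# real use would iterate over the lines of the UCD file instead.
SAMPLE = """\
# Sample lines
0061..007A    ; 1.1 #  [26] ...
0242..024F    ; 5.0 #  [14] ...
20B9          ; 6.0 #       ...
"""

if __name__ == "__main__":
    new_ranges = ranges_newer_than(SAMPLE.splitlines(), 4, 1)
    for ch in chars_newer_than("a\u0242b\u20b9", new_ranges):
        print(f"U+{ord(ch):04X}")
```

The linear scan over ranges is fine for a one-off check; for large files, an ICU UnicodeSet as suggested above is likely the better tool.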