[Encode] Compound Unicode Character Support in UCM

Dan Kogai Mon, 01 Apr 2002 04:45:04 -0800

On Monday, April 1, 2002, at 09:08 , Dan Kogai wrote:
> On Monday, April 1, 2002, at 08:40 , Nick Ing-Simmons wrote:
>>>   I have recently found this undocumented feature but dared not use 
>>> it.
>>
>> I was not aware it was actually implemented ;-)
>
> Well, half of it.  The regex that catches multiple <U...> was there, but 
> only the first one was used, and multiple occurrences of <U...> croak 
> with a "Bad line:" message.  But this error was good enough for me to 
> find where to fix.


   And here is a quick fix to enc2xs that allows multiple occurrences of 
<U...>.  It's slightly faster too because there is no backtracking.

--- bin/enc2xs  2002/03/31 21:00:50     1.10
+++ bin/enc2xs  2002/04/01 12:55:37
@@ -381,16 +381,15 @@
     s/#.*$//;
     last if /^\s*END\s+CHARMAP\s*$/i;
     next if /^\s*$/;
-   my ($u,@byte);
-   my $fb = '';
-   $u = $1 if (/^<U([0-9a-f]+)>\s+/igc);
-   push(@byte,$1) while /\G\\x([0-9a-f]+)/igc;
-   $fb = $1 if /\G\s*(\|[0-3])/gc;
-   # warn "$_: $u @byte | $fb\n";
-   die "Bad line:$_" unless /\G\s*(#.*)?$/gc;
-   if (defined($u))
+   my (@uni, @byte) = ();
+   my ($uni, $byte, $fb) = m/^(\S+)\s+(\S+)\s+(\S+)\s+/o
+       or die "Bad line: $_";
+   push @uni, $1  while ($uni =~  m/\G<U([0-9a-fA-F]+)>/g);
+   # warn join(",", @uni);
+   push @byte, $1 while ($byte =~ m/\G\\x([0-9a-fA-F]+)/g);
+   if (@uni)
      {
-     my $uch = encode_U(hex($u));
+     my $uch =  join('', map { encode_U(hex($_)) } @uni );
       my $ech = join('',map(chr(hex($_)),@byte));
       my $el  = length($ech);
       $max_el = $el if (!defined($max_el) || $el > $max_el);

   A quick test against a freshly brewed macJapan.ucm (freshly created from 
JAPANESE.txt at unicode.org) shows it is working.
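
   For readers who want to try the parsing idea outside enc2xs, here is a 
minimal standalone sketch of the same approach: split the line into three 
whitespace-separated fields, then pull every <U...> and every \x.. out of 
its field.  The name parse_ucm_line, the returned hash layout, and the 
sample line (including its byte values) are made up for illustration and 
are not part of enc2xs.  Unlike the patch, the sketch also checks that 
each field contains nothing but well-formed tokens.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of enc2xs): split one CHARMAP line into
# its Unicode code points, its byte string, and the fallback flag.
sub parse_ucm_line {
    my ($line) = @_;
    $line =~ s/#.*$//;    # strip a trailing comment, as enc2xs does

    # Three whitespace-separated fields: unicode, bytes, fallback.
    my ($uni, $byte, $fb) = $line =~ m/^\s*(\S+)\s+(\S+)\s+(\S+)\s*$/
        or return;

    # Each field must consist of well-formed tokens only.
    return unless $uni  =~ m/^(?:<U[0-9a-fA-F]+>)+$/;
    return unless $byte =~ m/^(?:\\x[0-9a-fA-F]+)+$/;

    # A /g match in list context returns every capture in order.
    my @uni  = $uni  =~ m/<U([0-9a-fA-F]+)>/g;
    my @byte = $byte =~ m/\\x([0-9a-fA-F]+)/g;

    return {
        uni  => [ map { hex($_) } @uni ],
        byte => join('', map { chr(hex($_)) } @byte),
        fb   => $fb,
    };
}

# Illustrative line only; the byte values are invented.
my $entry = parse_ucm_line("<U304B><U309A> \\x86\\x66 |0 # compound\n");
print join(' ', map { sprintf('U+%04X', $_) } @{ $entry->{uni} }), "\n";
# prints "U+304B U+309A"
```

   The sketch returns nothing for a line it cannot parse, leaving it to the 
caller to decide whether to die as enc2xs does.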

>
>>>   I think it looks better if it were written as
>>>
>>> <UNNNN+UMMMM> \xYY\xYY ....
>>
>> I don't like the <UNNNN+UMMMM> part it will make the parsing messier.
>>
>> The \xYY\xYY is of course what I meant ;-)
>
> Not that much.  It's just a regex after all.  Let's TIMTOWTDI it.

   <U...><U...> is already working.  <U...+U...> is soon to come.
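
   As a rough idea of how both spellings could be accepted by one regex, 
here is a sketch, not the eventual enc2xs code.  It assumes the proposed 
form is exactly <UNNNN+UMMMM>, as written earlier in this thread; the 
function name unicode_token_to_codepoints is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the code points named by one Unicode field,
# accepting <UNNNN><UMMMM> as well as the proposed <UNNNN+UMMMM>.
sub unicode_token_to_codepoints {
    my ($token) = @_;
    my $hex = qr/[0-9a-fA-F]+/;

    # Validate the whole field first: one or more <...> groups, each
    # holding one or more U-prefixed hex numbers joined by '+'.
    return unless $token =~ m/^(?:<U${hex}(?:\+U${hex})*>)+$/;

    # 'U' is not a hex digit, so every U followed by hex digits is
    # exactly one code point.
    return map { hex($_) } $token =~ m/U(${hex})/g;
}

my @old_style = unicode_token_to_codepoints('<U304B><U309A>');
my @new_style = unicode_token_to_codepoints('<U304B+U309A>');
print "same\n" if "@old_style" eq "@new_style";
```

   Validating before extracting keeps the extraction regex trivial, which 
is the "it's just a regex" point above.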

Dan the Encode Maintainer
