On Monday, April 1, 2002, at 09:08, Dan Kogai wrote:
> On Monday, April 1, 2002, at 08:40, Nick Ing-Simmons wrote:
>>> I have recently found this undocumented feature but dared not use
>>> it.
>>
>> I was not aware it was actually implemented ;-)
>
> Well, half of it.  the regex that catches multiple <U...> was there but
> only the first one was used and the multiple occurance of <U...> croaks
> with a "Bad line:" message.  But this error was good enough for me to
> find where to fix.

And here is the quick fix to enc2xs that allows multiple occurrences of
<U...>.  It's slightly faster too because there is no backtracking.

--- bin/enc2xs  2002/03/31 21:00:50     1.10
+++ bin/enc2xs  2002/04/01 12:55:37
@@ -381,16 +381,15 @@
       s/#.*$//;
       last if /^\s*END\s+CHARMAP\s*$/i;
       next if /^\s*$/;
-      my ($u,@byte);
-      my $fb = '';
-      $u = $1 if (/^<U([0-9a-f]+)>\s+/igc);
-      push(@byte,$1) while /\G\\x([0-9a-f]+)/igc;
-      $fb = $1 if /\G\s*(\|[0-3])/gc;
-      # warn "$_: $u @byte | $fb\n";
-      die "Bad line:$_" unless /\G\s*(#.*)?$/gc;
-      if (defined($u))
+      my (@uni, @byte) = ();
+      my ($uni, $byte, $fb) = m/^(\S+)\s+(\S+)\s+(\S+)\s+/o
+         or die "Bad line: $_";
+      push @uni, $1 while ($uni =~ m/\G<U([0-9a-fA-F]+)>/g);
+      # warn join(",", @uni);
+      push @byte, $1 while ($byte =~ m/\G\\x([0-9a-fA-F]+)/g);
+      if (@uni)
        {
-        my $uch = encode_U(hex($u));
+        my $uch = join('', map { encode_U(hex($_)) } @uni );
         my $ech = join('',map(chr(hex($_)),@byte));
         my $el  = length($ech);
         $max_el = $el if (!defined($max_el) || $el > $max_el);

A quick test against a freshly brewed macJapan.ucm (created from
JAPANESE.txt at unicode.org) shows that it is working.

>
>>> I think it looks better if it were written as
>>>
>>> <UNNNN+UMMMM> \xYY\xYY ....
>>
>> I don't like the <UNNNN+UMMMM> part it will make the parsing messier.
>>
>> The \xYY\xYY is of course what I meant ;-)
>
> Not that much.  It's just a regex after all.  Let's TIMTOWTDI it.

<U...><U...> is already working.  <U...+U...> is soon to come.

Dan the Encode Maintainer
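
P.S. For anyone who wants to poke at the parsing outside enc2xs, here is a minimal standalone sketch of the same idea: split the line into three whitespace-separated fields first, then pull every <U...> and \xYY out of its own field. The sub name parse_charmap_line and the sample mapping are made up for illustration and are not part of enc2xs. The sketch also accepts a line with no trailing whitespace, which is an assumption on my side rather than what the patch does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in enc2xs): parse one CHARMAP line into
# its Unicode code points, its byte values, and its fallback flag.
sub parse_charmap_line {
    my ($line) = @_;
    $line =~ s/#.*$//;    # drop a trailing comment, as enc2xs does

    # Three whitespace-separated fields: <U...>..., \xYY..., |N
    my ($uni, $byte, $fb) = $line =~ m/^(\S+)\s+(\S+)\s+(\S+)\s*$/
        or die "Bad line: $line";

    my (@uni, @byte);
    # \G makes each match start exactly where the previous one ended.
    push @uni,  $1 while ($uni  =~ m/\G<U([0-9a-fA-F]+)>/g);
    push @byte, $1 while ($byte =~ m/\G\\x([0-9a-fA-F]+)/g);

    return (\@uni, \@byte, $fb);
}

# Invented sample line; the mapping itself means nothing.
my ($uni_ref, $byte_ref, $flag) =
    parse_charmap_line('<U0041><U0301> \x81\x40 |0 # sample');

printf "uni=%s bytes=%s fb=%s\n",
    join(',', @$uni_ref), join(',', @$byte_ref), $flag;
```

Matching each field separately is what keeps the regexes simple: neither loop has to know what comes after its own field.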