Re: Inverse of /\p{script}/

2003-08-29 Thread Nick Ing-Simmons
Jungshik Shin <[EMAIL PROTECTED]> writes:
>
>  If you want, you can take a look at nsFontMetricsGTK.cpp file
>of mozilla. 

Can you pass on my admiration to the Mozilla team - its 
handling of these issues in version 1.4 is so much better
than ye-olde Netscape.

>You can view that huge file (over 6,000 lines of
>code) by going to http://lxr.mozilla.org and typing in the
>file name. Compare that with nsFontMetricsXft.cpp or
>nsFontMetricsWin.cpp and you'll realize the enormous difficulty of the
>problem you're trying to tackle.  See, for example,
>

Believe me I know the scope of the problem.
I started on porting the allegedly  "Unicode aware" tk8.x from Tcl/Tk 
back in 2000. The font problem has been plaguing me ever since.

Note my _personal_ concern is not CJK, nor correct rendering of 
all the Russian SPAM I get, but IPA phonetics. These suffer particularly 
badly as many of the glyphs are "borrowed" from various blocks.
But Tk's "loony" scheme means I get bold version of this, italic 
version of that, re-scaled bitmap of the other - and then the 
diacritics dumped in a heap on top. 

>
>  Perhaps, your problem is of the limited scope, but still
>it won't be a  very pleasant experience ;-)

The scope is not limited - quite the reverse - this is a toolkit, so 
I don't know what it is being used for. However that is also my "way out":
provided I can give the end-users a way to do something decent I can leave
the final compromises to them. What I don't like about current tk8 
is that it forces some very dubious compromises on the end-user.
So my postings here (to perl-unicode) are to see if perl has the 
mechanisms to help. The suggestions so far have been very useful.

>
>Can Tk use Xft and fontconfig
>(http://fontconfig.org) where/when available? 
>Using XLFD-based
>fonts (15year old) is not such a good idea if you don't have to.

Not in the imminent release - that is too big a change to core-tk I inherit.
But as I had already more or less convinced myself that was the way to 
go and now Owen Taylor and you have both suggested it I will give
it serious consideration.

perl/Tk is notionally a port of Tk from Tcl to perl, so I try not 
to mess with the fundamentals, as that just creates more work.
But this is _such_ a mess, and font handling is relatively well 
encapsulated, so I really think an Xft type solution is worth a try.

>
>
>> What gets really painful is the Unicode fonts - one has to look at
>> which characters it has to decide if it
>> Japanese/Simplified Chinese/Traditional Chinese/Korean or just a grab-bag
>> of glyphs font designer had to hand.
>
>  Some Unicode fonts have a signature for 'lang'. For instance,
>'misc-fixed-iso10646-1' fonts that come with XFree86 have 'ko' and
>'ja' in add-style field. 

But no Chinese? (I don't read any of these, but I am getting a feel for 
what they should look like and I have co-workers that do read chinese.)

>Other fonts have a 'lang signature' in
>registry-encoding pairs (e.g. ucs2.korean-0)

Ah - I have not seen any of those yet.

>http://bugzilla.mozilla.org/show_bug.cgi?id=215537). As you wrote,
>most of them don't and it'll make some old system almost to a halt
>for tens of seconds if you want to open Unicode X11 core fonts
>and examine which char. is covered.

It is the freeze for tens of seconds followed by 
a complete mess that I feel I must fix - 
10s for beauty, or an instant mess, I could justify ;-)

The other irritant is that the more fonts one has the worse it gets.
The best results (for Tk) are obtained by deleting all but a selected
few fonts so it can't choose the wrong ones :-(




Re: Inverse of /\p{script}/

2003-08-29 Thread Jungshik Shin
On Fri, 29 Aug 2003, Nick Ing-Simmons wrote:

> What I am hoping to do for Tk804 is put some kind of callback to perl
> hook in so that when Tk wants a font for a particular character it
> can call to perl and perl will give it strong push in a particular direction.
> Thus for someone expecting Japanese if asked for a Han character
> it will suggest a JIS font. While for someone expecting Chinese it
> will suggest a Big5 or gb2312 font as appropriate.

  If you want, you can take a look at nsFontMetricsGTK.cpp file
of mozilla. You can view that huge file (over 6,000 lines of
code) by going to http://lxr.mozilla.org and typing in the
file name. Compare that with nsFontMetricsXft.cpp or
nsFontMetricsWin.cpp and you'll realize the enormous difficulty of the
problem you're trying to tackle.  See, for example,


  Perhaps your problem is of limited scope, but still
it won't be a very pleasant experience ;-)

Can Tk use Xft and fontconfig
(http://fontconfig.org) where/when available? Using XLFD-based
fonts (15 years old) is not such a good idea if you don't have to.


> What gets really painful is the Unicode fonts - one has to look at
> which characters it has to decide if it
> Japanese/Simplified Chinese/Traditional Chinese/Korean or just a grab-bag
> of glyphs font designer had to hand.

  Some Unicode fonts have a signature for 'lang'. For instance,
'misc-fixed-iso10646-1' fonts that come with XFree86 have 'ko' and
'ja' in add-style field. Other fonts have a 'lang signature' in
registry-encoding pairs (e.g. ucs2.korean-0; see
http://bugzilla.mozilla.org/show_bug.cgi?id=215537). As you wrote,
most of them don't, and it will bring some old systems almost to a halt
for tens of seconds if you want to open Unicode X11 core fonts
and examine which characters are covered.
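
The 'lang signature' check described above can be done on the font name
alone, without opening the font. A minimal sketch in Perl, assuming a
well-formed 14-field XLFD name; the helper name xlfd_lang_hints and the
sample font name are illustrative assumptions, not anything Tk or
Mozilla provides:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the add-style, registry and encoding fields
# out of an XLFD font name. An XLFD name starts with '-' and has 14
# '-'-separated fields, so split yields 15 elements (the first is empty).
sub xlfd_lang_hints {
    my ($xlfd) = @_;
    my @f = split /-/, $xlfd, -1;      # -1 keeps trailing empty fields
    return unless @f == 15 && $f[0] eq '';
    return {
        add_style => $f[6],            # e.g. 'ja' or 'ko' when present
        registry  => $f[13],           # e.g. 'iso10646' or 'ucs2.korean'
        encoding  => $f[14],
    };
}

# Illustrative font name; the exact sizes are an assumption.
my $hints = xlfd_lang_hints(
    '-misc-fixed-medium-r-normal-ja-18-120-100-100-c-180-iso10646-1');
print "add-style: $hints->{add_style}\n" if $hints;
```

A font with an empty add-style field and a plain iso10646-1 registry
carries no hint, and that is the case where the slow coverage scan is
still needed.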

  Jungshik


Re: Endless loop with illegal UTF-8 in Encode.pm

2003-08-29 Thread Jarkko Hietaniemi
On Fri, Aug 29, 2003 at 11:00:38AM +0200, Sven Neuhaus wrote:
> Hi,
> 
> I'm seeing a script using XML::Simple go berserk (eats CPU + Memory) when 
> feeding it XML with illegal UTF-8.
> 
> The perl debugger is telling me it's jumping around in Encode.pm
> between line 187 ("sub decode_utf8") and line 246 ("*decode = sub {...").
> 
> It's doing something like:
>  my $str = Encode::decode_utf8($octets);
> and then
>  return undef unless utf8::decode($str);
> (2 functions calling each other).
> 
> Is this a known bug in Encode.pm? Has it been fixed?

Maybe and maybe.  Could you show the illegal UTF-8?

> My Encode.pm is Version 1.75. It's part of the perl debian package
> 5.8.0-19 (debian unstable).
> I wish Encode.pm would handle invalid UTF-8 in a graceful manner...
> 
> Cheers,
> -Sven

-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen


Re: Inverse of /\p{script}/

2003-08-29 Thread Nick Ing-Simmons
Owen Taylor <[EMAIL PROTECTED]> writes:
>On Fri, 2003-08-29 at 11:14, Nick Ing-Simmons wrote:
>> >
>> >We're dropping support for this code and for core X fonts
>> >in the next release of Pango, 
>> 
>> In favour of what? (FreeType on client side?)
>
>Yes, using the Xft and fontconfig libraries. (http://www.fontconfig.org)
>
>If you have the RENDER extension (XFree86 from the last several years,
>recent Sun X server), it takes advantage of that to do efficient
>compositing of antialiased glyphs on the server side
>
>Using legacy X servers, it renders using standard X primitives either
>antialiased or not, and the performance is quite usable.
>
>The advantages are pretty huge:
>
> - Real unicode charmaps, instead of legacy encoding schemes.
>
> - Access to all the font tables, so you can do complex-text layout
>   using OpenType
>
> - No need to pull huge metrics and charmaps tables over the 
>   X connection
>
>Etc. 

That is what I thought as well. 





Endless loop with illegal UTF-8 in Encode.pm

2003-08-29 Thread Sven Neuhaus
Hi,

I'm seeing a script using XML::Simple go berserk (eats CPU + Memory) when 
feeding it XML with illegal UTF-8.

The perl debugger is telling me it's jumping around in Encode.pm
between line 187 ("sub decode_utf8") and line 246 ("*decode = sub {...").
It's doing something like:
 my $str = Encode::decode_utf8($octets);
and then
 return undef unless utf8::decode($str);
(2 functions calling each other).
Is this a known bug in Encode.pm? Has it been fixed?

My Encode.pm is Version 1.75. It's part of the perl debian package
5.8.0-19 (debian unstable).
I wish Encode.pm would handle invalid UTF-8 in a graceful manner...
Cheers,
-Sven
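
One way to get the graceful behaviour asked for above is to request a
strict decode and trap the failure. A minimal sketch, assuming a
reasonably current Encode; decode_utf8_strict is a hypothetical wrapper
name, and this does not diagnose the loop seen with Encode 1.75:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper: return the decoded string, or undef if the
# octets are not valid UTF-8.
sub decode_utf8_strict {
    my ($octets) = @_;
    my $copy = $octets;   # decode() may modify its argument when CHECK is set
    # FB_CROAK makes decode() die on malformed input; eval traps that.
    my $str = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $str;          # undef when decode() died
}

my $text = decode_utf8_strict("caf\xC3\xA9");
print defined $text ? "valid\n" : "invalid\n";
```

The caller can then decide whether to reject the document or fall back
to another encoding before the octets reach the XML parser.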


Re: Inverse of /\p{script}/

2003-08-29 Thread Owen Taylor
On Fri, 2003-08-29 at 11:14, Nick Ing-Simmons wrote:
> >
> >We're dropping support for this code and for core X fonts
> >in the next release of Pango, 
> 
> In favour of what? (FreeType on client side?)

Yes, using the Xft and fontconfig libraries. (http://www.fontconfig.org)

If you have the RENDER extension (XFree86 from the last several years,
recent Sun X server), it takes advantage of that to do efficient
compositing of antialiased glyphs on the server side

Using legacy X servers, it renders using standard X primitives either
antialiased or not, and the performance is quite usable.

The advantages are pretty huge:

 - Real unicode charmaps, instead of legacy encoding schemes.

 - Access to all the font tables, so you can do complex-text layout
   using OpenType

 - No need to pull huge metrics and charmaps tables over the 
   X connection

Etc. 

Regards,
Owen




Re: Inverse of /\p{script}/

2003-08-29 Thread Nick Ing-Simmons
Owen Taylor <[EMAIL PROTECTED]> writes:
>You might want to look at what we did for Pango - see 
>pango/modules/basic/tables-big.i in
>ftp://ftp.gtk.org/pub/gtk/v2.2/pango-1.2.5.tar.gz.

[There may come a time when I just give up Tcl/Tk and implement 
perl/Tk OO interface on top of gtk instead. But not yet...]

>
>There is a big map there that for each Unicode codepoint lists
>possible encodings with a moderately clever encoding scheme to save
>memory. Then based on the current language tag (either from 
>the program or from the current locale setting), there is an order
>in which to try encodings.

Sounds worth a look.

>
>We're dropping support for this code and for core X fonts
>in the next release of Pango, 

In favour of what? (FreeType on client side?)

>but if you find it useful, feel
>free to borrow the techniques, tables, generation tools, 
>or table lookup code and use it under whatever license you
>want.
>
>Regards,
>   Owen



Re: Inverse of /\p{script}/

2003-08-29 Thread Owen Taylor
On Fri, 2003-08-29 at 03:07, Nick Ing-Simmons wrote:
> Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
> >On Thu, Aug 28, 2003 at 03:16:20PM +0100, [EMAIL PROTECTED] wrote:
> >> 
> >> Does the existing perl5.8.* Unicode support have a way to efficiently 
> >> determine which script(s) or block (in unicode sense) a code point belongs
> >> to?
> >
> > use Unicode::UCD qw(charscript charblock);
> > print charscript(0x0388);
> > print charblock (0x30a0);
> 
> Great.
> 
> 
> >
> >> It seems to make sense to have a hash which maps script names to 
> >> probable (font) encodings 
> >> 
> >>  (Hiragana | Katakana | Han) => 'jisx0208.1990-0'
> >>  (Greek) => 'iso8859-7',  
> >
> >I dunno about script->font mappings...
> 
> That is Tk's (i.e. my) problem.
> XFree86 has the font encodings bundled so I think I can pre-analyse 
> them.

You might want to look at what we did for Pango - see 
pango/modules/basic/tables-big.i in
ftp://ftp.gtk.org/pub/gtk/v2.2/pango-1.2.5.tar.gz.

There is a big map there that for each Unicode codepoint lists
possible encodings with a moderately clever encoding scheme to save
memory. Then based on the current language tag (either from 
the program or from the current locale setting), there is an order
in which to try encodings.
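
The same idea can be approximated in Perl without a precomputed table
by asking Encode which encodings can represent a character and then
ordering them by language. This is only a sketch of the technique, not
Pango's table format; the encoding list, the %FIRST table and the sub
names are assumptions made for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumed fallback order and per-language preferences (Encode names).
my @ALL   = qw(iso-8859-1 iso-8859-7 koi8-r euc-jp euc-cn big5-eten euc-kr);
my %FIRST = (
    'ja'    => ['euc-jp'],
    'zh-cn' => ['euc-cn'],
    'zh-tw' => ['big5-eten'],
    'ko'    => ['euc-kr'],
);

# True if the character can be represented in the named encoding.
sub can_encode {
    my ($enc, $char) = @_;
    my $copy = $char;     # encode() may modify its argument when CHECK is set
    return eval { encode($enc, $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Encodings that cover $char, with the language's preferred ones first.
sub candidate_encodings {
    my ($char, $lang) = @_;
    my @preferred = @{ $FIRST{ lc(defined $lang ? $lang : '') } || [] };
    my %seen;
    my @order = grep { !$seen{$_}++ } (@preferred, @ALL);
    return grep { can_encode($_, $char) } @order;
}

print join(' ', candidate_encodings("\x{3042}", 'ja')), "\n";
```

Probing every encoding per character is slow, which is presumably why
a static table is used; a real implementation would cache the results.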

We're dropping support for this code and for core X fonts
in the next release of Pango, but if you find it useful, feel
free to borrow the techniques, tables, generation tools, 
or table lookup code and use it under whatever license you
want.

Regards,
Owen




Re: Inverse of /\p{script}/

2003-08-29 Thread Andreas J Koenig
> On Fri, 29 Aug 2003 11:08:33 +0100, Nick Ing-Simmons <[EMAIL PROTECTED]> said:

  > But cyrillic glyphs are likely double width :-(
  > This is one of reasons I want to do _something_ in this area.
  > I don't want to even try and read a big 16-bit Japanese font 
  > just to get cyrillic (for SPAMer's name) or greek Sigma (for math).

  > The other thing that needs fixing is that Tk currently ignores 
  > any locale information that might be available. So for "unified" ideographs
  > it will use a font that has the character regardless of which "style" it is
  > in. So for Japanese it is quite likely to find a simplified Chinese style
  > font and use that for Han, then when it hits Katakana it will find 
  > an 8-bit (JIS201?) font and use that for those, then when it finds 
  > a Hiragana it will find a JIS 208 font. The result looks a mess even
  > to my occidental eyes.

  > What I am hoping to do for Tk804 is put some kind of callback to perl
  > hook in so that when Tk wants a font for a particular character it 
  > can call to perl and perl will give it strong push in a particular direction.
  > Thus for someone expecting Japanese if asked for a Han character 
  > it will suggest a JIS font. While for someone expecting Chinese it 
  > will suggest a Big5 or gb2312 font as appropriate.

  > What gets really painful is the Unicode fonts - one has to look at 
  > which characters it has to decide if it 
  > Japanese/Simplified Chinese/Traditional Chinese/Korean or just a grab-bag 
  > of glyphs font designer had to hand. 

FWIW, I just found an old posting from the Mozilla developer Katsuhiko
Momoi. He explained:

 ... not every one would tag their Unicode documents with a lang
 tag indicating what language that is. And Mozilla has dependency
 on language for which font glyphs to use. For example, Unicode
 CJK ideographs are not necessarily rendered the same from
 language to language. The same code point may lead to different
 font glyphs dependent on what language it is. Unless every one
 uses a lang tag, I may end up seeing a Japanese document with
 some Chinese glyphs. And I definitely don't want that! (See how
 fonts are set in the preference dialog -- according to language.
 But if language info is not available in the docs, we do our best
 by looking at the charset info -- a charset is a good secondary
 determining factor for some language, e.g. Chinese, Japanese,
 Korean, etc.. Thus, the notion of primary charset is still useful
 in this situation. )

(Cited from http://bugzilla.mozilla.org/show_bug.cgi?id=13393)
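
The fallback described in that posting (explicit lang tag first, then
the charset as a secondary hint) can be sketched as a small lookup.
The table below is an illustrative assumption, not Mozilla's actual
mapping:

```perl
use strict;
use warnings;

# Assumed charset => language hints for the CJK cases mentioned above.
my %CHARSET_LANG = (
    'euc-jp'      => 'ja',
    'shift_jis'   => 'ja',
    'iso-2022-jp' => 'ja',
    'gb2312'      => 'zh-CN',
    'big5'        => 'zh-TW',
    'euc-kr'      => 'ko',
);

# Prefer an explicit lang tag; otherwise guess from the charset name.
# Returns undef when neither gives a hint (e.g. untagged UTF-8).
sub effective_lang {
    my ($lang_tag, $charset) = @_;
    return $lang_tag if defined $lang_tag && length $lang_tag;
    return undef unless defined $charset;
    return $CHARSET_LANG{ lc $charset };
}

my $lang = effective_lang(undef, 'EUC-JP');
print defined $lang ? "$lang\n" : "unknown\n";
```

An untagged Unicode document gives neither hint, which is the case
where the user's locale or preferences have to decide.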

Posting this just as a pointer to another project that may have developed
helpful code in that area...

-- 
andreas


Re: Inverse of /\p{script}/

2003-08-29 Thread Nick Ing-Simmons
Dan Kogai <[EMAIL PROTECTED]> writes:
>
>But that is not good enough for cases below because...
>
  (Hiragana | Katakana | Han) => 'jisx0208.1990-0'
>
>This is very wrong because jisx0208.1990-0 only contains \p{Han} that 
>appears in Japanese (JIS X 0208, to be exact).  On the other hand, 
>jisx0208.1990-0 does contain greek and cyrillic alphabets.

But cyrillic glyphs are likely double width :-(
This is one of reasons I want to do _something_ in this area.
I don't want to even try and read a big 16-bit Japanese font 
just to get cyrillic (for SPAMer's name) or greek Sigma (for math).

The other thing that needs fixing is that Tk currently ignores 
any locale information that might be available. So for "unified" ideographs
it will use a font that has the character regardless of which "style" it is
in. So for Japanese it is quite likely to find a simplified Chinese style
font and use that for Han, then when it hits Katakana it will find 
an 8-bit (JIS201?) font and use that for those, then when it finds 
a Hiragana it will find a JIS 208 font. The result looks a mess even
to my occidental eyes.

What I am hoping to do for Tk804 is put in some kind of callback-to-perl
hook so that when Tk wants a font for a particular character it 
can call to perl and perl will give it a strong push in a particular direction.
Thus for someone expecting Japanese if asked for a Han character 
it will suggest a JIS font. While for someone expecting Chinese it 
will suggest a Big5 or gb2312 font as appropriate.
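
Such a callback might look roughly like this on the perl side. It is
only a sketch: the sub name suggest_encoding, the way Tk would call it,
and both tables are assumptions for illustration, not an existing Tk
interface:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Assumed preference tables: language => { script => X11 charset }.
my %PREFERRED = (
    'ja'    => { Han      => 'jisx0208.1990-0',
                 Hiragana => 'jisx0208.1990-0',
                 Katakana => 'jisx0208.1990-0' },
    'zh-tw' => { Han => 'big5-0' },
    'zh-cn' => { Han => 'gb2312.1980-0' },
);
my %DEFAULT = ( Greek => 'iso8859-7', Cyrillic => 'iso8859-5' );

# Hypothetical callback: given a character and the language the user
# expects, suggest a charset, or undef to leave the choice to Tk.
sub suggest_encoding {
    my ($char, $lang) = @_;
    my $script = charscript(ord $char);
    return undef unless defined $script;
    my $by_lang = $PREFERRED{ lc(defined $lang ? $lang : '') } || {};
    return exists $by_lang->{$script} ? $by_lang->{$script}
                                      : $DEFAULT{$script};
}

print suggest_encoding("\x{5c0f}", 'ja'), "\n";
```

The same Han code point gets a different suggestion for 'zh-TW' or
'zh-CN', which is the locale-dependent push described above.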

What gets really painful is the Unicode fonts - one has to look at 
which characters a font has to decide if it is 
Japanese/Simplified Chinese/Traditional Chinese/Korean or just a grab-bag 
of glyphs the font designer had to hand. 

>
>One of so many reasons why Han Unification was a bad idea.  When it 
>comes to Han Ideographs, Unicode's sense of charscript is almost 
>useless.
>
>\x{5c0f}\x{98fc} \x{5f3e}



Re: Inverse of /\p{script}/

2003-08-29 Thread Dan Kogai
On Friday, Aug 29, 2003, at 16:07 Asia/Tokyo, Nick Ing-Simmons wrote:
> Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
>> On Thu, Aug 28, 2003 at 03:16:20PM +0100, [EMAIL PROTECTED] wrote:
>>> Does the existing perl5.8.* Unicode support have a way to efficiently
>>> determine which script(s) or block (in unicode sense) a code point
>>> belongs to?
>>
>>   use Unicode::UCD qw(charscript charblock);
>>   print charscript(0x0388);
>>   print charblock (0x30a0);
>
> Great.

But that is not good enough for cases below because...

>>>  (Hiragana | Katakana | Han) => 'jisx0208.1990-0'

This is very wrong because jisx0208.1990-0 only contains \p{Han} that 
appears in Japanese (JIS X 0208, to be exact).  On the other hand, 
jisx0208.1990-0 does contain greek and cyrillic alphabets.

One of so many reasons why Han Unification was a bad idea.  When it 
comes to Han Ideographs, Unicode's sense of charscript is almost 
useless.

\x{5c0f}\x{98fc} \x{5f3e}



Re: Inverse of /\p{script}/

2003-08-29 Thread Dan Kogai
On Thursday, Aug 28, 2003, at 23:16 Asia/Tokyo, [EMAIL PROTECTED]  
wrote:
> Does the existing perl5.8.* Unicode support have a way to efficiently
> determine which script(s) or block (in unicode sense) a code point
> belongs to?
>
> In Unicode-aware Tk I am still doing battle with mechanism to select
> X11 font to display a particular codepoint (for now glossing over
> glyph vs character issues).
> The present code is still rather dumb.

That's what Encode::InCharset is for.  Available via CPAN.

http://search.cpan.org/author/DANKOGAI/Encode-InCharset-0.03/

> It seems to make sense to have a hash which maps script names to
> probable (font) encodings
>
>  (Hiragana | Katakana | Han) => 'jisx0208.1990-0'

The module makes it \p{InJIS0208} ...

>  (Greek) => 'iso8859-7',

And \p{InISO_8859_7}, respectively.

> So given a (1 character) string how do I get Unicode script/block it is
> in?

One caveat, however.  It is slightly out of sync w/ the latest Encode.
You should stay away from vendor encodings that are thoroughly revised
in Encode 1.75 -> 1.98 (FYI Encode::InCharset is still based upon 1.75).

Dan the Encode Maintainer






Re: Inverse of /\p{script}/

2003-08-29 Thread Nick Ing-Simmons
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
>On Thu, Aug 28, 2003 at 03:16:20PM +0100, [EMAIL PROTECTED] wrote:
>> 
>> Does the existing perl5.8.* Unicode support have a way to efficiently 
>> determine which script(s) or block (in unicode sense) a code point belongs
>> to?
>
>   use Unicode::UCD qw(charscript charblock);
>   print charscript(0x0388);
>   print charblock (0x30a0);

Great.


>
>> It seems to make sense to have a hash which maps script names to 
>> probable (font) encodings 
>> 
>>  (Hiragana | Katakana | Han) => 'jisx0208.1990-0'
>>  (Greek) => 'iso8859-7',  
>
>I dunno about script->font mappings...

That is Tk's (i.e. my) problem.
XFree86 has the font encodings bundled so I think I can pre-analyse 
them.


>
>> So give a (1 character) string how do I get Unicode script/block it is in?
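
For the one-character-string case, the Unicode::UCD functions quoted
above take a code point, so ord() bridges the gap. A minimal sketch;
the wrapper name script_and_block is an assumption for illustration:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charblock);

# Hypothetical wrapper: return (script, block) for the first character
# of a string.
sub script_and_block {
    my ($char) = @_;
    my $cp = ord($char);               # code point of the first character
    return (charscript($cp), charblock($cp));
}

my ($script, $block) = script_and_block("\x{30a2}");
print "$script / $block\n";
```

Note that every unified ideograph reports the script 'Han', so this
alone cannot say which CJK font style a character needs.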