In my email client (IE browser, Courier New font), these
East Asian characters are not rendered at double width,
but at a fractional width between 1 and 2.

3 dashes: too narrow
+---+
|한글─|
+---+

5 dashes: too wide
+-----+
|한글─|
+-----+
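
For comparison, here is a minimal Python sketch of the width a
TR11-style rule would pick. It assumes a strictly fixed-pitch display
where Wide (W) and Fullwidth (F) characters take exactly 2 columns and
everything else takes 1; combining characters are ignored. The names
display_width and boxed are made up for this example. Under that rule
the box gets 5 dashes, which is the case that looks too wide in my
Courier New rendering.

```python
import unicodedata

def display_width(text):
    """Count columns: 2 for Wide/Fullwidth characters, 1 for all others."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

def boxed(text):
    """Draw a one-line ASCII box whose border matches the display width."""
    border = "+" + "-" * display_width(text) + "+"
    return "\n".join([border, "|" + text + "|", border])

print(display_width("한글─"))  # 2 + 2 + 1 columns
print(boxed("한글─"))
```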

Also, J currently stubbornly wants to draw the box
as if for a UTF-8 sequence, not for Unicode, even after
explicit conversion:

   <7 u:'한글─'
+---------+
|한글─|
+---------+

   datatype 7 u:'한글─'
unicode
   #7 u:'한글─'
3
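
The 9 dashes in the box above equal the UTF-8 byte length of the
string, not its 3 characters. A quick Python check of the two counts
(this only shows the arithmetic; that the JE sizes the box from the
byte count is my inference from the output, not something I have
confirmed in the source):

```python
text = "한글─"

code_points = len(text)                 # number of Unicode characters
utf8_bytes = len(text.encode("utf-8"))  # each of these characters takes 3 bytes

print(code_points, utf8_bytes)
```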


--- June Kim <[EMAIL PROTECTED]> wrote:

> I'm working on the code.
> 
> In the meantime, here is the code for calculating display width:
> 
> First you need to save the text file at
> http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt
> 
> ===============================================
> require 'regex jfiles'
> t=: 1!:1 <'EastAsianWidth.txt'
> point=:'^([0-9A-F]{4});(Na|N|H|A|W|F)' rxmatches t
> range=:'([0-9A-F]{1,4})\.\.([0-9A-F]{1,4});(Na|N|H|A|W|F)' rxmatches t
> jcreate 'unidatapoint'
> (< }."1 point rxfrom t) jappend 'unidatapoint'
> jcreate 'unidatarange'
> (< }."1 range rxfrom t) jappend 'unidatarange'
> ===============================================
> 
> Now you have unidatapoint.ijf and unidatarange.ijf and are able to use them.
> 
> ===============================================
> require 'jfiles'
> 
> NB. N  : half
> NB. Na : half
> NB. H  : half
> NB. A  : half
> NB. F  : full
> NB. W  : full
> 
> widthcode=:;: 'N Na H A F W'
> pod=:>jread 'unidatapoint';0
> rad=:>jread 'unidatarange';0
> 
> towc=: widthcode&i. NB. towidthcode
> 
> dfh=. 16&#. @ ('0123456789ABCDEF'&i.)
> po=:(dfh each {."1 pod),. <"0 towc"0 {:"1 pod
> ra=:(,&.>/"1 dfh each 2&{."1 rad),. <"0 towc"0 {:"1 rad
> poa=:>{."1 po
> 
> fill=: 4 : 0
>       'r c'=.x
>       r=. ({.r)+ i. >: -~/ r
>       ({.c) r}y
> )
> 
> tab=:65536$0 NB. missing is N
> tab=:(> {:"1 po) poa} tab
> tab=:>./ ra fill"1 tab
> 
> diswid=: [: >: [: 4&<: [: {&tab 3&u:@ucp  NB.for rank 1
> ================================================
> For better performance, you could save tab with jfiles and reuse
> it. You could also use a more compact representation (3 bits per
> character, then compress the data).
> 
> Usage Example:
>    diswid '한글ab!─'
> 2 2 1 1 1 1
>    NB. properly shows the top line in a fixed-pitch font:
>    (,:~ ((ucp'-') $~ +/@diswid)) ucp '한글ab!-'
> --------
> 한글ab!-
> 
> 
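
A rough Python rendering of the same idea as the J code above, for
comparison only. It is a sketch under assumptions: SAMPLE is a
hand-written stand-in for EastAsianWidth.txt in the "code;width"
layout that the regexes above expect (not a verbatim excerpt), and the
names parse_ranges and width_class are made up here; diswid is
borrowed from the J verb. Instead of filling a 65536-entry table it
keeps the ranges sorted and looks them up by binary search. Code
points missing from the data default to N, as in the J version.

```python
import re
from bisect import bisect_right

# Hand-written sample in the file's "code;width" layout (an assumption).
SAMPLE = """\
0061;Na
2500..254B;A
AC00..D7A3;W
FF01..FF60;F
"""

# Single code point or "lo..hi" range, followed by the width class.
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(Na|N|H|A|W|F)")

def parse_ranges(data):
    """Return a sorted list of (lo, hi, width_class) tuples."""
    ranges = []
    for line in data.splitlines():
        m = LINE.match(line)
        if m:
            lo = int(m.group(1), 16)
            hi = int(m.group(2), 16) if m.group(2) else lo
            ranges.append((lo, hi, m.group(3)))
    ranges.sort()
    return ranges

RANGES = parse_ranges(SAMPLE)
STARTS = [lo for lo, _, _ in RANGES]

def width_class(ch, default="N"):
    """Look up the width class of one character; missing means N."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and RANGES[i][0] <= cp <= RANGES[i][1]:
        return RANGES[i][2]
    return default

def diswid(text):
    """Display width per character: 2 for F and W, 1 otherwise."""
    return [2 if width_class(ch) in ("F", "W") else 1 for ch in text]

print(diswid("한글ab!─"))
```

With the real data file in place of SAMPLE, the same lookup should
cover code points above 65535 as well, which the fixed-size table
cannot.
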
> 
> 2007/2/13, Eric Iverson <[EMAIL PROTECTED]>:
> > The problem of proper display of boxed unicode data is an interesting
> > one. The first step to getting this fixed is for someone to provide a
> > working J model that takes an arbitrary boxed argument and produces the
> > character stream that properly displays it. If we had such a model we
> > might consider incorporating it into the JE.
> >
> > ----- Original Message -----
> > From: "June Kim" <[EMAIL PROTECTED]>
> > To: "General forum" <[email protected]>
> > Sent: Sunday, February 11, 2007 5:11 AM
> > Subject: Re: [Jgeneral] wd 'set ...' with box draw characters
> >
> >
> > > 2007/2/11, Chris Burke <[EMAIL PROTECTED]>:
> > >> June Kim wrote:
> > > [snip]
> > >> > Second, the box breaks with characters of differing width (that is,
> > >> > when the byte length of the encoding and the display width of the
> > >> > characters don't match). What is the usual way of solving
> > >> > it in other programming languages? There is a Unicode standard for
> > >> > character widths: http://unicode.org/reports/tr11/
> > >> >
> > >> > Python implements that standard (along with others) in its
> > >> > unicodedata module.
> > >> >
> > >> >>>> unicodedata.east_asian_width(u'한')
> > >> > 'W'
> > >> >>>> unicodedata.east_asian_width(u'a')
> > >> > 'Na'
> > >> >
> > >> > (u marks the following string as Unicode. east_asian_width
> > >> > returns the width class of any Unicode character, not only East
> > >> > Asian ones; its narrow name is historical.)
> > >> >
> > > [snip]
> > >>
> > >> If you are having problems with display, it is because of the font,
> > >> not because we are not using Unicode.
> > > [snip]
> > >
> > > When a string is boxed and it includes characters whose display
> > > width differs from their byte length, the box is broken in J. It
> > > is not because of the font. It is because J assumes that
> > > every character's width is the same as its byte length, which is
> > > obviously false in many writing and encoding systems, including East
> > > Asian ones. We can definitely say J's box display isn't internationalized
> > > yet.
> > >
> > > For example, 54620 (as a Unicode code point) is a Korean character,
> > > pronounced "han". Its width is "Wide" (twice as wide as Latin
> > > letters).
> > >
> > >   han=.4 u: 54620
> > >   <han
> > > +---+
> > > |한|
> > > +---+
> > >   <8 u: han
> > > +---+
> > > |한|
> > > +---+
> > >
> > > Since J counts the byte length to determine a character's width, and
> > > the byte length of han is 3 in UTF-8 ( 3-: #8 u: han ), the box's
> > > horizontal character '-' (whose width is "Narrow") is printed three
> > > times, and on the display the box is broken.



 
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
