When I look at these East Asian characters in my email client (in the IE browser), they are rendered in Courier New not at double width, but at a fractional width between 1 and 2.

3 dashes: too narrow

+---+
|한글â|
+---+

5 dashes: too wide

+-----+
|한글â|
+-----+

Also, J currently stubbornly wants to draw the box as if for a UTF-8
sequence, not for Unicode, even after explicit conversion:

   <7 u:'한글â'
+---------+
|한글â|
+---------+
   datatype 7 u:'한글â'
unicode
   #7 u:'한글â'
3

--- June Kim <[EMAIL PROTECTED]> wrote:

> I'm working on the code.
>
> In the meantime, here is the code for calculating display width:
>
> First you need to save the text file at
> http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt
>
> ===============================================
> require 'regex jfiles'
> t=: 1!:1 <'EastAsianWidth.txt'
> point=:'^([0-9A-F]{4});(Na|N|H|A|W|F)' rxmatches t
> range=:'([0-9A-F]{1,4})\.\.([0-9A-F]{1,4});(Na|N|H|A|W|F)' rxmatches t
> jcreate 'unidatapoint'
> (< }."1 point rxfrom t) jappend 'unidatapoint'
> jcreate 'unidatarange'
> (< }."1 range rxfrom t) jappend 'unidatarange'
> ===============================================
>
> Now you have unidatapoint.ijf and unidatarange.ijf and are able to use them.
>
> ===============================================
> require 'jfiles'
>
> NB. N : half
> NB. Na : half
> NB. H : half
> NB. A : half
> NB. F : full
> NB. W : full
>
> widthcode=:;: 'N Na H A F W'
> pod=:>jread 'unidatapoint';0
> rad=:>jread 'unidatarange';0
>
> towc=: widthcode&i. NB. towidthcode
>
> dfh=. 16&#. @ ('0123456789ABCDEF'&i.)
> po=:(dfh each {."1 pod),. <"0 towc"0 {:"1 pod
> ra=:(,&.>/"1 dfh each 2&{."1 rad),. <"0 towc"0 {:"1 rad
> poa=:>{."1 po
>
> fill=: 4 : 0
> 'r c'=.x
> r=. ({.r)+ i. >: -~/ r
> ({.c) r}y
> )
>
> tab=:65536$0 NB. missing is N
> tab=:(> {:"1 po) poa} tab
> tab=:>./ ra fill"1 tab
>
> diswid=: [: >: [: 4&<: [: {&tab 3&u:@ucp NB.for rank 1
> ================================================
> For performance improvement, you could save tab using jfile and use
> it. Also, you could use a more compact representation (using 3 bits to
> represent each character and compress the data).
> > Usage Example: > diswid 'íê¸ab!â' > 2 2 1 1 1 1 > (,:~ ((ucp'-') $~ +/@diswid)) ucp 'íê¸ab!-' NB. properly showing > the top line in fixed-pitch font > -------- > íê¸ab!- > > > > 2007/2/13, Eric Iverson <[EMAIL PROTECTED]>: > > The problem of proper display of boxed unicode data is an interesting > > one. The first step to getting this fixed is for someone to provide a > > working J model that takes an arbitrary boxed argument and produces the > > character stream that properly displays it. If we had such a model we > > might consider incorporating it into the JE. > > > > ----- Original Message ----- > > From: "June Kim" <[EMAIL PROTECTED]> > > To: "General forum" <[email protected]> > > Sent: Sunday, February 11, 2007 5:11 AM > > Subject: Re: [Jgeneral] wd 'set ...' with box draw characters > > > > > > > 2007/2/11, Chris Burke <[EMAIL PROTECTED]>: > > >> June Kim wrote: > > > [snip] > > >> > Second, the box is broken with different width characters(that is, > > >> > when the length of bytes of the encoding, and the width of the > > >> > characters on display don't match). What is the usual way of > > >> > solving > > >> > it in other programming languages? There is a unicode standard for > > >> > character widths. http://unicode.org/reports/tr11/ > > >> > > > >> > Python implements that standard(along with others) in unicodedata > > >> > module. > > >> > > > >> >>>> unicodedata.east_asian_width(u'í') > > >> > 'W' > > >> >>>> unicodedata.east_asian_width(u'a') > > >> > 'Na' > > >> > > > >> > (u specifies the following string is unicode. east_asian_width > > >> > returns > > >> > the width of the character, not only for east asian characters but > > >> > all > > >> > unicode characters; it's got a narrow name due to its history) > > >> > > > > [snip] > > >> > > >> If you are having problems with display, it is because of the font, > > >> not > > >> because we are not using unicode. 
> > > [snip]
> > >
> > > When a string is boxed and the string includes characters whose
> > > display width differs from their byte length, then the box is
> > > broken in J. It is not because of the font. It is because J makes
> > > an assumption that every character's width is the same as its byte
> > > length, which is obviously false in many writing+encoding systems,
> > > including East Asian ones. We can definitely say J's box display
> > > isn't internationalized yet.
> > >
> > > For example, 54620 (in unicode code point) is a Korean character,
> > > which is pronounced as "han". Its width is "Wide" (twice as wide as
> > > Latin letters).
> > >
> > >    han=.4 u: 54620
> > >    <han
> > > +---+
> > > |한|
> > > +---+
> > >    <8 u: han
> > > +---+
> > > |한|
> > > +---+
> > >
> > > Since J counts the byte length for determining a character's width,
> > > and the byte length for han is 3 in UTF-8 ( 3-: #8 u: han ), the
> > > box's horizontal character '-' (whose width is "Narrow") is printed
> > > three times, and on the display the box is broken.
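
To make the byte-length versus display-width point concrete, here is a minimal Python sketch (not J, and not how the JE draws boxes). It sizes the border from the East Asian Width property using the standard unicodedata.east_asian_width function mentioned in the thread. The helper names display_width and boxed are made up for illustration. It assumes a fixed-pitch font in which W and F characters occupy exactly two cells and everything else one; ambiguous-width (A) and combining characters get no special treatment, and a font that draws wide glyphs at a fractional width will still look off.

```python
import unicodedata


def display_width(ch):
    """Cells taken by one character: 2 for Wide/Fullwidth, otherwise 1.

    Assumption for this sketch: A (ambiguous), N, Na and H all count as 1.
    """
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def boxed(text):
    """Draw a one-line box whose border length follows display width."""
    width = sum(display_width(ch) for ch in text)
    border = "+" + "-" * width + "+"
    return "\n".join([border, "|" + text + "|", border])


han = chr(54620)                  # the Korean character from the example
print(len(han.encode("utf-8")))   # 3: the byte length J currently uses
print(display_width(han))         # 2: the width the border should use
print(boxed(han))
```

For han this draws a border of two dashes instead of three, matching the 2 that diswid reports for a Wide character.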
