Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-23 Thread Asmus Freytag
At 11:13 AM 4/23/2004, Philippe Verdy wrote: On Fri, 23 Apr 2004 12:12:57 -0400, "Edward H. Trager" <[EMAIL PROTECTED]> said: > 2 -- doing everything from regular Windows GUI tools, which have been > Unicode-friendly since forever. Maybe on Windows based on newer NT kernels only (NT4, 2000, XP, 200

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-23 Thread Edward H. Trager
On Friday 2004.04.23 13:57:56 -0400, [EMAIL PROTECTED] wrote: > Edward H. Trager scripsit: > > > (Windows' lack of a decent shell and command-line tools is probably > > what makes the OS most annoying). > > Cygwin (http://www.cygwin.com) is your friend; it provides a relatively > complete Unix h

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-23 Thread Philippe Verdy
On Fri, 23 Apr 2004 12:12:57 -0400, "Edward H. Trager" <[EMAIL PROTECTED]> said: > 2 -- doing everything from regular Windows GUI tools, which have been > Unicode-friendly since forever. Maybe on Windows based on newer NT kernels only (NT4, 2000, XP, 2003, ...). This sentence ("since forever") is

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-23 Thread jcowan
Edward H. Trager scripsit: > (Windows' lack of a decent shell and command-line tools is probably > what makes the OS most annoying). Cygwin (http://www.cygwin.com) is your friend; it provides a relatively complete Unix hosted on Win32. It works best on the NT branch of the family when the disks

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-23 Thread Edward H. Trager
On Friday 2004.04.23 09:11:30 -0700, Benjamin Peterson wrote: > > On Fri, 23 Apr 2004 12:12:57 -0400, "Edward H. Trager" > <[EMAIL PROTECTED]> said: > > > There is an issue that you might confront with these terminal-based tools > > on > > Windows and on Mac OSX that I myself don't know how to so

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-23 Thread Benjamin Peterson
On Fri, 23 Apr 2004 12:12:57 -0400, "Edward H. Trager" <[EMAIL PROTECTED]> said: > There is an issue that you might confront with these terminal-based tools > on > Windows and on Mac OSX that I myself don't know how to solve, and that is > that > I don't know how to switch to a UTF-8 locale on ei

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-23 Thread Tom Emerson
Edward H. Trager writes: > Perhaps someone else on this list can tell us how to get Apple's terminal application > or xterm running on OS X to display UTF-8 characters correctly[...] This is trivial in the terminal: 1. Select "Window Settings" from the "Terminal" menu. 2. Select "Display" from t

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-23 Thread Edward H. Trager
I've been following this thread initiated by Raymond Mercier's comments on the Unihan database with some slight amusement but mostly dismay that some readers of this list are using the completely wrong software tools for dealing with a *database* file like the Unihan table. My sincerest advice t

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-22 Thread Rick McGowan
> I've never managed to get either Notepad or Word to open Unihan.txt Just use EMACS. Works fine. Rick

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-21 Thread Doug Ewell
Raymond Mercier wrote: > The problem of the size of Unihan has nothing at all to do with the > cost of storage, and everything to do with the functioning of programs > that might open and read it. > Since the lines in Unihan are separated by 0x0A alone, not 0x0A0x0D, > this means that when opened

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-21 Thread Andrew C. West
On Tue, 20 Apr 2004 22:36:48 +0100, "Raymond Mercier" wrote: > > The problem of the size of Unihan has nothing at all to do with the cost of > storage, and everything to do with the functioning of programs that might > open and read it. > Since the lines in Unihan are separated by 0x0A alone, not

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-21 Thread Philippe Verdy
From: "John Cowan" <[EMAIL PROTECTED]> To: "Raymond Mercier" <[EMAIL PROTECTED]> > Raymond Mercier scripsit: > > > Since the lines in Unihan are separated by 0x0A alone, not 0x0A0x0D, this > > means that when opened in notepad the lines are not separated. Notepad does > > have the advantage that

RE: Unihan.txt and the four dictionary sorting algorithm

2004-04-20 Thread Tom Emerson
Unihan is designed, first and foremost, to be a _data_ file for consumption by software. It doesn't matter at all how many spaces are used for the tabs. The use of tabs makes it trivial to write scripts to parse the file with grep, awk, Perl, Python. With regard to the Pinyin orthography: tone num
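As a rough illustration of the point about tab-separated fields, here is a minimal Python sketch. It assumes each data line has three tab-separated columns (a "U+XXXX" code point, a field name, a value) and that lines starting with "#" are comments. The function name parse_unihan and the sample lines are made up for illustration and are not copied from the real file.

```python
def parse_unihan(lines):
    """Yield (code_point, field, value) from tab-separated Unihan-style lines.

    Assumed layout: "U+XXXX<TAB>field<TAB>value"; "#" starts a comment line.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3 or not parts[0].startswith("U+"):
            continue  # skip anything that does not match the assumed layout
        code, field, value = parts
        yield int(code[2:], 16), field, value


# Illustrative sample lines only, not taken from the real Unihan.txt.
sample = [
    "# illustrative comment line",
    "U+4E00\tkDefinition\tone",
    "U+4E8C\tkDefinition\ttwo",
]
records = list(parse_unihan(sample))
```

For the real file, the same generator could be fed an open file object, for example open("Unihan.txt", encoding="utf-8"), so that the file is streamed line by line rather than loaded into an editor.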

RE: Unihan.txt and the four dictionary sorting algorithm

2004-04-20 Thread Mike Ayers
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On > Behalf Of John Jenkins > Sent: Tuesday, April 20, 2004 6:40 PM > > The tab "character" is used in the file. Arguably, this > "charac

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-20 Thread John Jenkins
On Apr 20, 2004, at 5:11 PM, [EMAIL PROTECTED] wrote: The DOS editor chokes on such a large text file, so does my older hex editor. Thank goodness for BabelPad, otherwise it would've been hard to insert proper (for my system) line breaks into the file. BBEdit on the Mac tends to be unhappy with i

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-20 Thread jameskass
Raymond Mercier wrote, > John Jenkins writes > >>Also, even though the full Unihan database is 25+ Mb in size, given the > cheapness of disk space nowadays, it's not all *that* big, surely. > << > > The problem of the size of Unihan has nothing at all to do with the cost of > storage, and everyt

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-20 Thread John Cowan
Raymond Mercier scripsit: > Since the lines in Unihan are separated by 0x0A alone, not 0x0A0x0D, this > means that when opened in notepad the lines are not separated. Notepad does > have the advantage that the UTF-8 encoding is recognized, and the characters > are displayed. Changing to a line te
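For readers who do want Windows-style line terminators, a small Python sketch of the conversion follows. It assumes the target is the CR LF pair (bytes 0x0D 0x0A) and works on raw bytes so the UTF-8 content is left untouched. The name to_crlf is hypothetical.

```python
def to_crlf(data: bytes) -> bytes:
    """Convert bare LF line endings to CR LF, leaving existing CR LF pairs alone."""
    # Normalise any existing CR LF to LF first so the result has no CR CR LF.
    normalised = data.replace(b"\r\n", b"\n")
    return normalised.replace(b"\n", b"\r\n")


converted = to_crlf(b"line one\nline two\r\nline three\n")
```

Because the function normalises first, running it twice gives the same result as running it once.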

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-20 Thread D. Starner
"Raymond Mercier" <[EMAIL PROTECTED]> writes: > The problem of the size of Unihan has nothing at all to do with the cost of > storage, and everything to do with the functioning of programs that might > open and read it. It's a data file stored as a text file to be simple; it's not designed

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-20 Thread Raymond Mercier
John Jenkins writes >>Also, even though the full Unihan database is 25+ Mb in size, given the cheapness of disk space nowadays, it's not all *that* big, surely. << The problem of the size of Unihan has nothing at all to do with the cost of storage, and everything to do with the functioning of prog

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-20 Thread John Jenkins
On Apr 19, 2004, at 8:40 PM, Ernest Cline wrote: For example, if there is a value of kIRGKangXi of the form .YY0 there will always be the same value for the kKangXi for that character and vice versa. This is not a safe assumption. There are 37 cases where the kIRGKangXi field ends in 0 but t

Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-20 Thread Raymond Mercier
Ernest Cline writes >>I'm trying to pare Unihan.txt down to a less unwieldy size for my own use by eliminating properties that are of no interest to me << The sheer size of unihan creates problems, hence the need to extract manageable subsets. This is the basis of my Hanfind: (http://ourworld.
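Extracting a manageable subset can also be done with a short filter script. The Python sketch below assumes the three-column tab-separated layout ("U+XXXX", field name, value) and keeps only lines whose field name is in a chosen set. The function name keep_fields and the sample values are placeholders for illustration, not real Unihan data.

```python
def keep_fields(lines, wanted):
    """Yield only the lines whose second tab-separated column is in `wanted`."""
    wanted = set(wanted)
    for line in lines:
        parts = line.split("\t", 2)
        # Comment lines and malformed lines have no field column and are dropped.
        if len(parts) == 3 and parts[1] in wanted:
            yield line


# Placeholder sample lines; the values are invented for the example.
sample = [
    "# placeholder comment",
    "U+4E00\tkKangXi\t0000.000",
    "U+4E00\tkDefinition\tplaceholder gloss",
    "U+4E00\tkIRGKangXi\t0000.000",
]
subset = list(keep_fields(sample, {"kKangXi", "kIRGKangXi"}))
```

Writing the yielded lines to a new file gives a smaller table that still uses the same format as the original.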

Unihan.txt and the four dictionary sorting algorithm

2004-04-19 Thread Ernest Cline
While I would expect the answer to my question to be true, one never knows what lurks in the heart of data files. Unihan.txt contains at least two properties for each of the four dictionaries used in the sorting algorithm. One property contains only characters that are actually in the dictionary