Re: [i18n] grep framework.
William J Poser wrote:
> There is a lot of information on the GNU approach to i18n at:
> http://www.gnu.org/software/gettext/manual/gettext.html

Yes, that's an excellent place to start. It is based on the design philosophy underlying the i18n library functions defined in the original ANSI C standard, which were the foundation for both C++ and Java i18n as well. The idea is very simple, and incredibly powerful: everything that needs to be localized within a program is kept in an external configuration database, and the particular data is selected by a locale that the user chooses externally. It is essentially runtime polymorphism where the particular subclass is chosen outside the program. All the rest is the details of how it's done and what you can do. What gives it most of its power is that new locales and translations of message catalogs can be added without any further change to the programs that use them. More simply, you don't have to rewrite your code to support a new language.

For what it's worth, according to the gettext manual, there is an interface to the gettext library for shell scripts. It's documented here:
http://www.gnu.org/software/gettext/manual/html_mono/gettext.html#SEC197
The Bash Reference Manual is similarly terse about how to use it:
http://www.gnu.org/software/bash/manual/bashref.html#SEC13

--
D. Dale Gulledge, Sr. Programmer, [EMAIL PROTECTED]
C, C++, Perl, Unix (QNX, AIX, Linux), Oracle, Java, Internationalization (i18n), Lisp, HTML, CGI.

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
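As a hedged sketch of that shell interface: the "myscript" text domain below is a hypothetical name, and no message catalog is assumed to be installed, so the gettext command falls back to printing the msgid (which is its documented behavior).

```shell
#!/bin/sh
# Sketch of runtime message lookup from a shell script.
# "myscript" is a hypothetical text domain; with no catalog installed,
# gettext simply echoes the msgid back unchanged.
TEXTDOMAIN=myscript
export TEXTDOMAIN
greeting=$(gettext "Hello, world!")
echo "$greeting"
```

Once a translator supplies a myscript.mo catalog for a locale, the same script prints the translated string with no change to its logic.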
Re: [i18n] grep framework.
I would suggest using the same approach that is used within internationalized applications: look up the strings based on the locale. It would look something like this:

    if [ -f "${DIRECTORY}/script-strings-${LANG}.sh" ]; then
        . "${DIRECTORY}/script-strings-${LANG}.sh"
    else
        . "${DIRECTORY}/script-strings-${DEFAULT_LANG}.sh"
    fi
    # ...
    grep "$GREP_STRING_1" files

You will still need to localize the script for each language, but you have moved the localization out of the script logic.

Amarendra Godbole wrote:
> So my question is: how to deal with such scenarios? An immediate
> solution that strikes me is to use an if statement, something like
> this: (pseudocode)
>
>     if LANG = jp; then
>         grep '<>'
>     else
>         grep '<>'
>     endif
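To make the lookup concrete, here is a minimal self-contained sketch. The strings directory, the "C" default locale, and the 'error' pattern are all illustrative assumptions, not anything from the original script:

```shell
#!/bin/sh
# Self-contained demo of locale-based string lookup.
# DIRECTORY, DEFAULT_LANG, and the pattern are hypothetical choices.
DIRECTORY=./strings
DEFAULT_LANG=C
mkdir -p "$DIRECTORY"
# The default-locale strings file defines the patterns to grep for:
cat > "$DIRECTORY/script-strings-C.sh" <<'EOF'
GREP_STRING_1='error'
EOF
# Source the locale-specific file if it exists, else fall back:
if [ -f "$DIRECTORY/script-strings-${LANG}.sh" ]; then
    . "$DIRECTORY/script-strings-${LANG}.sh"
else
    . "$DIRECTORY/script-strings-${DEFAULT_LANG}.sh"
fi
printf 'an error occurred\nall fine\n' | grep "$GREP_STRING_1"
```

Adding a new language then means dropping in one more script-strings-*.sh file; the grep logic itself never changes.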
Re: [Bug 2077] - Abiword mangles a UTF-8 file open importing
The default encoding should certainly be based on the locale. However, there are any number of reasons why someone would edit text in multiple encodings. The most obvious is editing something sent to you by someone who is using a different encoding. For a single language this would probably involve UTF-8 and a single 8-bit encoding. For multilingual people, which includes a sizeable number of open source developers, it could involve multiple 8-bit encodings as well, which makes automatic detection impossible.

My own preference would be very simple: assume a default based on the locale, but allow selection of a different encoding on the fly. This is just a gentle suggestion because I am not currently using Abiword.

David Starner wrote:
> On Tue, Oct 30, 2001 at 10:54:01AM -0600, [EMAIL PROTECTED] wrote:
> > http://bugzilla.abisource.com/show_bug.cgi?id=2077
> >
> > --- Additional Comments From [EMAIL PROTECTED] 2001-10-30 10:54 ---
> > No, abiword shouldn't assume that your text is UTF-8 just because you're
> > running in a UTF-8 locale.
>
> Huh? That's part of the definition of a locale. Under a locale, the text
> encoding is the same as the terminal encoding, which is the same as the
> locale encoding. If the text encoding isn't the same as the terminal
> encoding, you can't use cat or more or grep or any other console program
> without recoding the output to screen. You couldn't redirect output to
> disk without recoding it. If the locale encoding differs from both of
> them, then what does it mean and why is it useful? Gettext, for one,
> uses the locale encoding for the terminal/text encoding.
>
> If I'm wrong, then someone please clarify, but I don't understand where
> you're coming from at all.
>
> --
> David Starner - [EMAIL PROTECTED]
> Pointless website: http://dvdeug.dhis.org
> "I saw a daemon stare into my face, and an angel touch my breast; each
> one softly calls my name . . . the daemon scares me less."
> - "Disciple", Stuart Davis
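One way to implement "assume the locale default, but verify" is a cheap validity check before opening the file. This sketch uses iconv as the oracle; the file name and contents are illustrative assumptions:

```shell
#!/bin/sh
# Hedged sketch: check whether a file is valid UTF-8 before trusting
# the locale default. iconv exits nonzero on an illegal byte sequence.
printf 'caf\303\251\n' > sample.txt    # "café" encoded as UTF-8
if iconv -f UTF-8 -t UTF-8 sample.txt >/dev/null 2>&1; then
    echo "valid UTF-8"
else
    echo "not UTF-8; offer the user an 8-bit encoding instead"
fi
rm -f sample.txt
```

A failed check can't tell you *which* 8-bit encoding the file uses, which is exactly why the on-the-fly selection suggested above is still needed.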
Re: unicode in emacs 21
"H. Peter Anvin" wrote:
> Does that mean you're painting yourself into a corner, though,
> requiring manual work to integrate the increasingly Unicode-based
> infrastructure support that is becoming available? Odds are pretty
> good that they are.

Since I volunteered to help with this effort, I'd like to know what's already out there. I agree that duplicating functionality in the Emacs code that is already available from supported free libraries would be a bad idea unless there is a compelling reason.

Of course, Emacs is buildable on most systems that have a working C compiler and a standard implementation of libc. Depending on anything else, unless it can be imported into the Emacs source tree, would be questionable.
Re: unicode in emacs 21
I haven't meant for anything I've written to suggest that Emacs is not a useful editor for UTF-8 encoded text. I have found it quite usable. I've had a couple of configuration headaches along the way, specifically because I am simultaneously maintaining files in both UTF-8 and Latin-3. If the alphabets you use fall within the ranges of characters that Emacs now handles, I can't see any strong argument against using Emacs.

I switched to the prereleases of Emacs 21 a few weeks ago specifically for the Unicode support. For me, there was really no option of choosing anything else, even if I had wanted to. I am doing some heavily customized work supported by a pile of Emacs Lisp code tailored to my data over the past 6 1/2 years. Emacs Lisp has saved me hundreds of hours. In the end, I would like to see Emacs use Unicode internally.

Oliver Doepner wrote:
> I was happy to see Emacs 21 announced, but the unicode support does not
> seem to have moved forward very much - as I have heard and read from some
> people.
>
> My question: what happened in this area in Emacs 21? Is the internal
> representation still the special MULE format?
> And are there any plans and/or activities to achieve these things?
Re: Automatic encoding guessing
Jim Breen wrote:
> About 8 years ago when UTF-8 first emerged (it was called UTF-FSS then,
> ISTR) I noticed that the usual Japanese code detection utilities
> invariably thought text containing UTF-8 was in Shift-JIS. So I
> wrote my own utility which reliably differentiated between UTF-8,
> Shift-JIS, EUC-JP and ISO-2022-JP. It didn't do anything fancy, except that
> it reversed the usual policy of looking for codes in a particular set.
> Instead it started with all possible sets as candidates and eliminated
> them each time it found one that didn't fit. It stopped once there was
> only one code left. This approach worked fine for the codes and coding
> mentioned above, as each code has a range where it alone is legal. I don't
> know how it would go if ISO-8859-* were added to the mix. I must dust it
> off and see.

I like your solution. However, even that won't solve one problem with the character sets and the way Emacs handles them. 7-bit ASCII is a valid subset of an enormous number of character sets. If you have a file containing only ASCII characters that is intended to be encoded in a more complete character set, it requires input from the user to decide which one to use.

Now we come to the Emacs issue. The characters in UTF-8 and ISO 8859-x have not been unified in Emacs' internal representation. Thus, inserting the "same character" in UTF-8 or ISO 8859-3 requires inserting a different character into the buffer. Choosing an encoding that isn't compatible with the characters inserted by your input method can be a pain. I've done this more than once with precisely the two character sets I named because I am maintaining files in both encodings. My point is that there isn't a universal solution so long as we want support for the character sets we are discussing.
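The elimination strategy can be sketched in a few lines of shell, again using iconv as the per-encoding legality test. The sample bytes and candidate list are illustrative assumptions, and an encoding name this iconv doesn't know is simply treated as eliminated:

```shell
#!/bin/sh
# Hedged sketch of elimination-based detection: start with all
# candidate encodings and drop each one under which the bytes are
# illegal. iconv serves as the legality oracle.
printf '\343\201\202\n' > sample.txt   # UTF-8 bytes for HIRAGANA LETTER A
for enc in UTF-8 SHIFT-JIS EUC-JP ISO-2022-JP; do
    if iconv -f "$enc" -t UTF-8 sample.txt >/dev/null 2>&1; then
        echo "still a candidate: $enc"
    fi
done
rm -f sample.txt
```

A real detector would stream the file and stop as soon as one candidate remains; this only shows the eliminate-on-illegal-byte idea. It also illustrates the ASCII caveat above: a pure-ASCII file leaves every candidate standing.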
Re: Automatic encoding guessing
"H. Peter Anvin" wrote:
> Followup to: <[EMAIL PROTECTED]>
> By author: David Starner <[EMAIL PROTECTED]>
> In newsgroup: linux.utf8
>
> > On Tue, Oct 23, 2001 at 11:05:45AM -0700, H. Peter Anvin wrote:
> > > > - ISO 8859 files should be free of C1 and most C0 codes (except
> > > > for the usual LF/TAB).
> > >
> > > I have also had Emacs 20 garble data because of the above assumption :(
> >
> > What were you editing? Many C0 codes (except CR/LF/TAB/FF/BS/VT) and C1
> > codes are basically binary garbage; an ISO-8859-* document that
> > contains them is really more some type of rich text or binary format.
>
> Files with control codes as markup. More common than you seem to
> think.

For something like that, I could certainly accept that Emacs could not guess the encoding. But I would rather see it ask about unusual cases instead of guessing wrong.
Re: Locales and Emacs 21
David Starner wrote:
> On Tue, Oct 23, 2001 at 09:44:14AM -0400, D. Dale Gulledge wrote:
> > is saved the same way. So I guess my question is whether there is
> > already a tool out there that will tell me whether a file is UTF-8 or
> > ISO 8859.
>
> Recent versions of file. It's not 100%, but unless all you have is
> Nestle'(r) (in iso-8859-1) in the file, it should get it right.

I tried it. For the files I have, `file' correctly distinguished between UTF-8 and 8859. It is not able to determine which flavor of 8859, but I expected that.
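For reference, the kind of check being described looks like this. The file names are hypothetical, the Nestlé example follows David's, and the exact description strings printed by file(1) vary between versions:

```shell
#!/bin/sh
# Hedged sketch of using file(1) to tell UTF-8 apart from ISO 8859.
printf 'Nestl\303\251\n' > utf8.txt     # é as the UTF-8 pair 0xC3 0xA9
printf 'Nestl\351\n'     > latin1.txt   # é as the single byte 0xE9
file utf8.txt latin1.txt
rm -f utf8.txt latin1.txt
```

As noted above, file can report "ISO-8859 text" but not which part of 8859, since the single byte 0xE9 is a legal character in every Latin variant.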
Re: LT: Great news!
oliver doepner wrote:
> ==
> This message was posted as a talkback at
> /news_story.php3?ltsn=2001-10-22-010-20-NW-SW
> ==
> Hi,
> I am happy to see GNU Emacs moving forward this
> way. I will try it soon. Thanks for the work of
> all the developers!!
>
> What about the Unicode support?
>
> I was really waiting for the MULE-UCS sort of stuff
> to become a core part of my favourite editor. I heard
> that the internal MULE representation scheme was to
> be replaced by UTF-8?!

I've been using Emacs 21 for the past couple of pre-releases, and I have recently converted a substantial amount of my work from Latin-3 to UTF-8. The internal representation is still the same as it was. One of the downsides of that is that the characters from the ISO 8859 character sets other than 8859-1 (Latin-1) have not been unified with the Unicode characters. Thus, the "same character" from 8859-3 (Latin-3) and UTF-8 is not the same character internally.

That causes two problems. First, your input method must produce the correct characters. I use a version of latin-ltx that I modified to use the latin-3-prefix key sequences for input. The other problem is that characters from an 8859-x buffer (other than 8859-1) and the same characters from a UTF-8 buffer don't cut and paste.

However, I have been successfully using Latin-3 for some older stuff (.po files from the Translation Project) alongside UTF-8 for other files. I have occasionally tripped myself up, but it works pretty well. If you have specific questions, I'd be happy to take a shot at answering them.
Re: Sorry for broken UTF-8 files recently...
"H. Peter Anvin" wrote:
> I just noticed I have managed to post a bunch of broken UTF-8 messages
> recently. It looks like Emacs has looked at the UTF-8 sequences and
> decided they're not an encoding it understands, so it's kindly decided
> to "correct" them for me. *Sigh.* Anyone know how to tell Emacs
> 20.x to stop trying to do anything but let me enter my own bytes,
> until a UTF-8 capable version of Emacs rolls around?

I'm using oc-unicode and Mule-UCS with fairly good success, but I haven't been using Emacs for e-mail. However, in a quick test a minute ago, I discovered that setting set-buffer-file-coding-system to utf-8 and set-input-method to latin-3 appears to produce a message encoded in latin-3, which is my default encoding. The message is readable, but not what I had hoped for.

Peter, I'm guessing that your problem is with:
> These symbols can be combined to form an upper case Greek letter
> sigma (Σ, U+03A3) of any square size, 2x2 or larger, on a monospaced
> terminal.

One possible source of your problem unrelated to Emacs may be this:
> X-MIME-Autoconverted: from 8bit to quoted-printable by deepthought.transmeta.com id VAA29898

- Dale