Re: [i18n] grep framework.

2005-10-29 Thread D. Dale Gulledge

William J Poser wrote:

There is a lot of information on the GNU approach to i18n
at: http://www.gnu.org/software/gettext/manual/gettext.html.


Yes, that's an excellent place to start.  It is based on the design 
philosophy that underlies the i18n library functions that were defined 
in the original ANSI C standard.  Those were the foundation for both C++ 
and Java i18n as well.


The idea is very simple, and incredibly powerful.  Everything that needs 
to be localized within a program should be contained in an external 
configuration/database.  The particular data is selected based on an 
externally chosen locale that can be set by the user in some way.


It is essentially runtime polymorphism where the particular subclass is 
chosen externally from the program.  All the rest is the details of how 
it's done and what you can do.  What gives it most of its power is the
fact that new locales and translations of message catalogs can be done 
without additional changes to the programs that support them.  More 
simply, you don't have to rewrite your code to support a new language.


For what it's worth, according to the gettext manual, there is an
interface to the gettext library for shell scripts.  It's documented,
rather tersely, here:


http://www.gnu.org/software/gettext/manual/html_mono/gettext.html#SEC197

The Bash Reference Manual is similarly terse about how to use it:

http://www.gnu.org/software/bash/manual/bashref.html#SEC13
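For the curious, the shell interface amounts to sourcing gettext.sh and calling gettext/eval_gettext.  Here is a hedged sketch; the myscript domain is made up, the /usr/bin path is an assumption, and with no compiled catalog installed both calls simply fall through to the untranslated msgid:

```shell
#!/bin/sh
# Sketch of the gettext.sh interface the manual documents.  If gettext.sh
# is not installed, fall back to identity functions so the script still
# runs, just untranslated.
if [ -r /usr/bin/gettext.sh ]; then
  . /usr/bin/gettext.sh
else
  gettext() { printf '%s' "$1"; }
  eval_gettext() { eval "printf '%s' \"$1\""; }
fi

TEXTDOMAIN=myscript   # illustrative domain; no catalog is installed for it
export TEXTDOMAIN

msg1=$(gettext 'Processing files...')
count=3
# eval_gettext substitutes shell variables into the translated string.
msg2=$(eval_gettext 'Found $count matches')
echo "$msg1"
echo "$msg2"
```

With a real .mo catalog installed under TEXTDOMAINDIR, the same two calls would return the translated strings instead.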

--
D. Dale Gulledge, Sr. Programmer,
[EMAIL PROTECTED]
C, C++, Perl, Unix (QNX, AIX, Linux), Oracle, Java,
Internationalization (i18n), Lisp, HTML, CGI.

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/



Re: [i18n] grep framework.

2005-10-27 Thread D. Dale Gulledge
I would suggest using the same approach that is used within 
internationalized applications.  Look up the strings based on the 
locale.  It would look something like this:


if [ -f "${DIRECTORY}/script-strings-${LANG}.sh" ]; then
  . "${DIRECTORY}/script-strings-${LANG}.sh"
else
  . "${DIRECTORY}/script-strings-${DEFAULT_LANG}.sh"
fi

# ...

grep "$GREP_STRING_1" files

You will still need to localize the script for each language, but you 
have moved the localization out of the script logic.
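A slightly more defensive version of that lookup tries a fallback chain: the full locale, then the bare language code, then the default.  This sketch creates two throwaway catalogs to show the effect; the file names and the en/eo strings are made up for illustration:

```shell
#!/bin/sh
# Sketch: source a strings file chosen by locale, falling back from
# "eo_XX.UTF-8" to "eo" to a default language.
DIRECTORY=$(mktemp -d)
printf "GREP_STRING_1='error'\n" > "$DIRECTORY/script-strings-en.sh"
printf "GREP_STRING_1='eraro'\n" > "$DIRECTORY/script-strings-eo.sh"

LANG=eo_XX.UTF-8   # forced here so the example is reproducible
DEFAULT_LANG=en

# ${LANG%.*} strips the codeset; ${LANG%%[_.]*} strips territory and codeset.
for candidate in "${LANG%.*}" "${LANG%%[_.]*}" "$DEFAULT_LANG"; do
  if [ -f "$DIRECTORY/script-strings-$candidate.sh" ]; then
    . "$DIRECTORY/script-strings-$candidate.sh"
    break
  fi
done

echo "$GREP_STRING_1"   # the eo catalog wins here
rm -r "$DIRECTORY"
```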


Amarendra Godbole wrote:


So my question is: how to deal with such scenarios? An immediate
solution that strikes me is to use the if loop, something like
this: (pseudocode)
if LANG = jp; then
  grep '<>'
else
  grep '<>'
endif





Re: [Bug 2077] - Abiword mangles a UTF-8 file open importing

2001-10-30 Thread D. Dale Gulledge

The default encoding should certainly be based on the locale.  However,
there are any number of reasons why someone would edit text with
multiple encodings.  The most obvious is when you are editing something
that has been sent to you by someone else who is using a different
encoding.  For a single language this would probably involve UTF-8 and a
single 8-bit encoding.  For multilingual people, a group that includes a
sizeable number of open source developers, it could involve multiple
8-bit encodings as well, which makes automatic detection impossible.

My own preference would be very simple.  Assume a default based on the
locale, but allow selection of a different encoding on the fly.  This is
just a gentle suggestion because I am not currently using Abiword.
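In shell terms, the locale-derived default suggested above can be queried directly; this is just a sketch, and an application like Abiword would use nl_langinfo(CODESET) from C rather than shelling out:

```shell
#!/bin/sh
# Query the character encoding implied by the current locale.  This is the
# value an editor could reasonably assume as its default.
charset=$(locale charmap)
echo "Locale default encoding: $charset"
```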

David Starner wrote:
> 
> On Tue, Oct 30, 2001 at 10:54:01AM -0600, [EMAIL PROTECTED] wrote:
> > http://bugzilla.abisource.com/show_bug.cgi?id=2077
> >
> > --- Additional Comments From [EMAIL PROTECTED]  2001-10-30 10:54 ---
> > No, abiword shouldn't assume that your text is UTF-8 just because you're
> > running in a UTF-8 locale.
> 
> Huh? That's part of the definition of a locale. Under a locale, the text
> encoding is the same as the terminal encoding, which is the same as the
> locale encoding. If the text encoding isn't the same as the terminal
> encoding, you can't use cat or more or grep or any other console program
> without recoding the output to screen. You couldn't redirect output to
> disk without recoding it. If the locale encoding differs from both of
> them, then what does it mean and why is it useful? Gettext, for one,
> uses the locale encoding for the terminal/text encoding.
> 
> If I'm wrong, then someone please clarify, but I don't understand where
> you're coming from at all.
> 
> --
> David Starner - [EMAIL PROTECTED]
> Pointless website: http://dvdeug.dhis.org
> "I saw a daemon stare into my face, and an angel touch my breast; each
> one softly calls my name . . . the daemon scares me less."
> - "Disciple", Stuart Davis
> -
> Linux-UTF8:   i18n of Linux on all levels
> Archive:  http://mail.nl.linux.org/linux-utf8/

-- 
D. Dale Gulledge, Sr. Programmer,
[EMAIL PROTECTED]
C, C++, Perl, Unix (AIX, Linux), Oracle, Java,
Internationalization (i18n), Awk.



Re: unicode in emacs 21

2001-10-27 Thread D. Dale Gulledge

"H. Peter Anvin" wrote:

> Does that mean you're painting yourself into a corner, though,
> requiring manual work to integrate the increasingly Unicode-based
> infrastructure support that is becoming available?  Odds are pretty
> good that they are.

Since I volunteered to help with this effort, I'd like to know what's
already out there.  I agree that duplicating functionality in the Emacs
code that is already available from supported free libraries would be a
bad idea unless there is a compelling reason.  Of course, Emacs is
buildable on most systems that have a working C compiler and a standard
implementation of libc.  Depending on anything else, unless it can be
imported into the Emacs source tree, would be a questionable idea.




Re: unicode in emacs 21

2001-10-25 Thread D. Dale Gulledge

I haven't meant for anything I've written to indicate that Emacs is not
a useful editor for UTF-8 encoded text.  I have found it quite usable. 
I've had a couple of configuration headaches along the way specifically
because I am simultaneously maintaining files in both UTF-8 and Latin-3.

If the alphabets you use fall within the ranges of characters that Emacs
now handles, I can't see any strong argument not to use Emacs.  I
switched to the prereleases of Emacs 21 a few weeks ago specifically for
the Unicode support.  For me, there was really no option of choosing
anything else, even if I had wanted to.  I am doing some heavily
customized stuff supported by a pile of Emacs Lisp code tailored to my
data over the past 6 1/2 years.  Emacs Lisp has saved me hundreds of
hours.

In the end, I would like to see Emacs use Unicode internally.

Oliver Doepner wrote:

> I was happy to see Emacs 21 announced. but the unicode support does not
> seem to have moved forward very much - as i have heard and read from some
> people.
> 
> my question: what happened in this area in Emacs 21 ?? Is the internal
> representation still the special MULE format ??~
> And are there any plans and/or activities to achieve these things ?




Re: Automatic encoding guessing

2001-10-24 Thread D. Dale Gulledge

Jim Breen wrote:

> About 8 years ago when UTF-8 first emerged (it was called UTF-FSS then
> ISTR) I noticed that the usual Japanese code detection utilities
> invariably thought text containing UTF-8 was in Shift-JIS. So I
> wrote my own utility which reliably differentiated between UTF-8,
> Shift-JIS, EUC-JP and ISO-2022-JP. It didn't do anything fancy; except that
> it reversed the usual policy of looking for codes in a particular set.
> Instead it started with all possible sets as candidates and eliminated
> them each time it found one that didn't fit. It stopped once there was
> only one code left. This approach worked fine for the codes and coding
> mentioned above, as each code has a range where it alone is legal. I don't
> know how it would go if ISO-8859-* were added to the mix. I must dust it
> off and see.

I like your solution.  However, even that won't solve one problem with
the character sets and the way Emacs handles them.  7-bit ASCII is a
valid subset of an enormous number of character sets.  If you have a
file containing only ASCII characters that is intended to be encoded in
a more complete character set, it requires input from the user to
distinguish which one to use.  Now we come to the Emacs issue.  The
characters in UTF-8 and ISO 8859-x have not been unified in Emacs'
internal representation.  Thus, to insert the "same character" in UTF-8
or ISO 8859-3 requires inserting a different character into the buffer. 
Choosing an encoding that isn't compatible with the characters inserted
by your input method can be a pain.  I've done this more than once with
precisely the two character sets I named because I am maintaining files
in both encodings.

My point is that there isn't a universal solution so long as we want
support for the character sets we are discussing.
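The elimination approach can be sketched in shell with iconv standing in as the legality test for each candidate (an assumption; the utility described above predates iconv and used its own tables).  Note that ISO-8859-1 accepts every byte sequence, so it can only ever be the last resort, and that pure-ASCII input, valid everywhere, simply reports the first candidate, which is exactly the ambiguity discussed above:

```shell
#!/bin/sh
# Sketch: try candidate encodings in priority order and report the first
# one that accepts every byte of the input; iconv's exit status is the
# validity oracle.
detect_encoding() {
  for enc in UTF-8 ISO-2022-JP EUC-JP SJIS ISO-8859-1; do
    if iconv -f "$enc" -t UTF-8 "$1" >/dev/null 2>&1; then
      echo "$enc"
      return 0
    fi
  done
  echo UNKNOWN
  return 1
}

printf 'caf\303\251\n' > /tmp/detect-demo.txt   # "café" in UTF-8
detect_encoding /tmp/detect-demo.txt            # prints UTF-8
```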




Re: Automatic encoding guessing

2001-10-24 Thread D. Dale Gulledge

"H. Peter Anvin" wrote:
> 
> Followup to:  <[EMAIL PROTECTED]>
> By author:David Starner <[EMAIL PROTECTED]>
> In newsgroup: linux.utf8
> >
> > On Tue, Oct 23, 2001 at 11:05:45AM -0700, H. Peter Anvin wrote:
> > > >   - ISO 8859 files should be free of C1 and most C0 codes (except
> > > > for the usual LF/TAB).
> > >
> > > I have also had Emacs 20 garble data because of the above assumption
> > > :(
> >
> > What were you editing? Many C0 codes (except CR/LF/TAB/FF/BS/VT) and C1
> > codes are basically binary garbage; an ISO-8859-* document that
> > contains them is really more some type of rich text or binary format.
> >
> 
> Files with control codes as markup.  More common than you seem to
> think.

For something like that, I could certainly accept that Emacs could not
guess the encoding.  But I would rather see it ask about unusual cases
instead of guessing wrong.




Re: Locales and Emacs 21

2001-10-24 Thread D. Dale Gulledge

David Starner wrote:
> 
> On Tue, Oct 23, 2001 at 09:44:14AM -0400, D. Dale Gulledge wrote:
> > is saved the same way.  So I guess my question is whether there is
> > already a tool out there that will tell me whether a file is UTF-8 or
> > ISO 8859.
> 
> Recent versions of file. It's not 100%, but unless all you have is
> Nestle'(r) (in iso-8859-1) in the file, it should get it right.

I tried it.  For the files I have, `file' correctly distinguished between
UTF-8 and 8859.  It is not able to determine which flavor of 8859, but I
expected that.




Re: LT: Great news!

2001-10-22 Thread D. Dale Gulledge

oliver doepner wrote:
> 
> ==
> This message was posted as a talkback at 
>/news_story.php3?ltsn=2001-10-22-010-20-NW-SW
> ==
> Hi,
> I am happy to see GNU Emacs moving forward this
> way. I will try it soon. Thanks for the work of
> all the developers!!
> 
> What about the Unicode support ?
> 
> I was really waiting for the MULE-UCS sort of stuff
> to become a core part of my favourite Editor. I heard
> that the internal MULE representation scheme was to
> be replaced by UTF-8 ?!

I've been using Emacs 21 for the past couple of pre-releases and I have
recently converted a substantial amount of my work from Latin-3 to
UTF-8.  The internal representation is still the same as it was.  One of
the downsides of that is that the characters from the ISO 8859 character
sets other than 8859-1 (Latin-1) have not been unified with the Unicode
characters.  Thus, the "same character" from 8859-3 (Latin-3) and UTF-8
is not the same character internally.  That causes two problems.  First,
your input mode must produce the correct characters.  I use a version of
latin-ltx that I modified to use the latin-3-prefix key sequences for
input.  The other problem is that characters from an 8859-x buffer
(other than 8859-1) and the same characters from a UTF-8 buffer don't
cut and paste.

However, I have been successfully using Latin-3 for some older stuff
(.po files from the Translation Project) alongside UTF-8 for other
files.  I have occasionally tripped myself up, but it works pretty
well.  If you have specific questions, I'd be happy to take a shot at
answering them.




Re: Sorry for broken UTF-8 files recently...

2001-06-05 Thread D. Dale Gulledge

"H. Peter Anvin" wrote:

> I just noticed I have managed to post a bunch of broken UTF-8 messages
> recently.  It looks like Emacs has looked on the UTF-8 sequences and
> decided they're not an encoding it understands, so it's kindly decided
> to "correct" them for me.  *Sigh.*  Anyone knows how to tell Emacs
> 20.x to stop trying to do anything but let me enter my own bytes,
> until an UTF-8 capable version of Emacs rolls around?

I'm using oc-unicode and Mule-UCS with fairly good success, but I
haven't been using Emacs for e-mail.  However, on a quick test a
minute ago, I discovered that using set-buffer-file-coding-system to
utf-8 and set-input-method to latin-3 appears to produce a message
encoded in latin-3, which is my default encoding.  The message is
readable, but not what I had hoped for.

Peter, I'm guessing that your problem is with:

> These symbols can be combined to form an upper case Greek letter
> sigma (Σ, U+03A3) of any square size, 2x2 or larger, on a monospaced
> terminal.

One possible source of your problem, unrelated to Emacs, may be this:

> X-MIME-Autoconverted: from 8bit to quoted-printable by
> deepthought.transmeta.com id VAA29898

- Dale