On Fri, Dec 26, 2025 at 09:04:53AM -0800, Ted Mittelstaedt wrote:
> Not really surprising.
> 
> Copying and pasting into vim is a no-go because the distributors of vim 
> decided when they coded up a rip-off of the actual 'vi' command to add
> In UTF-8 support - even though the entire command-line terminal environment 
> that vim is used in - is really an ASCII environment NOT a UTF-8 environment.

Copying and pasting with full Unicode support works just fine in Vim
when things are configured correctly. I rely on this daily and don't
usually have to think about it anymore. The problem is that many
programs like Vim default to legacy settings for compatibility if not
configured otherwise. Some distributions like Fedora do a lot more to
configure and modernize the environment, whereas others like Arch and
Slackware expect the end user to configure everything themselves. I
don't rely on my distribution to configure anything; I keep all my core
home-directory rc files in a Git repository that I clone to every new
system I work on. It has all the customizations in .bashrc, .vimrc,
.muttrc, etc. to ensure everything is properly enabled, and its history
goes back 20 years now, to when I first made the effort to ensure
everything defaults to UTF-8 even if I am stuck on a system configured
for the C locale or another legacy encoding.

> 
> It's important to understand that vim IS NOT vi.  This is a common 
> misperception by newcomers to vi   (which, in my personal option, is the 
> greatest
> Text editor ever invented)

I am just young enough to have been introduced to Vim, and my only Vi
editor was Vim, which I ran in Vi-compatible mode in the early days.
Well, apart from a short stint playing on Solaris 2.6...

> I had intended you copy and paste into the GUI text editor that comes with 
> Linux since you were copying and pasting from a web page - I had not assumed 
> you were running the command-line version of a web browser :-)   As you 
> discovered notepadqq also supports the UTF-8 stuff but it at least 
> understands when it writes out a textfile that "text" means ascii.

In my terminal environment, I am primarily working in UTF-8, not
ASCII. This requires two main things. The first is that I have the
appropriate environment variables set, which means I have this in my
.bashrc (which is also sourced from my .bash_profile):

export LANG=en_US.UTF-8

The second is that my terminal, whether GNOME Terminal, Konsole, or
PuTTY, is configured for the UTF-8 character encoding. If I selected a
UTF-8 locale when installing Debian, then both of these items are
normally already done.

With that, I just have to make sure that Vim is configured in
non-compatible mode and has the following lines:

set fileencodings=ucs-bom,utf-8,default,cp1252
set encoding=utf-8
set fileformats=unix,dos

The first line means that if Vim detects a file with a UTF-16 BOM
character leading the file (a common occurrence with files saved from
Windows Notepad), it will auto-convert to/from UTF-8 on load/save. If
not, it will try to load the file as UTF-8, and if that fails, it will
fall back to the default encoding (typically latin1) and then to
Windows-1252, another legacy encoding that Windows Notepad uses when
saving as "ANSI" and which adds certain accented and typographic
characters beyond latin1. This lets me load a lot of files without ever
thinking much about the encoding; it just works. While working in the
editor, everything is UTF-8, and since my terminal is also in UTF-8, it
all displays correctly. Vim converts back to the original encoding on
save.
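That fallback logic can be sketched in a few lines of Python (an
illustration of the idea only, not Vim's actual implementation; the
function name is my own):

```python
def detect_and_decode(raw: bytes) -> str:
    """Mimic a 'fileencodings'-style fallback: BOM first, then strict
    UTF-8, then a legacy 8-bit encoding such as Windows-1252."""
    # A UTF-16 BOM at the start of the file wins immediately.
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return raw.decode("utf-16")
    try:
        # Strict UTF-8 decode; invalid byte sequences raise an error.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to Windows-1252, typical of legacy Notepad files.
        return raw.decode("cp1252")
```

The key point is that strict UTF-8 decoding fails loudly on legacy
8-bit text, which is what makes trying UTF-8 first safe.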

Oh, and the last line just lets it detect UNIX or MS-DOS (Windows)
line endings, which it auto-converts back and forth on load/save.

As for copy and paste, as long as the terminal and Vim are both in
UTF-8, copy/paste should work as well. This is true both of the
terminal's built-in copy/paste and of X11 clipboard access. If you are
running a version of Vim compiled with X11 support in a terminal and
your $DISPLAY variable is set appropriately, then Vim can access the
X11 clipboard by yanking and pasting with the "+ register. It will
work with full Unicode support as long as your X11 environment is
properly configured for UTF-8.

As a test, I put together a little webpage, writing it in Vim inside a
terminal over a remote SSH connection to a server and using SSH X11
forwarding to paste from the clipboard of the X11 session on my
laptop. I found a wide variety of Unicode characters and emoji, and
they pasted just fine into Vim and also displayed correctly inside the
terminal session. I then saved the HTML webpage, made sure that it
properly declared itself as UTF-8 to the browser, and the page
rendered in Firefox as expected. As a further test, I also opened that
same webpage in all four of the classic text-mode browsers on Debian,
with mostly good results. The elinks and lynx browsers rendered nearly
everything except for emoji sequences that use a special joiner
control character to combine two emoji into one symbol; nearly
everything else worked, right down to the colorized emoji font that I
have installed and that is accessible to my terminal. links and links2
had more trouble and did not render most of the more exotic
characters, but they still rendered the simple arrow symbols and
accented characters. This is just a limitation of the Unicode support
they have built in. If you want to take a look yourself, here's my
test page:

http://www.north-winds.org/emoji.html

> 
> There's an interesting discussion of the conversion problems here with some 
> suggestions you could use at the command line:
> 
> https://unix.stackexchange.com/questions/171832/converting-a-utf-8-file-to-ascii-best-effort

There are some interesting suggestions there if you really have to
deal with an environment that can't handle non-ASCII text.

> 
> One commentor recommended this program:
> 
> https://manpages.ubuntu.com/manpages/jammy/man1/konwert.1.html
> 
> I know that this is going to sound terribly privileged and nationalistic but 
> the fact is that the UNIX operating system was invented in the United States 
> not in any other country, and the simple reality is that every other country 
> has had the same access to electronics knowledge and scientific information 
> since the invention of the vacuum tube - but every other government and 
> culture on the fact of the Earth pretty much didn't value any of that "tech 
> stuff" until AFTER us Americans invented it.  And NOW, they all want a piece 
> of the action.  Well OK maybe if they all had valued open information, the 
> free exchange of ideas, scientific advancement, much more than they valued 
> dictatorial socio-religious crap used to tell people what to do and how to 
> live and who to screw, then MAYBE they would have gotten to the digital age 
> FIRST and then maybe us Americans would have to learn Chinese if we wanted to 
> write software.  (there's a reason the Americans using stone knives and 
> bearskins made it to the moon and back and the Chinese today even though they 
> manufacture tech that would knock 1969 NASA tech into a cocked hat - still 
> haven't made it there) Get me drift, here?

Indeed, I will agree that the entire attempt to add multi-byte
encodings to POSIX has been a big hack, but it's a hack that's now
over 30 years old and pretty well understood at this point. My
favorite example is the early double-byte encodings like EUC-JP. They
have a mix of single- and double-byte characters which can be properly
decoded when reading left to right. Now, one of the tricks was that
all double-byte characters are also displayed in a double-wide font.
This both helps the many terminal programs that use strlen() to count
the length of a string to determine how much screen real estate it
will take up, and also allows for the more complex rendering needed
for those Japanese characters. It indeed very much feels like a hack,
but that is now life.
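The byte-count-equals-cell-count trick can be illustrated in Python
using the Unicode East Asian Width property, which is roughly how
modern wcswidth()-style functions decide cell width (a simplified
sketch; the function name is my own, and it ignores combining and
control characters):

```python
import unicodedata

def display_width(s: str) -> int:
    """Count terminal cells: Wide (W) and Fullwidth (F) characters
    occupy two cells; everything else here counts as one."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)
```

So "日本" is 2 characters needing 4 terminal cells, and in EUC-JP it
was also stored as 4 bytes, which is why the old strlen() shortcut
happened to give the right answer there.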

> 
> UTF-8 was tacked on to UNIX as a way of accommodating the rest of the world 
> who frankly couldn't give a tinker's damn about the digital age - until we 
> Americans started kicking their butts with it.  So it's NEVER going to be 
> completely fully integrated into the Linux experience the way ASCII is.  If, 
> you, Randall, are stuck having to deal with that interface of American 
> computing to rest of the world computing - your going to always have to deal 
> with this fundamental mismatch.

UTF-8 is even trickier because a character can be anywhere from 1 to 4
bytes long even if it only takes up one character cell on the screen,
so it's now imperative to use functions like wcswidth() to find the
needed screen real estate instead of counting bytes like strlen()
does; any effort beyond the legacy 8-bit character sets requires extra
work to implement. Still, UTF-8 is actually quite clean in several
ways for such a large character set. For one, it's much easier to go
backwards, such as when implementing backspace support. A byte with a
leading 0 bit is a valid single-byte US-ASCII character in the range
0x00 to 0x7f. A byte with two leading 1 bits, in other words 0xc0 to
0xff, is the first byte of a multi-byte character. In fact, its
precise range tells you immediately whether this is a 2-, 3-, or
4-byte sequence, so you can skip forward if needed. All later bytes in
such a sequence have a leading 1 followed by a 0, i.e., 0x80 to 0xbf.
To backspace, if the right-most byte is in the 0x80 to 0xbf range,
keep deleting bytes until you find the 0xc0 to 0xff byte and remove
that as the final byte of the backspace. Many other legacy multi-byte
encodings have more complicated rules.
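That backspace rule is short enough to sketch directly (a hypothetical
helper operating on a raw UTF-8 byte buffer; it assumes the buffer is
valid UTF-8):

```python
def backspace_utf8(buf: bytes) -> bytes:
    """Remove the last character from a UTF-8 byte buffer by scanning
    backwards past continuation bytes (0x80-0xbf) to the lead byte."""
    if not buf:
        return buf
    i = len(buf) - 1
    # Skip continuation bytes of the form 10xxxxxx.
    while i > 0 and 0x80 <= buf[i] <= 0xbf:
        i -= 1
    # buf[i] is now either an ASCII byte (0x00-0x7f) or a lead byte
    # (0xc0-0xff); drop it along with the continuation bytes after it.
    return buf[:i]
```

With a shift-state encoding like ISO-2022-JP you can't do this without
rescanning from the start of the line, which is what makes UTF-8's
self-synchronizing design such a relief.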

> 
> What I find most interesting in all of this is that the tech types in the 
> REST of the world fully accept this - THEY are NOT in general the ones 
> complaining about the second-class citizen status of UTF-8.  They know that 
> they came second, they know they came in second because the majority of 
> people in their culture don't value freedom of choice, and all that other 
> stuff needed for scientific advancement, and they accept that their native 
> languages play second fiddle to ASCII.  They type "rm" and "ls" and all the 
> other ASCII commands in UNIX/Linux without complaint, and they generally 
> don't have a problem spending time on this conversion stuff...it's us 
> Americans who are mostly bitching and complaining about it...not realizing 
> that we won the digital war, here.... (hell, even Linus Torvalds gave up his 
> Finnish citizenship and became a US citizen, that really ought to tell you 
> something)
> 
> Ted
> 
> 
> -----Original Message-----
> From: PLUG <[email protected]> On Behalf Of American Citizen
> Sent: Thursday, December 25, 2025 1:33 PM
> To: [email protected]
> Subject: Re: [PLUG] Ascii versus UTF-8 woes
> 
> Ted:
> 
> I am using vim, but when I attempt to write the UTF-8 file which I saved from 
> the internet browser cut and paste command, into ascii format, vim fails with 
> a curious error
> 
> vim command:
> 
> :write ++enc=ASCII my_ascii_file.txt
> 
> I get the following error:
> 
> "my_ascii_file.txt" E513: Write error, conversion failed (make 'fenc' 
> empty to override)
> WARNING: Original file may be lost or damaged don't quit the editor until the 
> file is successfully written!
> Press ENTER or type command to continue
> 
> And trying to internally set the values of encoding and file encoding seems 
> to work
> 
> :set encoding=ascii
> 
> :set fileencoding=ascii
> 
> except when you double check the encoding, it stays at utf-8
> 
> but the fileencoding appears to be changed to the new value=ascii
> 
> But then when you attempt to overwrite the file or write to a new file, vim 
> throws errors again
> 
> "new_file.txt" E513: Write error, conversion failed (make 'fenc' empty to 
> override)
> WARNING: Original file may be lost or damaged don't quit the editor until the 
> file is successfully written!
> Press ENTER or type command to continue
> 
> So I am unable to get linux vim version 9.1.83 to work to change the encoding.
> 
> I had to actually use notepadqq to paste the browser text and then set the 
> encoding to ascii and this seems to work.
> 
> I suppose you could pipe the file and let tr strip off the non-ascii 
> characters ??? But this means going back in and manually comparing the two 
> files, to see how to fix the omitted characters (if possible)
> 
> TexStudio crashed mysteriously when I turned off its internal file scanning 
> so I had to set the option again.
> 
> Supposedly there is some tex sty code which allows UTF-8 to be used in a tex 
> file. And yes, my editor settings under TexStudio IS UTF-8
> 
> I already have used up at least an hour of time on this problem as iconv 
> doesn't really change a pure ascii file into a UTF-8 file and vim was failing 
> me.
> 
> Randall
> 
> On 12/25/25 11:28, Ted Mittelstaedt wrote:
> > Open the regular textedit, paste into there, save, open the saved file 
> > in TexStudio
> >
> > Ted
> >
> > -----Original Message-----
> > From: PLUG <[email protected]> On Behalf Of American 
> > Citizen
> > Sent: Wednesday, December 24, 2025 7:40 PM
> > To: Portland Linux/Unix Group <[email protected]>
> > Subject: [PLUG] Ascii versus UTF-8 woes
> >
> > Hi:
> >
> > I have a set of tex files which are in pure ascii format. Unfortunately 
> > when I copy material from the internet (Mozilla Firefox browser) it is in 
> > UTF-8 format, not ascii. This appears to be standard behavior for the 
> > internet browsers.
> >
> > When I paste the material into the tex document (using TexStudio) the 
> > paste goes okay. It only blows up when I try to save the newer file. 
> > The
> > UTF-8 characters cannot be saved in ascii format and for some bizarre 
> > reason Tex Studio wont' change the encoding to UTF-8 even though I have the 
> > option set that the editor is working with UTF-8 character set.
> >
> > iconv won't work either, I do the "iconv -f ASCII -t UTF-8 input_file -o 
> > output_file and the file remains ascii.
> >
> > Does anyone have an idea of how I can get TexStudio to wake up and change 
> > the file encoding on the current ascii file to UTF-8?
> >
> > I cannot get iconv to change the ascii file to UTF-8, so I am stuck between 
> > the devil and the deep blue sea.
> >
> > Randall
> >
> >
> >
> 

-- 
Loren M. Lang
[email protected]
http://www.north-winds.org/
IRC: penguin359


Public Key: http://www.north-winds.org/lorenl_pubkey.asc
Fingerprint: 7896 E099 9FC7 9F6C E0ED  E103 222D F356 A57A 98FA
