On Fri, Dec 26, 2025 at 09:04:53AM -0800, Ted Mittelstaedt wrote:
> Not really surprising.
>
> Copying and pasting into vim is a no-go because the distributors of vim
> decided when they coded up a rip-off of the actual 'vi' command to add
> in UTF-8 support - even though the entire command-line terminal
> environment that vim is used in - is really an ASCII environment NOT a
> UTF-8 environment.
Copying and pasting with full Unicode support works just fine in Vim when
things are configured correctly. I rely on this daily and don't usually
have to think about it anymore. The problem is that many programs like
Vim default to very legacy settings for compatibility if not configured
otherwise. Some distributions like Fedora will do a lot more to configure
and modernize the environment, whereas others like Arch and Slackware
expect the end user to configure everything themselves. I don't rely on
my distribution to configure anything: I keep all my core home-directory
rc files in a Git repository that I clone to every new system I work on.
It has all the customizations in .bashrc, .vimrc, .muttrc, etc. to ensure
everything is properly enabled, and its history goes back 20 years now,
to when I first took the effort to make sure everything defaults to UTF-8
even if I am stuck on a system configured for the C locale or another
legacy encoding.

> It's important to understand that vim IS NOT vi. This is a common
> misperception by newcomers to vi (which, in my personal option, is the
> greatest text editor ever invented)

I am just young enough to have been introduced to Vim directly, and my
only "vi" was Vim started in Vi-compatible mode in the early days. Well,
apart from a short stint playing on Solaris 2.6...

> I had intended you copy and paste into the GUI text editor that comes
> with Linux since you were copying and pasting from a web page - I had
> not assumed you were running the command-line version of a web browser
> :-) As you discovered notepadqq also supports the UTF-8 stuff but it at
> least understands when it writes out a textfile that "text" means ascii.

In my terminal environment, I am primarily working in UTF-8, not ASCII.
This requires two main things from the terminal. The first is having the
appropriate environment variables set, which means I have this in my
.bashrc (which is also sourced from my .bash_profile):

export LANG=en_US.UTF-8

The second is that my terminal, whether Gnome Terminal, Konsole, or
PuTTY, is configured for the UTF-8 character encoding. If I selected a
UTF-8 locale when installing Debian, then both of these items are
normally already taken care of.

With that, I just have to make sure that Vim is running in nocompatible
mode and has the following lines:

set fileencodings=ucs-bom,utf-8,default,cp1252
set encoding=utf-8
set fileformats=unix,dos

The first line means that if Vim detects a text file with a byte-order
mark leading the file (a common occurrence with files saved from Windows
Notepad), it will auto-convert to/from UTF-8 on load/save. If not, it
will try to load the file as UTF-8, and if that fails, it falls back to
the locale's default encoding and finally to Windows-1252. Windows-1252
is another legacy encoding, written by Windows Notepad when saving as
"ANSI"; it is essentially latin1 plus some extra typographic characters
such as curly quotes. This lets me load a lot of files without ever
thinking much about the encoding, and it just works. While working in the
editor, the buffer is UTF-8, and since my terminal is in UTF-8, it all
displays correctly. Vim also converts back on save. The last line just
lets it detect Unix or MS-DOS (Windows) line endings, which it
auto-converts back and forth on load/save.

As for copy and paste, as long as the terminal and Vim are in UTF-8, then
copy/paste should work as well. This is true both of the terminal's
built-in copy/paste and of X11 clipboard access. If you are running a
version of Vim compiled with X11 support in a terminal and your $DISPLAY
variable is set appropriately, then Vim can access the X11 clipboard by
copying/pasting through the "+ register. It will work with full Unicode
support as long as your X11 environment is properly configured for UTF-8.
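The detection order behind that 'fileencodings' line can be sketched in a
few lines of Python. This is only an illustration of the idea, not Vim's
actual algorithm; the locale-dependent "default" step is omitted for
simplicity, so the sketch goes BOM, then strict UTF-8, then cp1252:

```python
# Sketch of the fallback behind 'fileencodings=ucs-bom,utf-8,default,cp1252'.
# Illustration only; the "default" (locale) step is left out.
import codecs

def guess_and_decode(data: bytes) -> tuple[str, str]:
    """Return (encoding_used, text) for a file's raw bytes."""
    # 1. ucs-bom: a byte-order mark at the start of the file wins.
    for bom, enc in ((codecs.BOM_UTF8, "utf-8-sig"),
                     (codecs.BOM_UTF16_LE, "utf-16"),
                     (codecs.BOM_UTF16_BE, "utf-16")):
        if data.startswith(bom):
            return enc, data.decode(enc)
    # 2. utf-8: strict decode; any invalid byte sequence falls through.
    try:
        return "utf-8", data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # 3. cp1252: an 8-bit encoding, so this step succeeds for the bytes
    #    Notepad's "ANSI" mode actually produces.
    return "cp1252", data.decode("cp1252")
```

This mirrors why the order matters: a strict UTF-8 decode rejects most
legacy 8-bit files, so putting utf-8 before cp1252 is safe, while the
reverse order would swallow everything as cp1252.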
As a test, I put together a little webpage, writing it in Vim inside a
terminal over a remote SSH connection to a server and using SSH X11
forwarding to paste from the clipboard of the X11 session on my laptop. I
found a wide variety of Unicode characters and emoji, and they pasted
just fine into Vim and also displayed correctly inside the terminal
session. I then saved the HTML webpage, made sure that it properly
declared itself as UTF-8 to the browser, and the page rendered in Firefox
as expected.

As a further test, I also opened that same webpage in all four of the
classic text-mode browsers on Debian, with mostly good results. The
elinks and lynx browsers rendered nearly everything except for the emoji
that use a special joiner control character to combine two emoji into one
symbol; nearly everything else worked, right down to the colorized emoji
font that I have installed and that is accessible to my terminal. links
and links2 had more trouble and did not render most of the more exotic
characters, but they still rendered the simple arrow symbols and accented
characters. This is just a limitation of the Unicode support they have
built in.
If you want to take a look yourself, here's my test page:

http://www.north-winds.org/emoji.html

> There's an interesting discussion of the conversion problems here with
> some suggestions you could use at the command line:
>
> https://unix.stackexchange.com/questions/171832/converting-a-utf-8-file-to-ascii-best-effort

There are some interesting suggestions there if you really have to deal
with an environment that can't handle non-ASCII text.

> One commentor recommended this program:
>
> https://manpages.ubuntu.com/manpages/jammy/man1/konwert.1.html
>
> I know that this is going to sound terribly privileged and nationalistic
> but the fact is that the UNIX operating system was invented in the
> United States not in any other country, and the simple reality is that
> every other country has had the same access to electronics knowledge and
> scientific information since the invention of the vacuum tube - but
> every other government and culture on the fact of the Earth pretty much
> didn't value any of that "tech stuff" until AFTER us Americans invented
> it. And NOW, they all want a piece of the action. Well OK maybe if they
> all had valued open information, the free exchange of ideas, scientific
> advancement, much more than they valued dictatorial socio-religious crap
> used to tell people what to do and how to live and who to screw, then
> MAYBE they would have gotten to the digital age FIRST and then maybe us
> Americans would have to learn Chinese if we wanted to write software.
> (there's a reason the Americans using stone knives and bearskins made it
> to the moon and back and the Chinese today even though they manufacture
> tech that would knock 1969 NASA tech into a cocked hat - still haven't
> made it there) Get me drift, here?

Indeed, I will agree that the entire attempt to add multi-byte encodings
to POSIX has been a big hack, but it's a hack that's now over 30 years
old and pretty well understood at this point.
My favorite example is the first double-byte encodings, like EUC-JP. They
have a mix of single- and double-byte characters which can be properly
understood when reading left to right. One of the tricks was that all
double-byte characters are also displayed in a double-wide font. This
both helps the many terminal applications that use strlen() to count the
length of a string when determining how much screen real estate it will
take up, and allows for the more complex rendering needed for those
Japanese characters. It very much feels like a hack, but such is life.

> UTF-8 was tacked on to UNIX as a way of accommodating the rest of the
> world who frankly couldn't give a tinker's damn about the digital age -
> until we Americans started kicking their butts with it. So it's NEVER
> going to be completely fully integrated into the Linux experience the
> way ASCII is. If, you, Randall, are stuck having to deal with that
> interface of American computing to rest of the world computing - your
> going to always have to deal with this fundamental mismatch.

UTF-8 is even trickier because a character can be anywhere from 1 to 4
bytes long even though it only takes up one character cell on the screen,
so it is now imperative to use functions like wcswidth() to find the
needed screen real estate instead of counting bytes the way strlen()
does; any effort beyond the legacy 8-bit character sets requires extra
work to implement. Still, UTF-8 is actually quite clean in several ways
for such a large character set. For one, it's much easier to go
backwards, such as when implementing backspace support. A byte with a
leading 0 bit is a valid single-byte US-ASCII character in the range 0x00
to 0x7f. A byte with two leading 1 bits, in other words 0xc0 to 0xff, is
the first byte of a multi-byte sequence. In fact, its precise range tells
you immediately whether this is a 2-, 3-, or 4-byte sequence, so you can
skip forward if needed.
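Both properties described above, EUC-JP's high-bit double-byte pairs and
UTF-8's easy walk-backwards structure, can be checked in a few lines of
Python (a sketch of mine, not anything from the Vim or libc sources; it
assumes Python's built-in euc_jp codec):

```python
# Sketch: UTF-8 backspace by byte scanning, plus the EUC-JP byte layout.

def utf8_backspace(buf: bytes) -> bytes:
    """Remove the last character from a UTF-8 byte buffer.

    Continuation bytes are 0x80-0xbf (bit pattern 10xxxxxx): strip those,
    then strip one more byte, the lead byte (0xc0-0xff) or a plain
    ASCII byte (0x00-0x7f).
    """
    i = len(buf)
    while i > 0 and 0x80 <= buf[i - 1] <= 0xbf:
        i -= 1                        # skip continuation bytes
    return buf[:max(i - 1, 0)]        # drop the lead/ASCII byte itself

# EUC-JP: ASCII stays single-byte, Japanese becomes high-bit byte pairs,
# so a left-to-right scan can always tell the two cases apart.
raw = "Aあ".encode("euc_jp")
assert raw[0] < 0x80                    # 'A' is one plain ASCII byte
assert all(b >= 0x80 for b in raw[1:])  # 'あ' is two high-bit bytes

# UTF-8: backspace works without decoding the whole string, even for a
# 4-byte emoji at the end of the buffer.
assert utf8_backspace("héllo🎉".encode("utf-8")).decode("utf-8") == "héllo"
```

Note that no decoding is needed to delete the emoji: the scan only looks
at the top bits of each byte, which is exactly the self-synchronizing
property that legacy multi-byte encodings lack.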
All later bytes in such a sequence have a leading 1 followed by a 0, or
0x80 to 0xbf in hex. To backspace, if the right-most byte is in the 0x80
to 0xbf range, keep deleting bytes until you find the 0xc0 to 0xff byte
and remove that as the final byte of the backspace. Many other legacy
multi-byte encodings have more complicated rules.

> What I find most interesting in all of this is that the tech types in
> the REST of the world fully accept this - THEY are NOT in general the
> ones complaining about the second-class citizen status of UTF-8. They
> know that they came second, they know they came in second because the
> majority of people in their culture don't value freedom of choice, and
> all that other stuff needed for scientific advancement, and they accept
> that their native languages play second fiddle to ASCII. They type "rm"
> and "ls" and all the other ASCII commands in UNIX/Linux without
> complaint, and they generally don't have a problem spending time on this
> conversion stuff...it's us Americans who are mostly bitching and
> complaining about it...not realizing that we won the digital war,
> here.... (hell, even Linus Torvalds gave up his Finnish citizenship and
> became a US citizen, that really ought to tell you something)
>
> Ted
>
> -----Original Message-----
> From: PLUG <[email protected]> On Behalf Of American Citizen
> Sent: Thursday, December 25, 2025 1:33 PM
> To: [email protected]
> Subject: Re: [PLUG] Ascii versus UTF-8 woes
>
> Ted:
>
> I am using vim, but when I attempt to write the UTF-8 file which I saved
> from the internet browser cut and paste command, into ascii format, vim
> fails with a curious error
>
> vim command:
>
> :write ++enc=ASCII my_ascii_file.txt
>
> I get the following error:
>
> "my_ascii_file.txt" E513: Write error, conversion failed (make 'fenc'
> empty to override)
> WARNING: Original file may be lost or damaged don't quit the editor
> until the file is successfully written!
> Press ENTER or type command to continue
>
> And trying to internally set the values of encoding and file encoding
> seems to work
>
> :set encoding=ascii
> :set fileencoding=ascii
>
> except when you double check the encoding, it stays at utf-8
>
> but the fileencoding appears to be changed to the new value=ascii
>
> But then when you attempt to overwrite the file or write to a new file,
> vim throws errors again
>
> "new_file.txt" E513: Write error, conversion failed (make 'fenc' empty
> to override)
> WARNING: Original file may be lost or damaged don't quit the editor
> until the file is successfully written!
> Press ENTER or type command to continue
>
> So I am unable to get linux vim version 9.1.83 to work to change the
> encoding.
>
> I had to actually use notepadqq to paste the browser text and then set
> the encoding to ascii and this seems to work.
>
> I suppose you could pipe the file and let tr strip off the non-ascii
> characters ??? But this means going back in and manually comparing the
> two files, to see how to fix the omitted characters (if possible)
>
> TexStudio crashed mysteriously when I turned off its internal file
> scanning so I had to set the option again.
>
> Supposedly there is some tex sty code which allows UTF-8 to be used in a
> tex file. And yes, my editor settings under TexStudio IS UTF-8
>
> I already have used up at least an hour of time on this problem as iconv
> doesn't really change a pure ascii file into a UTF-8 file and vim was
> failing me.
>
> Randall
>
> On 12/25/25 11:28, Ted Mittelstaedt wrote:
> > Open the regular textedit, paste into there, save, open the saved file
> > in TexStudio
> >
> > Ted
> >
> > -----Original Message-----
> > From: PLUG <[email protected]> On Behalf Of American
> > Citizen
> > Sent: Wednesday, December 24, 2025 7:40 PM
> > To: Portland Linux/Unix Group <[email protected]>
> > Subject: [PLUG] Ascii versus UTF-8 woes
> >
> > Hi:
> >
> > I have a set of tex files which are in pure ascii format.
> > Unfortunately when I copy material from the internet (Mozilla Firefox
> > browser) it is in UTF-8 format, not ascii. This appears to be standard
> > behavior for the internet browsers.
> >
> > When I paste the material into the tex document (using TexStudio) the
> > paste goes okay. It only blows up when I try to save the newer file.
> > The UTF-8 characters cannot be saved in ascii format and for some
> > bizarre reason Tex Studio wont' change the encoding to UTF-8 even
> > though I have the option set that the editor is working with UTF-8
> > character set.
> >
> > iconv won't work either, I do the "iconv -f ASCII -t UTF-8 input_file
> > -o output_file and the file remains ascii.
> >
> > Does anyone have an idea of how I can get TexStudio to wake up and
> > change the file encoding on the current ascii file to UTF-8?
> >
> > I cannot get iconv to change the ascii file to UTF-8, so I am stuck
> > between the devil and the deep blue sea.
> >
> > Randall

-- 
Loren M. Lang
[email protected]
http://www.north-winds.org/
IRC: penguin359
Public Key: http://www.north-winds.org/lorenl_pubkey.asc
Fingerprint: 7896 E099 9FC7 9F6C E0ED E103 222D F356 A57A 98FA
