> Date: Fri, 21 Sep 2018 13:32:43 -0400
> From: John J. Xenakis <hew...@jxenakis.com>
> Cc: jxenakis...@gmail.com
>
> (defun 8bit ()
>   "Test 8-bit characters"
>   (let* (
>          (pos (point)) (NL "\n")
>          (char1 "\235") (char2 "\220")
>          (pat1 "\235") (pat2 "[\230-\237]")
>         )
>     (insert "This is a char: " char1 NL)
>     (insert "This is another char: " char2 NL)
>     (goto-char pos)
>     (query-replace-regexp pat1 "x") ; replaces
>     (goto-char pos)
>     (query-replace-regexp pat2 "y") ; does not work
>   ))
>
> Now, open a brand new empty file, and execute this macro. The first
> replace works, but the second replace does not. I don't know whether
> this is what's supposed to happen, but at least it doesn't work as I
> would expect.

After you execute this macro, if you go to the \235 or \220 characters
and type "C-x =", what do you see? Does what Emacs says about these
raw bytes give you a hint regarding what is going on?

> OK, so here's the overall problem. In the process of writing books
> and articles, I create text files with text from a variety of sources.
> The sources can include copy and paste from web sites, doc files, pdf
> files, and application windows, and can also include text generated by
> my scripts, usually in Perl or Java.

On what OS are you doing all that? I assume Windows, but what
versions? And what applications do you copy text from?

> I should mention that when I open a file, I use the coding system
> "windows-1252-dos."

That is probably wrong nowadays. Since you seem to say your files are
full of raw bytes, you should use raw-text, not cp1252. (That is, if
you cannot resolve your problem in a better way, so that what you get
in the buffer before saving it is not raw bytes, but actual non-ASCII
characters. Given your answers to some of my questions, maybe we
could make that happen, unless you are working with very old
applications.)

> Sometimes emacs opens one of these text files, and magically decides
> that it's a "(Unix)" file. This is a nightmare because then I have "^M"
> at the end of each line, and I can't get rid of them. I've written a macro
> that replaces all ^M's with "", and that gets rid of them for a while,
> but they come back. I've tried using utility programs to convert files
> to windows or unix or mac formats, and back again, but the problem is never
> fixed.

These are all signs of working with files with inconsistent encoding.
Emacs employs some guesswork to decide what the encoding is, but it
only examines a small portion of the file before it makes the guess,
so inconsistent encoding can dupe it into making the wrong decisions.

> OK, you may be sorry you asked, but that's what I'm trying to do.

I'm not sorry; I actually guessed you have something like that on your
hands.

> What's the solution?

I'd start at "emacs -Q", and upgrade to Emacs 26 if you haven't
already. I think you may have accumulated quite a few semi-correct
hacks trying to solve these problems, and those hacks are now biting
you. In "emacs -Q", try copy/pasting text from the applications you
care about, and see which apps give you which problems, if any. Then
we will try to solve those problems one at a time.

Your first problem with the kind of solution you are used to is that
you assume \220 etc. are raw 8-bit bytes everywhere you see them in
Emacs. That assumption is false, as "C-x =" above shows you. I
actually hope that you won't need any such replacements at all, but if
you do, we will get to how one should go about doing this safely.
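
To make the point about raw bytes concrete, here is a minimal sketch
that can be evaluated in "emacs -Q" (for example with "M-:" or in
*scratch*). The helper name `my-describe-raw-byte' is hypothetical and
exists only for this illustration; the three functions it calls are
standard Emacs Lisp. It shows how Emacs represents a raw 8-bit byte
as a character once it is in a multibyte buffer or string.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-describe-raw-byte (byte)
  "Return (CHAR CHARSET BYTE-AGAIN) for the raw 8-bit BYTE.
CHAR is the character code Emacs uses for BYTE in multibyte text,
CHARSET is the charset Emacs assigns to that character, and
BYTE-AGAIN is CHAR converted back to a single byte."
  (let ((ch (unibyte-char-to-multibyte byte)))
    (list ch
          (char-charset ch)
          (multibyte-char-to-unibyte ch))))

;; #o235 is the same octal value as the "\235" escape in the macro.
(my-describe-raw-byte #o235)
```

The first element of the result should be a character code far
outside the 0..255 range, and the second should be the `eight-bit'
charset, which is what "C-x =" reports as a raw byte. Only the last
element is the original byte value 157 (octal 235).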