Re: [h-e-w] Processing chars above \200

Eli Zaretskii Fri, 21 Sep 2018 11:23:08 -0700

> Date: Fri, 21 Sep 2018 13:32:43 -0400
> From: John J. Xenakis <[email protected]>
> Cc: [email protected]
> 
> (defun 8bit ()
> "Test 8-bit characters"
> (let* (
>      (pos (point))  (NL "\n")
>      (char1 "\235")  (char2 "\220")
>      (pat1  "\235")  (pat2 "[\230-\237]")
>   )
>     (insert "This is a char: " char1 NL)
>     (insert "This is another char: " char2 NL)
>     (goto-char pos)
>     (query-replace-regexp pat1 "x")  ; replaces
>     (goto-char pos)
>     (query-replace-regexp pat2 "y")  ; does not work
> ))
> 
> Now, open a brand new empty file, and execute this macro.  The first
> replace works, but the second replace does not.  I don't know whether
> this is what's supposed to happen, but at least it doesn't work as I
> would expect.


After you execute this macro, if you go to the \235 or \220 characters
and type "C-x =", what do you see?  Does what Emacs says about these
raw bytes give you a hint regarding what is going on?
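
The same check can be sketched from Lisp instead of "C-x =".  This is
only an illustration, assuming a multibyte buffer (the default in
"emacs -Q"); the values in the comments are what I would expect there,
so evaluate the forms yourself to confirm.

```elisp
;; A string literal that contains only octal escapes below 256 is read
;; as a unibyte string, i.e. a sequence of raw bytes.
(multibyte-string-p "\235")                ; nil

;; Converted to multibyte, byte #x9D becomes a raw-byte character in
;; the `eight-bit' charset, not the character with code 157.
(let ((ch (aref (string-to-multibyte "\235") 0)))
  (list ch (char-charset ch)))             ; (4194205 eight-bit)

;; Inserting the unibyte string into a multibyte buffer performs the
;; same conversion, so the buffer holds #x3FFF9D rather than #x9D.
(with-temp-buffer
  (insert "\235")
  (char-before))                           ; 4194205
```

So what the buffer holds after the insertion is a raw-byte character
with a code far above 255, not a character whose code is \235.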

> OK, so here's the overall problem.  In the process of writing books
> and articles, I create text files with text from a variety of sources.
> The sources can include copy and paste from web sites, doc files, pdf
> files, and application windows, and can also include text generated by
> my scripts, usually in Perl or Java.

On what OS are you doing all that?  I assume Windows, but what
versions?  And what applications do you copy text from?

> I should mention that when I open a file, I use the coding system
> "windows-1252-dos."

That is probably wrong nowadays.  Since you seem to say your files are
full of raw bytes, you should use raw-text, not cp1252.  (That is the
fallback, in case we cannot solve your problem in a better way, one
where what you get in the buffer before saving it is actual non-ASCII
characters rather than raw bytes.  Depending on your answers to some
of my questions, maybe we could make that happen, unless you are
working with very old applications.)
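
For illustration, here is one way to visit a file as raw-text.  This
is a sketch only, and the command name my-find-file-raw is made up for
this example; it relies on the standard variable
coding-system-for-read.

```elisp
;; Hypothetical helper (the name is not part of Emacs): visit FILE with
;; decoding forced to raw-text, so non-ASCII bytes stay raw bytes.
(defun my-find-file-raw (file)
  "Visit FILE, reading it with the `raw-text' coding system."
  (interactive "fFind file as raw-text: ")
  ;; Binding `coding-system-for-read' overrides the usual detection
  ;; for the duration of this one `find-file' call.
  (let ((coding-system-for-read 'raw-text))
    (find-file file)))
```

Interactively, "C-x RET c raw-text RET C-x C-f" does the same for a
single command.  If the file is already visited in some buffer, kill
that buffer first, otherwise you just get the existing buffer back.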

> Sometimes emacs opens one of these text files, and magically decides
> that it's a "(Unix)" file.  This is a nightmare because then I have "^M"
> at the end of each line, and I can't get rid of them.  I've written a macro
> that replaces all ^M's with "", and that gets rid of them for a while,
> but they come back.  I've tried using utility programs to convert files
> to windows or unix or mac formats, and back again, but the problem is never
> fixed.

These are all signs of working with files whose encoding is
inconsistent.  Emacs employs some guesswork to decide what the
encoding is, but it examines only a small portion of the file before
it makes the guess, so inconsistent encoding can dupe it into making
the wrong decision.

> OK, you may be sorry you asked, but that's what I'm trying to do.

I'm not sorry, I actually guessed you have something like that on your
hands.

> What's the solution?

I'd start with "emacs -Q", and upgrade to Emacs 26 if you haven't
already.  I think you may have accumulated quite a few semi-correct
hacks trying to solve these problems, and those hacks are now biting
you.

In "emacs -Q", try copy/pasting text from the applications you care
about, and see what apps give you which problems, if any.  Then we
will try to solve those problems one at a time.

Your first problem with the kind of solution you are used to is that
you assume \220 etc. are raw 8-bit bytes everywhere you see them in
Emacs.  That assumption is false, as "C-x =" above shows you.  I
actually hope that you won't need any such replacements at all, but if
you do, we will get to how one should go about doing this safely.
