On 30 Apr 2010, at 11:09, José Carlos Santos wrote:
> 
> Please consider again this file:
> 
> \documentclass[10pt,a4paper]{book}
> \usepackage[frenchb]{babel}
> \usepackage{fontspec}
> \usepackage{xunicode}
> \usepackage{xltxtra}
> \begin{document}
> \frontmatter
> \tableofcontents
> \XeTeXinputencoding "cp1252"
> \XeTeXdefaultencoding "cp1252"
> \mainmatter\setcounter{secnumdepth}{2}
> \chapter{Général de Gaulle}
> Il était français.
> \end{document}
> 
> When the \tableofcontents command is found, the line
> 
> \XeTeXinputencoding "cp1252"
> 
> was not read yet. Therefore, it seems to me (since XeTeX is unicode-based) 
> that the file .toc is in unicode and therefore that XeLaTeX should have no 
> problem with it.

There are a couple of issues that make this trickier than it might initially 
appear.

First, it's important to be aware that xetex *always* writes auxiliary files 
using utf-8, regardless of the \XeTeXdefaultencoding setting, which only 
affects how files are read. There is no facility to make \write use a 
different encoding. (Perhaps the command should have been called 
\XeTeXdefaultinputencoding, but it's already pretty long!)

So when xetex writes the .aux and .toc files, it will take the internal 
character codes of your text and encode them in utf-8 form. If you look at the 
.aux file that is generated from your example, you'll see that the cp1252 
characters in the input have been converted to Unicode and then represented as 
utf-8 byte sequences in that file.

You are correct in expecting that because you've put \tableofcontents before 
the \XeTeXdefaultencoding command, the .toc file should be read as utf-8. 
However, if you examine the .toc file, you'll find that it does NOT contain the 
expected utf-8 version of "Général". Why is this?

The answer lies in how LaTeX creates the new .toc file during a run. It does 
NOT write the TOC entries directly into the .toc file during the run; if it 
did, this would have worked OK -- they'd be written in utf-8, and read as utf-8 
by your \tableofcontents. But if you observe the terminal output during a 
(xe)latex run, you'll see that at the end of the document, it reads the .aux 
file as an input. What's happening is that during the run, the chapters, 
sections, etc., are written to the .aux file (in utf-8). Then, at the end, the 
.aux file is closed, then read as an input, and the relevant information 
written to the .toc (and perhaps other files such as .lof and .lot, if you're 
using those features).
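
To make this concrete, the entry that goes into the .aux file for the chapter 
in your example looks roughly like the line below. This is an illustrative 
sketch of the standard LaTeX mechanism, not a copy of the actual file: the 
exact arguments vary with the LaTeX version and the packages loaded, and babel 
adds lines of its own.

```latex
% Illustrative .aux entry, written by xetex in utf-8.
% The exact form shown here is an assumption; check your own .aux file.
\@writefile{toc}{\contentsline {chapter}{\numberline {1}Général de Gaulle}{1}}
```

When the .aux file is read back at \end{document}, \@writefile copies its 
second argument into the .toc file, and that is the step where the encoding in 
force at that moment matters.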

The problem is that at this point, the .aux file is read *with* your 
\XeTeXdefaultencoding declaration in force, so the individual utf-8 bytes that 
were written to it now get interpreted as cp1252 characters and mapped to their 
Unicode values, instead of the byte sequences being interpreted as utf-8. 
That's the source of the "junk" you're getting. Those 
utf-8-bytes-interpreted-as-cp1252 then get re-encoded to utf-8 sequences as the 
.toc is written, so in effect the original characters have been "doubly 
encoded".
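
The effect can be seen in isolation with a small test file. This is only a 
sketch: the stream name \demo and the .tmp extension are arbitrary choices made 
for the illustration, and the test file itself has to be saved as utf-8. It 
writes a word with \write and then reads it back twice, first with cp1252 as 
the default encoding and then with utf-8.

```latex
% Save this file as utf-8 and compile it with xelatex.
\documentclass{article}
\usepackage{fontspec}
\begin{document}
\newwrite\demo                          % hypothetical stream name
\immediate\openout\demo=\jobname.tmp    % scratch file; the name is arbitrary
\immediate\write\demo{Général}          % \write output is always utf-8
\immediate\closeout\demo

\XeTeXdefaultencoding "cp1252"          % applies to files opened from now on
\input{\jobname.tmp}                    % utf-8 bytes interpreted as cp1252

\XeTeXdefaultencoding "utf-8"
\input{\jobname.tmp}                    % the same bytes interpreted as utf-8
\end{document}
```

The first \input should typeset "GÃ©nÃ©ral": in utf-8, é is the two bytes 
C3 A9, and in cp1252 those bytes are the characters Ã and ©. The second \input 
should give "Général" again.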

In this particular case, at least, you can work around the problem by resetting 
the default encoding immediately before the end of the document, so that when 
LaTeX reads in the .aux file at the end of the run, it reads it correctly as 
utf-8. In other words, if you modify this example to become:

  \documentclass[10pt,a4paper]{book}
  \usepackage[frenchb]{babel}
  \usepackage{fontspec}
  \usepackage{xunicode}
  \usepackage{xltxtra}
  \begin{document}
  \frontmatter
  \tableofcontents
  \XeTeXinputencoding "cp1252"
  \XeTeXdefaultencoding "cp1252"
  \mainmatter\setcounter{secnumdepth}{2}
  \chapter{Général de Gaulle}
  Il était français.
  \XeTeXdefaultencoding "utf-8"
  \end{document}

then your table of contents should correctly show "Général".
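
A variant of the same idea, for anyone who would rather not depend on an extra 
line at the very end of the file, is to put the reset into LaTeX's 
\AtEndDocument hook in the preamble. This is an untested sketch, but the hook 
is executed before the .aux file is closed and read back, so it should have the 
same effect:

```latex
% In the preamble: reset the default encoding when \end{document} is reached,
% before LaTeX re-reads the .aux file.
\AtEndDocument{\XeTeXdefaultencoding "utf-8"}
```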

However, there may be other situations where auxiliary files are written and 
read at unpredictable times during the processing of the document, making it 
more difficult to control the encodings at the right moments. In general, 
moving to an entirely utf-8 environment is a better and more robust way forward.
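
For this example, that would mean re-saving the source file as utf-8 (how you 
do that depends on your editor) and dropping both encoding commands. A sketch 
of the result, otherwise unchanged from your file:

```latex
% This file is saved as utf-8, so neither \XeTeXinputencoding nor
% \XeTeXdefaultencoding is needed.
\documentclass[10pt,a4paper]{book}
\usepackage[frenchb]{babel}
\usepackage{fontspec}
\usepackage{xunicode}
\usepackage{xltxtra}
\begin{document}
\frontmatter
\tableofcontents
\mainmatter\setcounter{secnumdepth}{2}
\chapter{Général de Gaulle}
Il était français.
\end{document}
```

With everything in utf-8, the source, the .aux file and the .toc file all use 
the same encoding, so it no longer matters when each of them is read.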

HTH,

Jonathan




