On 5/28/13 1:01 PM, Arthur Reutenauer wrote:
I am trying to analyze what TeX produces (dumping the contents of
tex.hashtokens() by the way).
Oh, so that was it ... Well, I can say with confidence that there
were probably only three to five people in the world who had any chance
of understanding what you meant by "hash tokens", in your original
email, and none of them is contributing to this discussion (but one of
them definitely is subscribed to this list ;-)
I assumed it was clear, because I posted to luatex asking about hash
tokens. Evidently I misled others! :)
So part of what I said earlier doesn't apply, you really are looking
at TeX's hash table. This is yet different, and happens at a very low
level. The documentation for that is in tex.web, and the change files
for the different extensions. You're not really doing yourself a favour
by starting with LuaTeX; better to start with Knuth's TeX, in my
opinion. Its source code, along with the comments, actually is
published as a book.
Good to know. I thought I could start by getting a low-level
impression, the same way I do when I list the symbols in an object
file and then take a look at the disassembly.
So I am only looking at the tokens produced by TeX, feeding a LaTeX
file: I know what LaTeX does, but since it uses TeX as an engine, I
wanted to know what TeX does with my document structure (labels,
chapters, floats, bibliographies, ...).
Which is absurd: shouldn't you look at the source code of *LaTeX*
first, before looking at a dump of TeX's memory? It's almost like you
want to be confused.
That is not my purpose. And by the way, yes, I do look at the memory when
trying to figure out how a piece of software works (it's part of my
job), especially when I have to assume that the source code is not available.
As before, I just dump to file what tex.hashtokens() contains. I can
attach the file if needed.
Yes, obviously, we need the source file. Did you really imagine that
we could say anything substantial about random bits of TeX's memory
without knowing what the input was?
I attached parts of it, since the symbol table is 70K.
===BEGIN===
sffamily
^A
tracingoutput
<=== THIS IS A TAB
^H
^K
macc@palette
^M
^L
^N
@currdir
makesm@sh
pdftrue
?\textless
@@MP:P:curveto
^Y
^[
^Z
luatexUroot
!
<=== THIS IS A SPACE
====END====
OK, so you meant white space. Blank is indeed a misleading word to
call these strings. Yes, there may be white space. Why does it bother
you? "\ " actually is a pretty common user-level command of TeX.
Because it's new to me. I thought of the backslash as escaping in C: a
kind of protection of the next character, as in \% (the same way \"
works in C).
===BEGIN===
pagecolor
�, <=== THIS IS A WEIRD ONE
skipemptyMPgraphictrue
====END====
With a hex editor, I find that the second line is EF BF BF 2C.
This is perfectly valid UTF-8, it's the byte sequence for two
characters: U+FFFF and U+002C. The former is a Unicode noncharacter that
is not supposed to be used in files (the character usually substituted
for an invalid one is U+FFFD, encoded as EF BF BD), and the latter is
simply a comma.
Yes, I saw that once I opened the hash file with a hex editor. I knew
TeX didn't have support for Unicode, and I thought that lualatex
translated its input into TeX, which then produced the output. So a
Unicode string was unexpected, and I thought I had messed up my dump code.
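As a quick check of the byte sequence discussed above, here is a minimal Python sketch (Python is only a convenient choice for illustration; it is not used anywhere in this thread) that decodes EF BF BF 2C as UTF-8 and lists the resulting code points. The variable names are made up for the example.

```python
# Decode the four bytes seen in the hex editor and list their code points.
raw = bytes.fromhex("EF BF BF 2C")
text = raw.decode("utf-8")   # strict decoding: raises on invalid UTF-8
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)           # ['U+FFFF', 'U+002C']
```

The decode succeeds without errors, which agrees with the statement that the sequence is well-formed UTF-8 for two characters, the second being a comma.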
It seems to me that TeX is using a very low level encoding, which I
find again weird (or wrong, in the sense that I don't know how to
correctly dump the tokens).
You may have dumped the tokens correctly, there is a lot of low-level
stuff in TeX. What's surprising to me is that you find it weird!
Pardon me, but I'm used to writing code in C, assembly, C++, or whatever
other programming language (mainly those three, in that order). TeX is
very, very different.
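For what it's worth, one way to make such a dump less confusing is to render non-printable bytes explicitly before writing the names to a file. The following is only a sketch under assumptions: it supposes each hash entry is available as raw bytes, and `visible` is a hypothetical helper, not part of LuaTeX or of the dump code used here. The sample entries are taken from the excerpts quoted earlier.

```python
def visible(name: bytes) -> str:
    """Render a raw hash entry with non-printable bytes made explicit."""
    out = []
    for b in name:
        if b < 0x20:
            out.append("^" + chr(b + 0x40))   # 0x01 -> ^A, 0x09 (tab) -> ^I
        elif b == 0x20:
            out.append("<SP>")                # make a lone space visible
        elif b == 0x7F:
            out.append("^?")                  # DEL in caret notation
        elif b < 0x7F:
            out.append(chr(b))                # printable ASCII, kept as-is
        else:
            out.append(f"<{b:02X}>")          # non-ASCII byte shown in hex
    return "".join(out)

# Sample entries: a normal name, ^A, a tab, a space, and the "weird" one.
for entry in (b"sffamily", b"\x01", b"\t", b" ", b"\xef\xbf\xbf,"):
    print(visible(entry))
```

With this kind of rendering, the tab, the space and the EF BF BF prefix each show up unambiguously on their own line instead of as apparently blank or garbled entries.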
Yes, I imagined it was related to the Narnian way of encoding fonts,
but I don't know how it encodes it (I found a document by Rahtz on
TUG, but I see no mention of "<>").
Look again, then. The long string you quoted (<5><6> etc.) clearly is
the fifth argument to \DeclareFontShape, one of the standard NFSS
commands. It's part of LaTeX2e and is documented in several places,
for example the LaTeX Companion, or, for a free resource,
doc/latex/base/fntguide.pdf in most TeX distributions.
Good!
You don't have this interest, it's ok, but I really do! I like to
know how something works! ;)
You're missing the point. Producing \r@something is *one of the many*
things that happens when you type \label{something}; it's probably the
control sequence whose name is most obviously related to the label you
created, but there is nothing special about that particular control
sequence. That's why I remarked that it's not an interesting fact, and
you probably wouldn't have noticed it, had it not been for your biased
approach of looking at static memory dumps.
Far more interesting are the different commands defined by LaTeX when
\label is called, look for "ltxref.dtx" in latex.ltx. The letter "r"
(in \r@something) is introduced in a macro called \newlabel (line 3881
of my copy of latex.ltx), and "@" in \@newl@bel, one line above it.
That is awesome, I now have a place to start!
Anyway, at some point there *is* a static version of a code somewhere,
otherwise there would be no output. Yes, I am biased by my job and
education, but I find it hard to grasp the opposition to this approach.
You look "top down", I use the "bottom up" approach :)