Re: [XeTeX] Anchor names

2011-11-06 Thread Ross Moore
Hi Heiko, and Akira,

On 06/11/2011, at 3:55 AM, Heiko Oberdiek wrote:

>   \special{%
> pdf:ann width 4bp height 2bp depth 2bp<<%
>   /Type/Annot%
>   /foo/ab#abc
>   /Subtype/Link%
>   /Border[0 0 1]%
>   /C[0 0 1]% blue border
>   /A<<%
> /S/GoToR%%
> /F(t.tex)%
> /D<66f6f8>% 
> % Result: <66f6f8>, but ** WARNING ** Failed to convert input 
> string to UTF16...
> % /D%
> % Result: 
>>>%
>   >>%
>   }%

I've verified that this is indeed what happens, with 

  This is XeTeX, Version 3.1415926-2.2-0.9997.4 (TeX Live 2010)


Now looking at the source coding, at:

   
http://ftp.tug.org/svn/texlive/trunk/Build/source/texk/xdvipdfmx/src/spc_pdfm.c?diff_format=u&view=log&pathrev=13771

it is hard to see how those results can occur.

The warning message is only produced when the function

   maybe_reencode_utf8(pdf_obj *instring)

returns a negative value (e.g. -1);
viz. lines 571--578:   function:  modstrings

>>>   }
>>>   else {
>>> r = maybe_reencode_utf8(vp);
>>>   }
>>>   if (r < 0) /* error occured... */
>>> WARN("Failed to convert input string to UTF16...");
>>> }
>>> break;

or  lines 1145--1150  (for  pdf:dest  but not actually used here)

>>> #ifdef  ENABLE_TOUNICODE
>>>   error = maybe_reencode_utf8(name);
>>>   if (error < 0)
>>> WARN("Failed to convert input string to UTF16...");
>>> #endif
>>> array = parse_pdf_object(&args->curptr, args->endptr, NULL);


Now that function should find only ASCII bytes in  '<66f6f8>'
and  '' .
In both cases the string should have remained silently unmodified.

viz. lines 474--481, function:  maybe_reencode_utf8

>>>   /* check if the input string is strictly ASCII */
>>>   for (cp = inbuf; cp < inbuf + inlen; ++cp) {
>>> if (*cp > 127) {
>>>   non_ascii = 1;
>>> }
>>>   }
>>>   if (non_ascii == 0)
>>> return 0; /* no need to reencode ASCII strings */


What am I reading wrong? If anything.

Has there been an earlier de-coding of  <>  hex-strings
into byte values, done either by XeTeX or xdvipdfmx?
If so, then surely it is this which is unnecessary.
(Not XeTeX, since the string is correct in the .xdv file.)

e.g.  function  pst_string_parse_hex   in  pst_obj.c  seems
to be doing this.  But that is only supposed to be used with
code from  cmap_read.c  and  t1-load.c .
And these are only meant for interpreting the font data that goes 
into content streams. So I'm at a loss in understanding this.

But  'modstrings'  is applied recursively, and part of it
seems to be checking for a CMap (when appropriate?).
So maybe there is an unintended un-encoding that precedes 
an encoding?


> 
> It seems that *all* literal strings are affected by the
> unhappy reconversions. But the PDF specification lets no choice,
> there are various places for byte strings.
> In the example, if the file name is byte string XY and the destination Z,
> then the file name is XY and the destination is Z and nothing else. Otherwise
> neither the file nor the destination will be found.
> 
> Thus either (XeTeX/)xdvipdfmx finds a way for specifying arbitrary
> byte strings (at least for PDF strings(/streams)) -- it is a
> requirement of the PDF specification. Or we have to conclude 
> that 8-bit is not supported and that means US-ASCII.
> 
> Yours sincerely
>  Heiko Oberdiek


Hope this helps --- or you can help me  :-)


Cheers,

Ross


Ross Moore   ross.mo...@mq.edu.au 
Mathematics Department   office: E7A-419  
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia  2109  fax: +61 (0)2 9850 8114







--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] Anchor names

2011-11-05 Thread Heiko Oberdiek
On Sat, Nov 05, 2011 at 04:14:03PM +, Jonathan Kew wrote:

> > Thanks Akira. But caution, it could break bookmark strings that
> > currently work more or less accidentally, sometimes with warnings.
> 
> IIRC (it's a while since I looked at any of this), I believe Unicode
> bookmark strings work deliberately (not accidentally) - I think this came
> up early on as an issue, and encoding-form conversion was implemented to
> ensure that it works. (It's possible there are bugs, of course, but it was
> _supposed_ to work!)

The bookmark stuff suffers from the same main problem of arbitrary byte
strings. For example, in hyperref, hxetex.def is the only driver where
I had to disable PDFDocEncoding.

> Yes, PDF is a binary format; xetex was not designed to write PDF. It
> writes its output as XDV - also a binary format, of course, but a very
> specific one designed for this purpose - and XDV provides an extension
> mechanism that involves writing "special" strings that a driver is
> expected to understand. The key issue is that the "special" strings xetex
> writes are Unicode strings, not byte strings.

As long as the \special supports a syntax that is free from "big chars"
this is not a problem. Example: PDF strings specified in hex form <...>.
If then the unhexed byte string is kept without reencoding, then
the problem would be solved, for instance. Thus XeTeX can be left
unchanged, the problem can be solved entirely in the driver:
* Providing a special syntax, where arbitrary byte stuff can be
  specified in us-ascii (hex form or other escape mechanisms).
* Further byte-string processing without enforcing a reencoding
  to a particular encoding.

> >> It is a tool for processing
> >> text, and is specifically based on Unicode as the encoding for text, with
> >> UTF-8 being its default/preferred encoding form for Unicode, and (more
> >> importantly) the ONLY encoding form that it uses to write output files.
> >> It's possible to READ other encoding forms (UTF-16), or even other
> >> codepages, and have them mapped to Unicode internally, but output is
> >> always written as UTF-8.
> >> 
> >> Now, this should include not only .log file and \write output, but also
> >> text embedded in the .xdv output using \special. Remember that \special
> >> basically writes a sequence of *characters* to the output, and in xetex
> >> those characters are *Unicode* characters. So my expectation would be that
> >> arbitrary Unicode text can be written using \special, and will be
> >> represented using UTF-8 in the argument of the xxxN operation in .xdv. 
> > 
> > That means that arbitrary bytes can't be written using \special,
> > a restriction that does not exist in vanilla TeX.
> 
> That's correct. Perhaps regrettable, but that was the design. The argument
> of \special{} is ultimately represented, after macro expansion, etc,
> as (Unicode) text, and Unicode text != arbitrary bytes.

I don't criticize the way \special works with big chars. Of course, these
have to be encoded somehow to get byte data for storing in the .xdv
format. Arbitrary byte data can be encoded in many different
ways (hex, ascii85, \ooo, ...) to fit into a us-ascii string. This
way the restriction of XeTeX's \special does not matter at all.
The driver would then decode the string to get the byte string.
  The syntax for encoding arbitrary bytes is partially already present
(<>-hex-notation for strings), but partially missing (stream data).
And the main problem remains: the decoded string gets reencoded and the
binary data is destroyed in the process.

> >> If
> >> that \special is destined to be converted to a fragment of PDF data by the
> >> xdv-to-pdf output driver (xdvipdfmx), and needs a different encoding form,
> >> I'd expect the driver to be responsible for that conversion.
> > 
> > Suggestions for some of PDF's data structures:
> > 
> > * Strings: It seems that both (...) and the hex form <...> can be
> >  used. In the hex form spaces are ignored, thus a space right
> >  after the opening angle could be used for a syntax extension.
> >  In this case the driver unescapes the hex string to get the
> >  byte string without reencoding to Unicode.
> >  Example:
> >  \special{pdf:dest < c3a46e6368c3b872> [...]}
> >The destination name would be "änchør" as byte string in UTF-8.
> >  \special{pdf:dest < e46e6368f872> [...]}
> >The destination name would be "änchør" as byte string in latin1.
> 
> I don't understand this proposal. How can you (or rather, a driver) tell
> which encoding is the intended interpretation of an arbitrary sequence of
> byte values?

The byte string data type of PDF doesn't have an encoding at all.
Applying an encoding is wrong in the first place and destroys the
data.

The conversion of the special's UTF-8 strings to PDFDocEncoding/UTF-16BE
would be an additional service of the driver for *text strings*. But then
the driver has to *know* the string type of a given string
(text string, binary string, ascii string, string) to decide where

Re: [XeTeX] Anchor names

2011-11-05 Thread Akira Kakuto
Dear Jonathan, Heiko,

> IIRC (it's a while since I looked at any of this), I believe
> Unicode bookmark strings work deliberately (not accidentally)
> - I think this came up early on as an issue, and encoding-form
> conversion was implemented to ensure that it works. 
> (It's possible there are bugs, of course, but it was _supposed_ to work!)

I have restored the reencoding of PDF strings, since without it we don't
get correct bookmarks with the hyperref package.
The destination in pdf:dest is also reencoded. Thus
/D
and
/Names[7 0 R]
in Heiko's example.

Thanks,
Akira





Re: [XeTeX] Anchor names

2011-11-05 Thread Heiko Oberdiek
On Sun, Nov 06, 2011 at 12:57:12AM +0900, Akira Kakuto wrote:

> > > > I have disabled the reencoding of PDF strings to UTF-16 in xdvipdfmx:
> > > > TL trunk r24508.
> > > > Now
> > > > /D
> > > > and
> > > > /Names[7 0 R]
> 
> We can make both of the above UTF-16BE with BOM
> by reencoding both of them. Which do you think is better?

The main problem is that arbitrary byte strings are needed.
Example with a reference to a destination in another file:

\catcode`\{=1
\catcode`\}=2
\pdfpagewidth=100bp
\pdfpageheight=200bp
\shipout\vbox{%
  \kern-1in\relax
  \hbox{%
\kern-1in\relax
\vrule width0pt height200bp depth0pt\relax
% Link annotation at (150bp,50bp)
\raise130bp\hbox to 0pt{%
   \kern70bp %
   \kern-2bp
   \special{%
 pdf:ann width 4bp height 2bp depth 2bp<<%
   /Type/Annot%
   /foo/ab#abc
   /Subtype/Link%
   /Border[0 0 1]%
   /C[0 0 1]% blue border
   /A<<%
 /S/GoToR%%
 /F(t.tex)%
 /D<66f6f8>% 
 % Result: <66f6f8>, but ** WARNING ** Failed to convert input 
string to UTF16...
 % /D%
 % Result: 
   >>%
 >>%
   }%
   \vrule width4bp height2bp depth2bp\relax
   \hss
}%
  }%
}
\end

It seems that *all* literal strings are affected by the
unhappy reconversions. But the PDF specification lets no choice,
there are various places for byte strings.
In the example, if the file name is byte string XY and the destination Z,
then the file name is XY and the destination is Z and nothing else. Otherwise
neither the file nor the destination will be found.

Thus either (XeTeX/)xdvipdfmx finds a way for specifying arbitrary
byte strings (at least for PDF strings(/streams)) -- it is a
requirement of the PDF specification. Or we have to conclude 
that 8-bit is not supported and that means US-ASCII.

Yours sincerely
  Heiko Oberdiek




Re: [XeTeX] Anchor names

2011-11-05 Thread Jonathan Kew

On 5 Nov 2011, at 15:24, Heiko Oberdiek wrote:

> On Sat, Nov 05, 2011 at 02:45:32PM +, Jonathan Kew wrote:
> 
>> On 5 Nov 2011, at 10:24, Akira Kakuto wrote:
>> 
>>> Dear Heiko,
>>> 
 Conclusion:
 * The encoding mess with 8-bit characters remains even with XeTeX.
>>> 
>>> I have disabled to reencode pdf strings to UTF-16 in xdvipdfmx: TL trunk 
>>> r24508.
>>> Now
>>> /D
>>> and
>>> /Names[7 0 R]
> 
> Thanks Akira. But caution, it could break bookmark strings that
> currently work more or less accidentally, sometimes with warnings.

IIRC (it's a while since I looked at any of this), I believe Unicode bookmark 
strings work deliberately (not accidentally) - I think this came up early on as 
an issue, and encoding-form conversion was implemented to ensure that it works. 
(It's possible there are bugs, of course, but it was _supposed_ to work!)

> Perhaps the problem can be solved with a syntax extension, see below.
> 
>> Unfortunately, I have not had time to follow this thread in detail or
>> investigate the issue properly, but I'm concerned this may break other
>> things that currently work, and rely on this conversion between the
>> encoding form in \specials, and the representation needed in PDF.
>> 
>> However, by way of background: xetex was never intended to be a tool for
>> reading and writing arbitrary binary files.
> 
> The PDF file format is a binary file format. To some degree us-ascii
> can be used, but at the cost of flexibility and some restrictions.

Yes, PDF is a binary format; xetex was not designed to write PDF. It writes its 
output as XDV - also a binary format, of course, but a very specific one 
designed for this purpose - and XDV provides an extension mechanism that 
involves writing "special" strings that a driver is expected to understand. The 
key issue is that the "special" strings xetex writes are Unicode strings, not 
byte strings.

> 
>> It is a tool for processing
>> text, and is specifically based on Unicode as the encoding for text, with
>> UTF-8 being its default/preferred encoding form for Unicode, and (more
>> importantly) the ONLY encoding form that it uses to write output files.
>> It's possible to READ other encoding forms (UTF-16), or even other
>> codepages, and have them mapped to Unicode internally, but output is
>> always written as UTF-8.
>> 
>> Now, this should include not only .log file and \write output, but also
>> text embedded in the .xdv output using \special. Remember that \special
>> basically writes a sequence of *characters* to the output, and in xetex
>> those characters are *Unicode* characters. So my expectation would be that
>> arbitrary Unicode text can be written using \special, and will be
>> represented using UTF-8 in the argument of the xxxN operation in .xdv. 
> 
> That means that arbitrary bytes can't be written using \special,
> a restriction that does not exist in vanilla TeX.

That's correct. Perhaps regrettable, but that was the design. The argument of 
\special{} is ultimately represented, after macro expansion, etc, as 
(Unicode) text, and Unicode text != arbitrary bytes.

> 
>> If
>> that \special is destined to be converted to a fragment of PDF data by the
>> xdv-to-pdf output driver (xdvipdfmx), and needs a different encoding form,
>> I'd expect the driver to be responsible for that conversion.
> 
> Suggestions for some of PDF's data structures:
> 
> * Strings: It seems that both (...) and the hex form <...> can be
>  used. In the hex form spaces are ignored, thus a space right
>  after the opening angle could be used for a syntax extension.
>  In this case the driver unescapes the hex string to get the
>  byte string without reencoding to Unicode.
>  Example:
>  \special{pdf:dest < c3a46e6368c3b872> [...]}
>The destination name would be "änchør" as byte string in UTF-8.
>  \special{pdf:dest < e46e6368f872> [...]}
>The destination name would be "änchør" as byte string in latin1.

I don't understand this proposal. How can you (or rather, a driver) tell which 
encoding is the intended interpretation of an arbitrary sequence of byte values?

>  \special{pdf:dest  [...]}
>The destination name would be the result of the current
>implementation.
> 
> * Streams (\special{pdf: object ...<<...>>stream...endstream}):
>  Instead of the keyword "stream" "hexstream" could be introduced.
>  The driver then takes a hex string, unhexes it to get the byte
>  data for the stream, also without reencoding to Unicode.

I'm only vaguely aware of the various \special{}s that are supported by 
xdvipdfmx (this stuff is inherited from DVIPDFMx), but yes, I think that's 
where this issue should be fixed. But it _also_ needs the cooperation of macro 
package authors, in that macros designed to directly generate binary PDF 
streams and send them out via \special cannot be expected to work unchanged - 
they're assuming that the argument of \special{...} expands to a string of 
8-bit bytes, not a string of Unicode characters.

Re: [XeTeX] Anchor names

2011-11-05 Thread Akira Kakuto
Dear Heiko, Jonathan,

> > > I have disabled the reencoding of PDF strings to UTF-16 in xdvipdfmx:
> > > TL trunk r24508.
> > > Now
> > > /D
> > > and
> > > /Names[7 0 R]

We can make both of the above UTF-16BE with BOM
by reencoding both of them. Which do you think is better?

> Thanks Akira. But caution, it could break bookmark strings that
> currently work more or less accidentally, sometimes with warnings.

In my several tests, Japanese bookmarks are OK without warnings
if I use hyperref.  There were warnings before.

Thanks,
Akira





Re: [XeTeX] Anchor names

2011-11-05 Thread Heiko Oberdiek
On Sat, Nov 05, 2011 at 02:45:32PM +, Jonathan Kew wrote:

> On 5 Nov 2011, at 10:24, Akira Kakuto wrote:
> 
> > Dear Heiko,
> > 
> >> Conclusion:
> >> * The encoding mess with 8-bit characters remains even with XeTeX.
> > 
> > I have disabled the reencoding of PDF strings to UTF-16 in xdvipdfmx:
> > TL trunk r24508.
> > Now
> > /D
> > and
> > /Names[7 0 R]

Thanks Akira. But caution: it could break bookmark strings that
currently work more or less accidentally, sometimes with warnings.
Perhaps the problem can be solved with a syntax extension, see below.

> Unfortunately, I have not had time to follow this thread in detail or
> investigate the issue properly, but I'm concerned this may break other
> things that currently work, and rely on this conversion between the
> encoding form in \specials, and the representation needed in PDF.
> 
> However, by way of background: xetex was never intended to be a tool for
> reading and writing arbitrary binary files.

The PDF file format is a binary file format. To some degree us-ascii
can be used, but at the cost of flexibility and some restrictions.

> It is a tool for processing
> text, and is specifically based on Unicode as the encoding for text, with
> UTF-8 being its default/preferred encoding form for Unicode, and (more
> importantly) the ONLY encoding form that it uses to write output files.
> It's possible to READ other encoding forms (UTF-16), or even other
> codepages, and have them mapped to Unicode internally, but output is
> always written as UTF-8.
> 
> Now, this should include not only .log file and \write output, but also
> text embedded in the .xdv output using \special. Remember that \special
> basically writes a sequence of *characters* to the output, and in xetex
> those characters are *Unicode* characters. So my expectation would be that
> arbitrary Unicode text can be written using \special, and will be
> represented using UTF-8 in the argument of the xxxN operation in .xdv. 

That means that arbitrary bytes can't be written using \special,
a restriction that does not exist in vanilla TeX.

> If
> that \special is destined to be converted to a fragment of PDF data by the
> xdv-to-pdf output driver (xdvipdfmx), and needs a different encoding form,
> I'd expect the driver to be responsible for that conversion.

Suggestions for some of PDF's data structures:

* Strings: It seems that both (...) and the hex form <...> can be
  used. In the hex form spaces are ignored, thus a space right
  after the opening angle could be used for a syntax extension.
  In this case the driver unescapes the hex string to get the
  byte string without reencoding to Unicode.
  Example:
  \special{pdf:dest < c3a46e6368c3b872> [...]}
The destination name would be "änchør" as byte string in UTF-8.
  \special{pdf:dest < e46e6368f872> [...]}
The destination name would be "änchør" as byte string in latin1.
  \special{pdf:dest  [...]}
The destination name would be the result of the current
implementation.

* Streams (\special{pdf: object ...<<...>>stream...endstream}):
  Instead of the keyword "stream" "hexstream" could be introduced.
  The driver then takes a hex string, unhexes it to get the byte
  data for the stream, also without reencoding to Unicode.

> What I would NOT expect to work is for a TeX macro package to generate
> arbitrary binary data (byte streams) and expect these to be passed
> unchanged to the output. I suspect that's what Heiko's macros probably do,
> and it worked in pdftex where "tex character" == "byte", but it's
> problematic when "tex character" == "Unicode character".

Yes, that's the problem. PDF is a binary format, not a Unicode text format.

Yours sincerely
  Heiko Oberdiek




Re: [XeTeX] Anchor names

2011-11-05 Thread Jonathan Kew
On 5 Nov 2011, at 10:24, Akira Kakuto wrote:

> Dear Heiko,
> 
>> Conclusion:
>> * The encoding mess with 8-bit characters remains even with XeTeX.
> 
> I have disabled the reencoding of PDF strings to UTF-16 in xdvipdfmx:
> TL trunk r24508.
> Now
> /D
> and
> /Names[7 0 R]
> 
> Thanks,
> Akira

Unfortunately, I have not had time to follow this thread in detail or 
investigate the issue properly, but I'm concerned this may break other things 
that currently work, and rely on this conversion between the encoding form in 
\specials, and the representation needed in PDF.

However, by way of background: xetex was never intended to be a tool for 
reading and writing arbitrary binary files. It is a tool for processing text, 
and is specifically based on Unicode as the encoding for text, with UTF-8 being 
its default/preferred encoding form for Unicode, and (more importantly) the 
ONLY encoding form that it uses to write output files. It's possible to READ 
other encoding forms (UTF-16), or even other codepages, and have them mapped to 
Unicode internally, but output is always written as UTF-8.

Now, this should include not only .log file and \write output, but also text 
embedded in the .xdv output using \special. Remember that \special basically 
writes a sequence of *characters* to the output, and in xetex those characters 
are *Unicode* characters. So my expectation would be that arbitrary Unicode 
text can be written using \special, and will be represented using UTF-8 in the 
argument of the xxxN operation in .xdv. If that \special is destined to be 
converted to a fragment of PDF data by the xdv-to-pdf output driver 
(xdvipdfmx), and needs a different encoding form, I'd expect the driver to be 
responsible for that conversion.

What I would NOT expect to work is for a TeX macro package to generate 
arbitrary binary data (byte streams) and expect these to be passed unchanged to 
the output. I suspect that's what Heiko's macros probably do, and it worked in 
pdftex where "tex character" == "byte", but it's problematic when "tex 
character" == "Unicode character".

JK






Re: [XeTeX] Anchor names

2011-11-05 Thread Akira Kakuto
Dear Heiko,

> > >>> Conclusion:
> > >>> * The encoding mess with 8-bit characters remains even with XeTeX.

I have disabled the reencoding of PDF strings to UTF-16 in xdvipdfmx: TL trunk r24508.
Now
/D
and
/Names[7 0 R]

Thanks,
Akira





Re: [XeTeX] Anchor names

2011-11-04 Thread Heiko Oberdiek
On Sat, Nov 05, 2011 at 11:59:29AM +1100, Ross Moore wrote:

> >>> Conclusion:
> >>> * The encoding mess with 8-bit characters remains even with XeTeX.
> >> 
> >> Well, surely it is manifest only in the driver part:  xdvipdfmx
> > 
> > No, the problem is in both parts. XeTeX can only write UTF-8,
> > the death for binary data.
> 
> But the bytes need to be encoded anyway, as hexadecimal.
> So why cannot this be done before writing out the resulting string?

See my example file; it gets reencoded.

> >>> Then I tried to be clever and a workaround by using
> >>> /D for the link name in the source.
> >>> But it got converted and the PDF file still contains:
> >>> /D
> >>> 
> >>> Only the other way worked:
> >>> 
> >>> \special{pdf:dest  ...}
> >>> \special{pdf:ann ... /D(änchør) ...}
> 
>  ... as this seems to be doing.
> I'd vote for *always* doing  pdf:dest  this way. 
> Then for consistency, do  pdf:ann  as if UTF-16BE  also.

It might be an accident that this way has worked. If the
bug is fixed, then it might only work the other way, or
no way at all, or ...
  Also, instead of "http://.../test.pdf#Introduction" you
would have to write something like
  "http://.../test.pdf#%FE%FF%49%6E%74%72%6F%64%75%63%74%69%6F%6E"
Somehow I fail to see that as an improvement.

> >> OK. 
> >> Glad you did this test.
> >> It shows two things:
> >> 
> >>  1.  that such text strings may well be valid for Names,
> >>  and that the PDF spec. is unclear about this;
> > 
> > I can't follow. Both string representations are covered
> > by the PDF specification, a literal string can be
> > specified in parentheses with an escaping mechanism (backslash)
> > or given as hex string in angle brackets. Unclear is the
> > syntax of the argument for \special{pdf:dest ...}.
> 
> Agreed.
> Can we standardise on the way that *looks like*  UTF-16BE with BOM.

That's a higher level, and it's an artificial restriction of the kind
that started this thread.
  Already the lower syntax level is unclear. The best solution would be
a syntax that is specified, implemented, and supported and that allows
byte strings. That means someone has to
dig into the sources and do some work, write some documentation ...

> >>  2.  these UTF16-BE strings are *not* equivalent to other
> >>  ways of encoding Name objects, after all.
> >> 
> >> This is something that should be reported as a bug to Adobe.
> > 
> > There is no problem with the PDF specification. A destination
> > name is a byte string. You can use UTF-16BE, invalid UTF-8,
> > a mixture of UTF-32BE with us-ascii, ... all are valid byte strings.
> > The problem is with xdvipdfmx that recodes the UTF-8 string
> > provided by XeTeX's specials in different ways.
> 
> Then convert the UTF-8 to the encoded HeX of the corresponding UTF16-BE,
> before passing it to  xdvipdfmx .
> 
> Surely that is feasible?

Except that the behaviour of 8-bit characters in destination strings
is unspecified and undocumented. It makes more sense to address
the problem upstream first.

> > pdfTeX is fine, because it doesn't reencode the strings.
> > Also \pdfescapestring, \pdfescapename, \pdfescapehex
> > are available for syntactically correct literal strings.
> 
> I've not used these primitives.
> Didn't you use to do such conversions within hyperref?

Of course hyperref uses such conversions, these are required
by the PDF specification.

> Or with other utility packages in the 'oberdiek' bundle?

If a package misses such a necessary escaping, please file a bug
report.

Yours sincerely
  Heiko Oberdiek




Re: [XeTeX] Anchor names

2011-11-03 Thread Heiko Oberdiek
On Fri, Nov 04, 2011 at 07:31:02AM +1100, Ross Moore wrote:

> On 04/11/2011, at 1:58 AM, Heiko Oberdiek wrote:
> 
> > Hello,
> > 
> > to get more to the point, I start a new thread.
> 
> Yes, very good idea.
> 
> > As we have learned, the PDF specification uses byte strings
> > for anchor names. And there is a wish to use "normal" characters
> > in anchor names.
> 
> Within the (La)TeX source, yes!
> Of course it needs to be encoded to be safe within the PDF.

That's the problem: an anchor name can also be used as an
"official" part of the PDF file, because it can be referenced, e.g.:
  mybeautifuldocument.pdf#Introduction

> 
> > Let's make an example:
> > 
> > xetex --ini --output-driver='xdvipdfmx -V4' test
> > 
> >  \special{pdf:dest (änchør) [@thispage /XYZ @xpos @ypos null]}%
> 
> >   \special{%
> > pdf:ann width 4bp height 2bp depth 2bp<<%
> >   /Type/Annot%
> >   /Subtype/Link%
> >   /Border[0 0 1]%
> >   /C[0 0 1]% blue border
> >   /A<<%
> > /S/GoTo%
> > /D(änchør)%
> 
> > The link is not working. Looking into the PDF file we can find
> > the link annotation:
> > 
> >  4 0 obj
> >  <<
> >  /Type/Annot
> >  /Subtype/Link
> >  /Border[0 0 1]
> >  /C[0 0 1]
> >  /A<<
> >  /S/GoTo
> >  /D
> 
> In my reading of the PDF Spec. I came to the conclusion
> that this UTF-16BE based format is not supported for Name objects.
> 
> But maybe I'm wrong here.

My understanding is that it does not matter whether the byte
string could be interpreted in some encoding. The characters
are just bytes. Also, the keys in the /Dests name tree
are compared at the byte level. Thus a name encoded as
UTF-8, ISO-8859-1, or UTF-16BE gives different strings and thus
different names.

> > Destination:  ==> UTF-8
> > Link annot.:  ==> UTF-16BE with BOM
> 
> The spec reads that differences in Literal strings are allowed,
> provided that they convert to the same thing in Unicode.
> So there must be an internal representation that Adobe uses,
> but is not visible to us, as builders of PDF documents.

Where, which section?

A literal string can be written in different ways at the syntax level:

  (test) = <74657374> = (\164\145\163\164) = (\164e\163t)

Probably you are referring to the "Text String Type" used for
the text in the bookmarks, the document information and other
places. These strings can be encoded either in PDFDocEncoding
or UTF-16BE with BOM.

> > Conclusion:
> > * The encoding mess with 8-bit characters remains even with XeTeX.
> 
> Well, surely it is manifest only in the driver part:  xdvipdfmx

No, the problem is in both parts. XeTeX can only write UTF-8,
which is fatal for binary data.

> > Then I tried to be clever and a workaround by using
> > /D for the link name in the source.
> > But it got converted and the PDF file still contains:
> > /D
> > 
> > Only the other way worked:
> > 
> >  \special{pdf:dest  ...}
> >  \special{pdf:ann ... /D(änchør) ...}
> 
> OK. 
> Glad you did this test.
> It shows two things:
> 
>   1.  that such text strings may well be valid for Names,
>   and that the PDF spec. is unclear about this;

I can't follow. Both string representations are covered
by the PDF specification, a literal string can be
specified in parentheses with an escaping mechanism (backslash)
or given as hex string in angle brackets. Unclear is the
syntax of the argument for \special{pdf:dest ...}.

>   2.  these UTF16-BE strings are *not* equivalent to other
>   ways of encoding Name objects, after all.
> 
> This is something that should be reported as a bug to Adobe.

There is no problem with the PDF specification. A destination
name is a byte string. You can use UTF-16BE, invalid UTF-8,
a mixture of UTF-32BE with us-ascii, ... all are valid byte strings.
The problem is with xdvipdfmx that recodes the UTF-8 string
provided by XeTeX's specials in different ways.

> Can you produce a set of 3 or more PDFs that show the different 
> behaviours ?
> 
> Better still: a single PDF that illustrates the (non-)working
> of hyperlinks according to the encodings of the Name objects
> and Destinations.

Save my example as "test.tex" and run
"xetex --ini --output-driver='xdvipdfmx -V4' test"
(I miss an easy switch for XeTeX to set the PDF version).
With PDF-1.4 object stream compression is not available
and the PDF file can be analyzed directly using a simple
text viewer. (Otherwise the destination and annotation
objects are compressed).

> Do it both with XeTeX and pdfTeX (with appropriate inputenc, 
> to handle the UTF8 input), to test whether there are any 
> differences.  

pdfTeX is fine, because it doesn't reencode the strings.
Also \pdfescapestring, \pdfescapename, \pdfescapehex
are available for syntactically correct literal strings.

> I've not tested pdfTeX yet, because of the extra macro layer
> required. Does  hyperref  handle the required conversions then? 

It depends on which part of hyperref you are looking at.

Yours sincerely
  Heiko Oberdiek

Re: [XeTeX] Anchor names

2011-11-03 Thread Ross Moore
Hi Heiko,

On 04/11/2011, at 1:58 AM, Heiko Oberdiek wrote:

> Hello,
> 
> to get more to the point, I start a new thread.

Yes, very good idea.

> As we have learned, the PDF specification uses byte strings
> for anchor names. And there is a wish to use "normal" characters
> in anchor names.

Within the (La)TeX source, yes!
Of course it needs to be encoded to be safe within the PDF.

> Let's make an example:
> 
> xetex --ini --output-driver='xdvipdfmx -V4' test
> 
>  \special{pdf:dest (änchør) [@thispage /XYZ @xpos @ypos null]}%

>   \special{%
> pdf:ann width 4bp height 2bp depth 2bp<<%
>   /Type/Annot%
>   /Subtype/Link%
>   /Border[0 0 1]%
>   /C[0 0 1]% blue border
>   /A<<%
> /S/GoTo%
> /D(änchør)%

> The link is not working. Looking into the PDF file we can find
> the link annotation:
> 
>  4 0 obj
>  <<
>  /Type/Annot
>  /Subtype/Link
>  /Border[0 0 1]
>  /C[0 0 1]
>  /A<<
>  /S/GoTo
>  /D<feff00e4006e0063006800f80072>

In my reading of the PDF spec, I came to the conclusion
that this UTF-16BE-based format is not supported for Name objects.

But maybe I'm wrong here.

>  >>
>  /Rect[68 48 72 52]
>  >>
>  endobj
> 
> and the destination:
> 
> 7 0 obj
> [3 0 R/XYZ 30 150 null]
> endobj
> 8 0 obj
> <<
> /Names[<c3a46e6368c3b872> 7 0 R]
> >>
> endobj


> The positions of both the link annotation and the destination are perfect.
> The name for "änchør" is given both times as hexadecimal string.
> That's ok, too. But the names are different:
> 
> Destination:  <c3a46e6368c3b872>  ==> UTF-8
> Link annot.:  <feff00e4006e0063006800f80072>  ==> UTF-16BE with BOM

The spec says that different encodings of literal strings are
allowed, provided that they convert to the same Unicode text.
So there must be an internal representation that Adobe uses,
but it is not visible to us, as builders of PDF documents.

> 
> Conclusion:
> * The encoding mess with 8-bit characters remains even with XeTeX.

Well, surely it is manifest only in the driver part:  xdvipdfmx

> Then I tried to be clever and use a workaround, writing
> /D<c3a46e6368c3b872> for the link name in the source.
> But it got converted and the PDF file still contains:
> /D<feff00e4006e0063006800f80072>
> 
> Only the other way worked:
> 
>  \special{pdf:dest  ...}
>  \special{pdf:ann ... /D(änchør) ...}

OK. 
Glad you did this test.
It shows two things:

  1.  that such text strings may well be valid for Names,
  and that the PDF spec. is unclear about this;

  2.  these UTF16-BE strings are *not* equivalent to other
  ways of encoding Name objects, after all.

This is something that should be reported as a bug to Adobe.

Can you produce a set of 3 or more PDFs that show the different 
behaviours ?

Better still: a single PDF that illustrates the (non-)working
of hyperlinks according to the encodings of the Name objects
and Destinations.

Do it both with XeTeX and pdfTeX (with appropriate inputenc, 
to handle the UTF8 input), to test whether there are any 
differences.  
I've not tested pdfTeX yet, because of the extra macro layer
required. Does  hyperref  handle the required conversions then? 

> 
> Result:
> * Even for nice short names the size is doubled and increased
>  by two bytes.
> * Asymmetrical behaviour of \special commands.
> * No documentation.
> * Unfair: arbitrary byte strings can't be written.
> 
> Yours sincerely
>  Heiko Oberdiek


Thanks for looking at this in detail.


Cheers,

Ross


Ross Moore   ross.mo...@mq.edu.au 
Mathematics Department   office: E7A-419  
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia  2109  fax: +61 (0)2 9850 8114







--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


[XeTeX] Anchor names

2011-11-03 Thread Heiko Oberdiek
Hello,

to get more to the point, I start a new thread.
As we have learned, the PDF specification uses byte strings
for anchor names. And there is a wish to use "normal" characters
in anchor names. Let's make an example:

xetex --ini --output-driver='xdvipdfmx -V4' test

%%% test.tex %%%
\catcode`\{=1
\catcode`\}=2
\pdfpagewidth=100bp
\pdfpageheight=200bp
\shipout\vbox{%
  \kern-1in\relax
  \hbox{%
\kern-1in\relax
\vrule width0pt height200bp depth0pt\relax
% Destination at (30bp,150bp)
\raise150bp\hbox to 0pt{%
  \kern30bp %
  \special{pdf:dest (änchør) [@thispage /XYZ @xpos @ypos null]}%
  \kern-1bp
  \vrule width2bp height1bp depth1bp\relax
  \hss
}%
% Link annotation center at (70bp,50bp),
% rectangle at [68bp 48bp 72bp 52bp]
\raise50bp\hbox to 0pt{%
  \kern70bp %
  \kern-2bp
  \special{%
    pdf:ann width 4bp height 2bp depth 2bp<<%
      /Type/Annot%
      /Subtype/Link%
      /Border[0 0 1]%
      /C[0 0 1]% blue border
      /A<<%
        /S/GoTo%
        /D(änchør)%
      >>%
    >>%
  }%
  \vrule width4bp height2bp depth2bp\relax
  \hss
}%
  }%
}
\end
%%% test.tex %%%

The link is not working. Looking into the PDF file we can find
the link annotation:

  4 0 obj
  <<
  /Type/Annot
  /Subtype/Link
  /Border[0 0 1]
  /C[0 0 1]
  /A<<
  /S/GoTo
  /D<feff00e4006e0063006800f80072>
  >>
  /Rect[68 48 72 52]
  >>
  endobj

and the destination:

7 0 obj
[3 0 R/XYZ 30 150 null]
endobj
8 0 obj
<<
/Names[<c3a46e6368c3b872> 7 0 R]
>>
endobj

The positions of both the link annotation and the destination are perfect.
The name for "änchør" is given both times as hexadecimal string.
That's ok, too. But the names are different:

Destination:  <c3a46e6368c3b872>  ==> UTF-8
Link annot.:  <feff00e4006e0063006800f80072>  ==> UTF-16BE with BOM

Conclusion:
* The encoding mess with 8-bit characters remains even with XeTeX.

Then I tried to be clever and use a workaround, writing
/D<c3a46e6368c3b872> for the link name in the source.
But it got converted and the PDF file still contains:
/D<feff00e4006e0063006800f80072>

Only the other way worked:

  \special{pdf:dest  ...}
  \special{pdf:ann ... /D(änchør) ...}

Result:
* Even for nice short names the size is doubled and increased
  by two bytes.
* Asymmetrical behaviour of \special commands.
* No documentation.
* Unfair: arbitrary byte strings can't be written.
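The size arithmetic in the first point can be checked directly ("änchør" has six characters; a quick Python sketch of my own, not part of any tool here):

```python
# Sketch of the first bullet: UTF-16BE costs two bytes per character
# and the BOM adds two more, so a 6-character name takes
# 2*6 + 2 = 14 bytes in the link annotation.

name = "änchør"
utf16 = ("\ufeff" + name).encode("utf-16-be")

print(len(name))    # 6 characters
print(len(utf16))   # 14 bytes = 2*6 + 2 (BOM)
```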

Yours sincerely
  Heiko Oberdiek

