date:20111118

Re: [XeTeX] Whitespace in input

2011-11-18 Thread Keith J. Schultz

Hi Philip,

Throughout my programming life and experience I have learned
that internal structure means nothing, as long as the result is correct
when it comes out.

As you rightly point out, the problem lies in how TeX internally
handles space characters when adding them to its internal structure.

The fact is that TeX was not originally designed to handle modern
typesetting well. (Xe)TeX's internals are in part quite outdated. It is
possible to handle all these new types of spaces in (Xe)TeX, yet it is
quite awkward, and you have to be a TeXnician to do it properly.

My personal opinion is that TeX et al. have to be revamped completely.
Ideally, it should get a natural language parser as a front end, with the
typesetting module as its back end for output.

Yes, I know this would not be TeX any more and would require a completely
different structure for the TeX ecosystem: language modules and the like.
If you care to discuss this, we can take it to a back channel, as it would
be too OT here.

regards
Keith.

Am 17.11.2011 um 20:56 schrieb Philip TAYLOR:

 Ross, I do not dispute your arguments : I was answering
 Keith's question in an honest way.  I (personally) do not
 think of a space in TeX output as a character at all,
 because I am steeped in TeX philosophy; but I am quite
 willing to accept that /if/ the objective is not to
 produce output for the sake of output, but output for
 subsequent processing as input by another program, then
 there /may/ be an argument for outputting a space as a
 variable-width glyph.
 
 However, I do think that what appears in the output stream
 is a secondary consideration; far more important (IMHO) is
 how we represent that space /within XeTeX/.  There is, I am
 sure, not a suggestion on the table that we start to treat
 a conventional space in XeTeX other than as TeX has traditionally
 treated it, and therefore the real question is (to my mind),
 do we adopt an extension of this traditional TeX treatment
 for non-breaking space, thin-space, and any of the other
 not-quite-standard spaces that Unicode encompasses, or do
 we look for an alternative model which /might/ be glyph-
 or character-based ?.




--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex

[XeTeX] TeX in the modern World. (goes OT) Was: Re: Whitespace in input

2011-11-18 Thread Keith J. Schultz

Hi All,

Sorry, I go OT here, but it is necessary for the debate.
Please forgive me.

I have to side more with Philip.

What most are forgetting is what (Xe)TeX is intended for.
It is, for the most part, a typesetting program (you do mention this below).
It was not designed to handle different languages or to do
word processing in the modern sense.

Thanks to the power of the TeX engine, it evolved to deal with different
languages and with newer output methods and encodings. The problem with TeX
is that the basic engine has not been redesigned to handle these new
developments well. The internals need to be completely revamped.

Am 17.11.2011 um 20:36 schrieb Ross Moore:

 Hi Phil,
 
 On 17/11/2011, at 23:53, Philip TAYLOR p.tay...@rhul.ac.uk wrote:
 
 Keith J. Schultz wrote:
 
 You mention in a later post that you do consider a space as a printable 
 character.
This line should read as:
  You mention in a later post that you consider a space as a 
 non-printable character.
 
 No, I don't think of it as a character at all, when we are talking
 about typeset output (as opposed to ASCII (or Unicode) input).  
 
 This is fine, when all that you require of your output is that it be visible
 on a printed page. But modern communication media goes much beyond that.
 A machine needs to be able to tell where words and lines end, reflowing
 paragraphs when appropriate and able to produce a flat extraction of all the
 text, perhaps also with some indication of the purpose of that text (e.g. by
 structural tagging).
I would agree with you, but TeX was not designed as a communications
program; it was designed for creating printed media.
Furthermore, it may be desirable in the modern world to have every
program's output usable as input for another program.
This ideal is utopian. If you need to pass the output of one program (medium)
to another, then you will need an intermediate program/filter
to reformat/convert the differences. As with all types of
communication, there will be structures missing or lacking in the other
system, so a one-to-one conversion will not be possible. You will need
to use some kind of heuristics or, in modern terms, intelligence.
 
 In short, what is output for one format should also be able to serve as input 
 for another.
This assertion is completely idealistic. Then again, it is true. It is
possible today to design a system that goes from audio, to TeX, to printed
documents, to audio again. Yet you will need a lot of effort, and most likely
the results will be far from perfect. It is workable, though it requires
considerable resources.
 
 Thus the space certainly does play the role of an output character – though 
 the presence of a gap in the positioning of visible letters may serve this 
 role in many, but not all, circumstances.
This depends on what you are outputting. For a printed page that is
consumed by a human it does not matter, because humans do not process space
characters, just space, and at times they even
ignore it completely, because it is irrelevant to their
natural language processing.
For computers, on the other hand, the use of a space character can be
very relevant.

In the early days of TeX and LaTeX I knew people who created their
e-mail with TeX. So you can see that TeX is capable of producing
character-based output.
Furthermore, TeX could be used to produce any character-based
format as its output.
 
 Clearly
 it is a character on input, but unless it generates a glyph in the
 output stream (which TeX does not, for normal spaces) then it is not
 a character (/qua/ character) on output but rather a formatting
 instruction not dissimilar to (say) end-of-line.
 
 But a formatting instruction for one program cannot serve as reliable input 
 for another.
 A heuristic is then needed, to attempt to infer that a programming 
 instruction must have been used, and guess what kind of instruction it might 
 have been. This is not 100% reliable, so is deprecated in modern methods of 
 data storage and document formats.
Are you not contradicting yourself here? See above.
 XML based formats use tagging, rather than programming instructions. This is
 the modern way, which is used extensively for communicating data between
 different software systems.
True, it is used for communicating data. Yet you are mistaken in
thinking that it truly solves any of the problems involving different data
types or content!
You can get a parse tree of the data, yet if a program cannot
understand or process the data/content, it is useless.
Agreed, the XML file contains information about its structure and is
human readable, yet it does NOTHING to convert from one format to another.
You still need a parser/filter to
convert into another format.
Do not forget you can put practically anything in an XML file; a 
program, image,

Re: [XeTeX] Whitespace in input

2011-11-18 Thread Zdenek Wagner

2011/11/18 Keith J. Schultz keithjschu...@web.de:
 Hi Pihilip,

 Thoughout, my programming life and experience I have learned
 that internal structure means nothing, as long as the result is correct
 when it comes out.

 As you rightfully point out the problem lies inside how TeX internally
 handles space characters when adding them to its internal structure.

 The fact is that initially, TeX was not designed to handle modern typesetting
 well. (Xe)TeX's internals are partially quite outdated. It is possible to to 
 handle
 all this new type of spaces in (Xe)TeX, yet it is quite awkward and you 
 have to be
 a TeXchian to do it properly.

 My personal opinion is that TeX et al. has to be revamped completely. 
 Ideally, it should get
 a natural language parser as a front end and the typesetting module as its 
 back-end for its
 output.

I admit that things could be done better than in today's TeX, but a
complete revamp seems to me a bad investment. I would rather think
of an FO processor.

 Yes, I know this would not be TeX any more and require a complete different 
 structure of the
 TeX eco-system. Language modules and the like. I you care to discuss this we 
 cam back channel
 as it would be to OT, here.

 regards
        Keith.

 Am 17.11.2011 um 20:56 schrieb Philip TAYLOR:

 Ross, I do not dispute your arguments : I was answering
 Keith's question in an honest way.  I (personally) do not
 think of a space in TeX output as a character at all,
 because I am steeped in TeX philosophy; but I am quite
 willing to accept that /if/ the objective is not to
 produce output for the sake of output, but output for
 subsequent processing as input by another program, then
 there /may/ be an argument for outputting a space as a
 variable-width glyph.

 However, I do think that what appears in the output stream
 is a secondary consideration; far more important (IMHO) is
 how we represent that space /within XeTeX/.  There is, I am
 sure, not a suggestion on the table that we start to treat
 a conventional space in XeTeX other than as TeX has traditionally
 treated it, and therefore the real question is (to my mind),
 do we adopt an extension of this traditional TeX treatment
 for non-breaking space, thin-space, and any of the other
 not-quite-standard spaces that Unicode encompasses, or do
 we look for an alternative model which /might/ be glyph-
 or character-based ?.








-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz




Re: [XeTeX] TeX in the modern World. (goes OT) Was: Re: Whitespace in input

2011-11-18 Thread Zdenek Wagner

2011/11/18 Keith J. Schultz keithjschu...@web.de:
 Hi All,
 Sorry, I go OT here, but in order to debate it is necessary.
 Please forgive.

Hi all,
I agree with Keith, I have just a few comments.

 I have to side more with Philip.
 What most are forgetting is what (Xe)TeX is intended for.
 It is for most a typesetting program(you do mention this below).
 It was not designed to handle different languages or actually truly
 do wordprocessing in the modern sense.
 Due to the power of the TeX engine, it evolved to deal with different
 languages
 and newer output methods and encodings. The problem with TeX that the basic
 engine has not been redesigned to handle these new developments well.
 The internals need to be completely revamped.
 Am 17.11.2011 um 20:36 schrieb Ross Moore:

 Hi Phil,
 On 17/11/2011, at 23:53, Philip TAYLOR p.tay...@rhul.ac.uk wrote:

 Keith J. Schultz wrote:

 You mention in a later post that you do consider a space as a printable
 character.

This line should read as:

  You mention in a later post that you consider a space as a
 non-printable character.

 No, I don't think of it as a character at all, when we are talking
 about typeset output (as opposed to ASCII (or Unicode) input).

 This is fine, when all that you require of your output is that it be visible
 on
 a printed page. But modern communication media goes much beyond that.
 A machine needs to be able to tell where words and lines end, reflowing
 paragraphs when appropriate and able to produce a flat extraction of all the
 text, perhaps also with some indication of the purpose of that text (e.g. by
 structural tagging).

OK, tagged PDF is an option, but it is an optional feature; it is not
enforced. You can never be sure that the PDF you get as input will
be tagged. Even if spaces were stored as glyphs, the original structure
would be lost. I typeset documents where even a paragraph is originally
a nested structure of elements...

 I would agree with you, but TeX was not designed as a communications
 program, it was designed for creating printed media.
 Furthermore, it may be desirable in the Modern World to have every programs
 out used as input for another program.
 This ideal is utopia. If you need the output from one program(media) to
 another then you will need a intermediate program/filter
 in order to reformat/convert the differences. As with all types of
 communication there will be structures missing/lacking in the other
 system. So a one to one conversion will not be possible. You will need to
 use some kind of heuristics or in modern terms intelligence.

 In short, what is output for one format should also be able to serve as
 input for another.

 This assertion is completely idealistic. Then again, it is true. It is
 possibly, today, to design a system that goes from audio, to TeX, to printed
 documents
 to audio again. Yet, you will need a lot of effort and most likely the
 results will be far from perfect. Though it is workable and require
 considerable
 resources.

 Thus the space certainly does play the role of an output character - though
 the presence of a gap in the positioning of visible letters may serve this
 role in many, but not all, circumstances.

 This depends on what you are outputting. For a printed page and is consumed
 by a human it goes not matter, because humans do not process space
 characters just space, and they even
 at times ignore them completely, because it is irrelevant for their natural
 language processing.
 For computers on the other hand the use of a space character can be very
 relevant.
 In the early days of TeX and LaTeX I have know people to create their e-mail
 with TeX. So you can see TeX is capable of outputting character based
 output.
 Furthermore, TeX could be used to produce any form of character based
 formats as its output.

 Clearly
 it is a character on input, but unless it generates a glyph in the
 output stream (which TeX does not, for normal spaces) then it is not
 a character (/qua/ character) on output but rather a formatting
 instruction not dissimilar to (say) end-of-line.

 But a formatting instruction for one program cannot serve as reliable input
 for another.
 A heuristic is then needed, to attempt to infer that a programming
 instruction must have been used, and guess what kind of instruction it might
 have been. This is not 100% reliable, so is deprecated in modern methods of
 data storage and document formats.

 Are you not contradicting yourself here! See above.

 XML based formats use tagging, rather that programming instructions. This is
 the modern way, which is used extensively for communicating data between
 different software systems.

 True it is used, for communicating data. Yet, you are misconceived in
 thinking that it truly solves any of the problems involved different data
 types or content!
 You can get a parse tree of the data, yet if a program can not understand or
 process the data/content it is useless.
 Agreed the XML

Re: [XeTeX] Whitespace in input

2011-11-18 Thread Philip TAYLOR




Zdenek Wagner wrote:


I admit that things could be done better than in nowadays TeX but its
complete revamping seems to me as bad investment. I would rather think
of an FO processor.


And I agree with Zdeněk : this discussion will be productive only
if we focus on what can be accomplished (w.r.t. spaces) with few
or no changes to XeTeX, not on how we might best deal with the
whole (intellectually daunting) issue of optimally typesetting Unicode.

** Phil.



Re: [XeTeX] Whitespace in input

2011-11-18 Thread Ulrike Fischer

Am Fri, 18 Nov 2011 08:31:28 +1100 schrieb Ross Moore:

 Yes, that's the point. The goal of TeX is nice typographical
 appearance. The goal of XML is easy data exchange. If I want to send
 structured data, I send XML, not PDF.
 
 These days people want both.

One question which pops up regularly in the TeX-groups is how can I
insert a code listing in my pdf so that it can be copied and pasted
reliably. 

Currently this is not easy, as the heuristics of the readers can
easily lose spaces, and you can't encode tabs or a specific number of
spaces.

Real space characters in the pdf (instead of only visible space)
would help here a lot.


-- 
Ulrike Fischer 
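
The way a reader's heuristics can lose spaces can be sketched as follows. This is only an illustration under stated assumptions: the glyph records, the function name `extract_text`, and the threshold rule are hypothetical and not taken from any particular PDF reader. The sketch rebuilds a line from positioned glyphs and guesses a space wherever it finds a gap.

```python
from typing import List, Tuple

# Each glyph is (character, x position, advance width), all in the same units.
Glyph = Tuple[str, float, float]


def extract_text(glyphs: List[Glyph], gap_threshold: float = 0.2) -> str:
    """Rebuild a text line from positioned glyphs, guessing where spaces were.

    Hypothetical rule: a space is inferred whenever the gap to the previous
    glyph exceeds gap_threshold times the previous glyph's width.
    """
    out = []
    prev_end = None
    prev_width = 0.0
    for char, x, width in glyphs:
        if prev_end is not None and x - prev_end > gap_threshold * prev_width:
            out.append(" ")  # one space, however wide the gap really is
        out.append(char)
        prev_end = x + width
        prev_width = width
    return "".join(out)


# A listing line with four spaces, set in a monospaced font where every
# glyph is 6 units wide. The spaces themselves leave no glyph in the output.
source = "a    b"
glyphs = [(ch, i * 6.0, 6.0) for i, ch in enumerate(source) if ch != " "]
recovered = extract_text(glyphs)
print(repr(source), "->", repr(recovered))
```

With this rule the four spaces come back as a single one, which is the kind of loss described above; a real space character stored in the PDF would make the guess unnecessary.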




Re: [XeTeX] Whitespace in input

2011-11-18 Thread Philip TAYLOR


Is it safe to assume that these code listings
are restricted to the ASCII character set ?  If
so, yes, spaces are likely to be a problem, but
if the code listing can also include ligature-
digraphs, then these are likely to prove even
more problematic.

** Phil.

Ulrike Fischer wrote:


One question which pops up regularly in the TeX-groups is how can I
insert a code listing in my pdf so that it can be copied and pasted
reliably.

Currently this is not easy as the heuristics of the readers can
easily loose spaces, you can't encode tabs or a specific number of
spaces.

Real space characters in the pdf (instead of only visible space)
would help here a lot.




Re: [XeTeX] Whitespace in input

2011-11-18 Thread Zdenek Wagner

2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk:
 Is it safe to assume that these code listings
 are restricted to the ASCII character set ?  If
 so, yes, spaces are likely to be a problem, but
 if the code listing can also include ligature-
 digraphs, then these are likely to prove even
 more problematic.

If the code listing is typeset in a fixed-width font, it is usually no
problem. I have copied a few code samples from books in PDF, most of which
were typeset by TeX. If I want to copy text in Devanagari, it is
almost impossible. If I take just a simple Hindi word, किताब, the best
result I can get will be िकताब (you should see a dotted circle, which is
not visible in the PDF). The reason is that the first two letters are
U+0915, U+093F, but visually the latter is displayed first. After
copying you get the reversed order U+093F, U+0915. This is just one of
many problems with Devanagari. The toUnicode map does not help much
with Indian scripts. I have never tried to copy Arabic from PDF, or
even a combination of LTR and RTL within a paragraph.
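
The two orders can be inspected with Python's standard `unicodedata` module. This is a small sketch; the helper name `describe` is only illustrative, and the "copied" string is built by hand from the code points given above rather than taken from an actual PDF.

```python
import unicodedata

logical = "\u0915\u093f\u0924\u093e\u092c"  # kitab in Unicode (logical) order
copied = "\u093f\u0915\u0924\u093e\u092c"   # order reported after copying from PDF


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]


print("logical order:")
for line in describe(logical):
    print("  " + line)
print("copied order:")
for line in describe(copied):
    print("  " + line)
```

Both strings contain the same five code points; only the position of the vowel sign differs, which is enough to make the pasted word invalid.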

 ** Phipl.
 
 Ulrike Fischer wrote:

 One question which pops up regularly in the TeX-groups is how can I
 insert a code listing in my pdf so that it can be copied and pasted
 reliably.

 Currently this is not easy as the heuristics of the readers can
 easily loose spaces, you can't encode tabs or a specific number of
 spaces.

 Real space characters in the pdf (instead of only visible space)
 would help here a lot.






-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz




Re: [XeTeX] TeX in the modern World. (goes OT) Was: Re: Whitespace in input

2011-11-18 Thread Arthur Reutenauer

On Fri, Nov 18, 2011 at 10:16:31AM +0100, Keith J. Schultz wrote in
reply to Ross Moore:
   You are probably a little young to know this, but TeX's original output 
 format was a dvi file.

  I think I'll have this one framed and sent to Ross for his next
birthday.

Arthur



Re: [XeTeX] TeX in the modern World. (goes OT) Was: Re: Whitespace in input

2011-11-18 Thread Herbert Schulz


On Nov 18, 2011, at 7:57 AM, Arthur Reutenauer wrote:

 On Fri, Nov 18, 2011 at 10:16:31AM +0100, Keith J. Schultz wrote in
 reply to Ross Moore:
  You are probably a little young to know this, but TeX's original output 
 format was a dvi file.
 
  I think I'll have this one framed and sent to Ross for his next
 birthday.
 
   Arthur


Howdy,

I'll split the cost with you! :-)

Good Luck,

Herb Schulz
(herbs at wideopenwest dot com)







Re: [XeTeX] TeX in the modern World. (goes OT) Was: Re: Whitespace in input

2011-11-18 Thread Adam Twardoch (List)


 Yet, it remains one of the most
 powerful and cheapest typesetting systems to date.
Cheap in terms of initial investment -- surely, as it's open-source
and free.

Cheap in terms of implementing -- not quite so, because you need to
format your sources in a very specific, isolated syntax.

I initially tried to implement TeX in some projects of my own, and
switched to Prince XML (http://princexml.com/ )

I found it much easier to start with, as it takes HTML or XML + Unicode
+ CSS + SVG/bitmaps + OpenType fonts as input, executes JavaScript
during processing, has a rather high-quality, constantly improving
line-breaking algorithm, and produces reliable PDFs. Some aspects of it
are not quite as powerful as TeX's, but other aspects greatly surpass
TeX -- especially in terms of ease of use and quick implementation while
maintaining acceptably high quality.

So I ended up with Prince XML as my tool of choice because it natively
supports my preferred input formats, i.e. the web formats. A
commercial server license costs 3800 USD, which may sound like a lot,
but I found it a fair price to pay for the comfort of being able to use
my content directly and without much debugging/converting/fine-tuning.

Best,
Adam

-- 

May success attend your efforts,
-- Adam Twardoch
(Remove list. from e-mail address to contact me directly.)




Re: [XeTeX] Whitespace in input

2011-11-18 Thread maxwell

On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner
zdenek.wag...@gmail.com
wrote:
 2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk:
 Is it safe to assume that these code listings
 are restricted to the ASCII character set ?  If
 so, yes, spaces are likely to be a problem, but
 if the code listing can also include ligature-
 digraphs, then these are likely to prove even
 more problematic.

 If the code listing is typeset in a fixed width font, it is usually no
 problem. I copied a few code samples from books in PDF, most of them
 were typeset by TeX. If I want to copy text in Devanagari, it is
 almost impossible. 

Besides TeX, Dr. Knuth also invented Literate Programming.  In our own
project, we use LP to extract the code listings from the original source
code, rather than from the PDF.  One advantage is that in addition to the
re-ordering at the character level (mentioned in part of Zdenek's email
that I didn't copy over), this allows re-ordering at any arbitrary level,
even entire sections of program code.  (We happen to be using XML to
contain the source of both our text and our programming language
constructs, but that's a different issue.)

I agree that it would be nice to be able to reliably copy Unicode text
from the PDF, but (a) that issue isn't confined to program listings, and
(b) that would only solve the character ordering part of the problem.

   Mike Maxwell



[XeTeX] U+00AD

2011-11-18 Thread mskala

Since we're having so much fun with U+00A0, what about U+00AD, which may
or may not mean the same thing as \- ?
-- 
Matthew Skala
msk...@ansuz.sooke.bc.ca People before principles.
http://ansuz.sooke.bc.ca/
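
For reference, the Unicode properties of the characters under discussion can be listed with Python's standard `unicodedata` module. This is only a sketch: the helper name `describe` is illustrative, and the output says what Unicode assigns to each code point, not how XeTeX does or should treat it.

```python
import unicodedata


def describe(cp: int):
    """Return (code point label, Unicode name, general category)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))


# Ordinary space, no-break space, soft hyphen, and two further spaces.
for cp in (0x0020, 0x00A0, 0x00AD, 0x2009, 0x202F):
    print(*describe(cp))
```

U+00A0 is a space separator (category Zs) like the ordinary space, whereas U+00AD is a format character (category Cf), so the two are different kinds of object even at the Unicode level.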



Re: [XeTeX] Whitespace in input

2011-11-18 Thread Zdenek Wagner

2011/11/18 maxwell maxw...@umiacs.umd.edu:
 On Fri, 18 Nov 2011 13:52:56 +0100, Zdenek Wagner
 zdenek.wag...@gmail.com
 wrote:
 2011/11/18 Philip TAYLOR p.tay...@rhul.ac.uk:
 Is it safe to assume that these code listings
 are restricted to the ASCII character set ?  If
 so, yes, spaces are likely to be a problem, but
 if the code listing can also include ligature-
 digraphs, then these are likely to prove even
 more problematic.

 If the code listing is typeset in a fixed width font, it is usually no
 problem. I copied a few code samples from books in PDF, most of them
 were typeset by TeX. If I want to copy text in Devanagari, it is
 almost impossible.

 Besides TeX, Dr. Knuth also invented Literate Programming.  In our own
 project, we use LP to extract the code listings from the original source
 code, rather than from the PDF.  One advantage is that in addition to the
 re-ordering at the character level (mentioned in part of Zdenek's email
 that I didn't copy over), this allows re-ordering at any arbitrary level,

This is a demonstration that glyphs are not the same as characters. I
will start with a simpler case and will not put Devanagari into the
mail message. If you wish to write the syllable RU, you have to add a
dependent vowel (matra) U to a consonant RA. There is a ligature RU,
so in the PDF you will not see a RA consonant with a U matra but a RU glyph.
Similarly, TRA is a single glyph representing the following
characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many
mappings, thus it is possible to handle these cases when copying text
from a PDF or when searching.

A more difficult case is the I matra (short
dependent vowel I). As a character it must always follow a consonant
(this is a general rule for all dependent vowels) but visually (as a
glyph) it precedes the consonant group after which it is pronounced.
The sample word was kitab (it means a book). In Unicode (as
characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually
I-matra precedes KA. XeTeX (knowing that it works with a Devanagari
script) runs the character sequence through ICU and the result is the
glyph sequence. The original sequence is lost, so that when the text is
copied from the PDF, we get (not exactly) i*katab.

Microsoft suggested
what additional characters should appear in Indic OpenType fonts. One
of them is a dotted ring which denotes a missing consonant. I-matra
must always follow a consonant (in character order). If it is moved to
the beginning of a word, it is wrong. If you paste it into a text
editor, the OpenType rendering engine should display a missing
consonant as a dotted ring (if it is present in the font). In
character order the dotted ring will precede I-matra but in visual
(glyph) order it will be just the opposite. Thus the asterisk shows the
place where you will see the dotted circle.

This is just one simple
case. I-matra may follow a consonant group, such as in the word PRIY
(dear), which is PA+VIRAMA+RA+I-matra+YA, or STRIYOCIT (good for women),
which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both
words will start with the I-matra glyph. The latter will contain two
ordering bugs after copy-paste. Consider also the word MURTI (statue),
which is a sequence of characters
MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will
appear as an accent below the MA glyph. The next glyph will be I-matra
followed by TA followed by RA shown as an upper accent at the right
edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA
glyph appears at the end of the syllable although logically (in
character order) it belongs to the beginning.

These cases cannot be
solved by the toUnicode map because many-to-many mappings are not allowed.
Moreover, a huge number of mappings would be needed. It would be better
to do the reverse processing independently of toUnicode mappings, to use
ICU or Pango or Uniscribe or whatever to analyze the glyphs and
convert them to characters. The rules are unambiguous but AR does not
do it.
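
A minimal sketch of such reverse processing, restricted to the I-matra case, might look like the following. It is an illustration under stated assumptions, not what ICU, Pango or Uniscribe do: the function name `restore_i_matra` is hypothetical, the input is assumed to be a string of code points in visual order with ligatures already expanded, and the RA+VIRAMA case from MURTI is not handled at all.

```python
# Devanagari consonants KA (U+0915) through HA (U+0939).
CONSONANTS = range(0x0915, 0x093A)
VIRAMA = "\u094d"
I_MATRA = "\u093f"


def is_consonant(ch: str) -> bool:
    return ord(ch) in CONSONANTS


def restore_i_matra(visual: str) -> str:
    """Move each I-matra from before its consonant group to after it.

    A consonant group is taken to be: consonant (VIRAMA consonant)*.
    """
    out = []
    i, n = 0, len(visual)
    while i < n:
        if visual[i] == I_MATRA and i + 1 < n and is_consonant(visual[i + 1]):
            j = i + 2  # first consonant of the group is consumed
            while j + 1 < n and visual[j] == VIRAMA and is_consonant(visual[j + 1]):
                j += 2  # consume VIRAMA + next consonant
            out.append(visual[i + 1:j])  # the consonant group first
            out.append(I_MATRA)          # then the vowel sign
            i = j
        else:
            out.append(visual[i])
            i += 1
    return "".join(out)


# kitab: visual order I-matra, KA, TA, A-matra, BA
kitab_visual = "\u093f\u0915\u0924\u093e\u092c"
# PRIY: visual order I-matra, PA, VIRAMA, RA, YA
priy_visual = "\u093f\u092a\u094d\u0930\u092f"
print(restore_i_matra(kitab_visual))
print(restore_i_matra(priy_visual))
```

The sketch restores the character sequences given above for kitab and PRIY. A full solution would need the complete shaping rules of the script, which is why an existing shaping library is the more realistic tool.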

We discuss non-breaking spaces while we are not yet able to properly
convert printable glyphs to characters when doing copy-paste from
PDF...


-- 
Zdeněk Wagner
http://hroch486.icpf.cas.cz/wagner/
http://icebearsoft.euweb.cz




Re: [XeTeX] Whitespace in input

2011-11-18 Thread Ross Moore

Hi Zdenek,

On 19/11/2011, at 9:51 AM, Zdenek Wagner wrote:

 This is a demonstration that glyphs are not the same as characters. I
 will startt with a simpler case and will not put Devanagari to the
 mail message. If you wish to write a syllable RU, you have to add a
 dependent vowel (matra) U to a consonant RA. There is a ligature RU,
 so in PDF you will not see RA consonant with U matra but a RU glyph.
 Similarly, TRA is a single glyph representing the following
 characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many
 mappings thus it is possible to handle these cases when copying text
 from a PDF or when searching. More difficult case is I matra (short
 dependent vowel I). As a character it must always follow a consonant
 (this is a general rule for all dependent vowels) but visually (as a
 glyph) it precedes the consonant group after which it is pronounced.
 The sample word was kitab (it means a book). In Unicode (as
 characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually
 I-matra precedes KA. XeTeX (knowing that it works with a Devanagari
 script) runs the character sequence through ICU and the result is the
 glyph sequence. The original sequence is lost so that when the text is
 copied from PDF, we get (not exactly) i*katab.

/ActualText is your friend here.
You tag the content and provide the string that you want to appear
with Copy/Paste as the value associated to a dictionary key.
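
As a rough sketch of what such tagging looks like in the PDF content stream, the following Python builds a marked-content sequence carrying an /ActualText entry. The operators /Span, BDC and EMC and the UTF-16BE hex string with its FEFF byte-order mark are standard PDF constructs; the function name `actual_text_span` and the glyph string passed to Tj are hypothetical placeholders, and how to inject this from TeX macros is not shown.

```python
def actual_text_span(logical_text: str, show_ops: str) -> str:
    """Wrap text-showing operators in a /Span with an /ActualText entry.

    The replacement text is written as a PDF hex string in UTF-16BE,
    prefixed with the byte-order mark FEFF.
    """
    hex_utf16 = "FEFF" + logical_text.encode("utf-16-be").hex().upper()
    return f"/Span << /ActualText <{hex_utf16}> >> BDC\n{show_ops}\nEMC"


kitab = "\u0915\u093f\u0924\u093e\u092c"  # logical order: KA, I-matra, TA, A-matra, BA
# Placeholder glyph IDs: the real values depend on the font and the shaper.
span = actual_text_span(kitab, "<00010002000300040005> Tj")
print(span)
```

A reader that honours /ActualText should then copy the logical character sequence for the whole span, whatever order the glyphs inside it are drawn in.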

There is a macro package that can do this with pdfTeX, and it is 
a vital part of my Tagged PDF work for mathematics.
Also, I have an example where the CJK.sty package is extended
to tag Chinese characters built from multiple glyphs so that
Copy/Paste works correctly (modulo PDF reader quirks).

Not sure about XeTeX.

I once tried to talk with Jonathan Kew about what would be needed 
to implement this properly, but he got totally the wrong idea 
concerning glyphs and characters, and what was needed to be done
internally and what by macros. The conversation went nowhere.

 Microsoft suggested
 what additional characters should appear in Indic OpenType fonts. One
 of them is a dotted ring which denotes a missing consonant. I-matra
 must always follow a consonant (in character order). If it is moved to
 the beginning of a word, it is wrong. If you paste it to a text
 editor, the OpenType rendering engine should display a missing
 consonant as a dotted ring (if it is present in the font). In
 character order the dotted ring will precede I-matra but in visual
 (glyph) order it will be just opposite. Thus the asterisk shows the
 place where you will see the dotted circle. This is just one simple
 case. I-matra may follow a consonant group, such as in word PRIY
 (dear) which is PA+VIRAMA+RA+I-matra+YA or STRIYOCIT (good for women)
 which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both
 words will start with the I-matra glyph. The latter will contain two
 ordering bugs after copypaste. Consider also word MURTI (statue)
 which is a sequence of characters

This sounds like each word needs its own /ActualText .
So some intricate programming is certainly necessary.
But \XeTeXinterchartoks  (is that the right spelling?)
should make this possible.

 MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will
 appear as an accent below the MA glyph. The next glyph will be I-matra
 followed by TA followed by RA shown as an upper accent at the right
 edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA
 glyph appears at the end of the syllable although locically (in
 character order) it belongs to the beginning. These cases cannot be
 solved by toUnicode map because many-to-many mappings are not allowed.

Agreed.  /ToUnicode  is not the right PDF construction for this.

 Moreover, a huge amount of mappings will be needed. It would be better
 to do the reverse processing independent of toUnicode mappings, to use
 ICU or Pango or Uniscribe or whatever to analyze the glyphs and
 convert them to characters. The rules are unambiguous but AR does not
 do it.

Having an external pre-processor is what I do for tagging mathematics.
It seems like a similarly intricate problem here.

 
 We discuss nonbreakable spaces while we are not yet able to convert
 properly printable glyphs to characters when doing copypaste from
 PDF...

  :-)

 
 
 -- 
 Zdeněk Wagner
 http://hroch486.icpf.cas.cz/wagner/
 http://icebearsoft.euweb.cz

Hope this helps,

Ross


Ross Moore   ross.mo...@mq.edu.au 
Mathematics Department   office: E7A-419  
Macquarie University tel: +61 (0)2 9850 8955
Sydney, Australia  2109  fax: +61 (0)2 9850 8114







--
Subscriptions, Archive, and List information,

Re: [XeTeX] Whitespace in input

2011-11-18 Thread Zdenek Wagner

2011/11/19 Ross Moore ross.mo...@mq.edu.au:
 Hi Zdenek,

 On 19/11/2011, at 9:51 AM, Zdenek Wagner wrote:

 This is a demonstration that glyphs are not the same as characters. I
 will start with a simpler case and will not put Devanagari in the
 mail message. If you wish to write a syllable RU, you have to add a
 dependent vowel (matra) U to a consonant RA. There is a ligature RU,
 so in PDF you will not see RA consonant with U matra but a RU glyph.
 Similarly, TRA is a single glyph representing the following
 characters: TA+VIRAMA+RA. The toUnicode map supports 1:1 and 1:many
 mappings, so it is possible to handle these cases when copying text
 from a PDF or when searching. A more difficult case is I-matra (short
 dependent vowel I). As a character it must always follow a consonant
 (this is a general rule for all dependent vowels) but visually (as a
 glyph) it precedes the consonant group after which it is pronounced.
 The sample word was kitab (it means a book). In Unicode (as
 characters) the order is KA+I-matra+TA+A-matra(long)+BA. Visually
 I-matra precedes KA. XeTeX (knowing that it works with a Devanagari
 script) runs the character sequence through ICU and the result is the
 glyph sequence. The original sequence is lost so that when the text is
 copied from PDF, we get (not exactly) i*katab.

 /ActualText is your friend here.
 You tag the content and provide the string that you want to appear
 with Copy/Paste as the value associated to a dictionary key.

I do not know whether the PDF specification has evolved since I last
read it. /ActualText allows only single-byte characters, i.e.
those with codes between 0 and 255, not arbitrary Unicode characters.
/ActualText is demonstrated on German hyphenated words such as Zucker,
which is hyphenated as Zuk- ker. I have tried to put /ActualText in
manually via a special; I could see it in the PDF file, but it did not
work.

When converting white space to a space character, some [complex]
heuristics are needed, while proper conversion of glyphs to characters
of Indic scripts requires just a few strict rules. Ligatures such as TRA
have to appear in the toUnicode map, otherwise their meaning will be
unclear. If you see the I-matra, go to the last consonant in the
sequence and put the I-matra character after it. If you see the RA glyph
at the right edge of a syllable, go back to the leftmost consonant in
the group and prepend RA+VIRAMA there. This is all that has to be done
with Devanagari. Other Indic scripts contain two-part vowels but the
rules will be similarly simple. We should not be forced to double the
size of the PDF file. AR and other PDF rendering programs should learn
these simple rules and use them when extracting text.
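
The two rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a list of symbolic names in visual order, ligatures such as TRA have already been expanded through the toUnicode map, "REPH" is a made-up name for the RA glyph at the right edge of a syllable, and the consonant set covers only the example words from this thread. The function name visual_to_logical is likewise invented.

```python
# Only the consonants needed for the example words; a real table would be larger.
CONSONANTS = {"KA", "TA", "BA", "MA", "RA", "PA", "YA", "SA", "CA"}


def visual_to_logical(glyphs):
    """Reorder a visual-order sequence into logical (character) order."""
    out = []
    pending_i = False    # an I-matra was seen and waits for its consonant group
    group_start = None   # index in `out` where the latest consonant group begins
    i = 0
    n = len(glyphs)
    while i < n:
        g = glyphs[i]
        if g == "I-matra":
            pending_i = True
            i += 1
        elif g in CONSONANTS:
            group_start = len(out)
            out.append(g)
            i += 1
            # A group continues as long as VIRAMA+consonant follows.
            while i + 1 < n and glyphs[i] == "VIRAMA" and glyphs[i + 1] in CONSONANTS:
                out.extend(glyphs[i:i + 2])
                i += 2
            if pending_i:
                # Rule 1: the I-matra goes after the last consonant of the group.
                out.append("I-matra")
                pending_i = False
        elif g == "REPH":
            if group_start is None:
                raise ValueError("REPH without a preceding consonant group")
            # Rule 2: prepend RA+VIRAMA before the leftmost consonant of the group.
            out[group_start:group_start] = ["RA", "VIRAMA"]
            i += 1
        else:
            out.append(g)  # other matras stay where they are
            i += 1
    return out


kitab = visual_to_logical(["I-matra", "KA", "TA", "A-matra(long)", "BA"])
murti = visual_to_logical(["MA", "U-matra(long)", "I-matra", "TA", "REPH"])
print("+".join(kitab))
print("+".join(murti))
```

An I-matra with no consonant after it is silently dropped here; a real extractor would have to decide what to do with it (that is the dotted-ring case).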

 There is a macro package that can do this with pdfTeX, and it is
 a vital part of my Tagged PDF work for mathematics.
 Also, I have an example where the CJK.sty package is extended
 to tag Chinese characters built from multiple glyphs so that
 Copy/Paste works correctly (modulo PDF reader quirks).

 Not sure about XeTeX.

 I once tried to talk with Jonathan Kew about what would be needed
 to implement this properly, but he got totally the wrong idea
 concerning glyphs and characters, and what was needed to be done
 internally and what by macros. The conversation went nowhere.

 Microsoft suggested
 what additional characters should appear in Indic OpenType fonts. One
 of them is a dotted ring which denotes a missing consonant. I-matra
 must always follow a consonant (in character order). If it is moved to
 the beginning of a word, it is wrong. If you paste it to a text
 editor, the OpenType rendering engine should display a missing
 consonant as a dotted ring (if it is present in the font). In
 character order the dotted ring will precede I-matra but in visual
 (glyph) order it will be just opposite. Thus the asterisk shows the
 place where you will see the dotted circle. This is just one simple
 case. I-matra may follow a consonant group, such as in word PRIY
 (dear) which is PA+VIRAMA+RA+I-matra+YA or STRIYOCIT (good for women)
 which is SA+VIRAMA+TA+VIRAMA+RA+I-matra+YA+O-matra+CA+I-matra+TA. Both
 words will start with the I-matra glyph. The latter will contain two
 ordering bugs after copypaste. Consider also word MURTI (statue)
 which is a sequence of characters

 This sounds like each word needs its own /ActualText .
 So some intricate programming is certainly necessary.
 But \XeTeXinterchartoks  (is that the right spelling?)
 should make this possible.

 MA+U-matra(long)+RA+VIRAMA+TA+I-matra. Visually the long U-matra will
 appear as an accent below the MA glyph. The next glyph will be I-matra
 followed by TA followed by RA shown as an upper accent at the right
 edge of the syllable. Generally in RA+VIRAMA+consonant+matra the RA
 glyph appears at the end of the syllable although logically (in
 character order) it belongs to the beginning. These cases cannot be
 solved by toUnicode map because many-to-many mappings are not

Re: [XeTeX] Whitespace in input

2011-11-18 Thread Ross Moore

Hi Zdenek,

On 19/11/2011, at 10:30 AM, Zdenek Wagner wrote:

 /ActualText is your friend here.
 You tag the content and provide the string that you want to appear
 with Copy/Paste as the value associated to a dictionary key.
 
 I do not know whether the PDF specification has evolved since I last
 read it. /ActualText allows only single-byte characters, i.e.
 those with codes between 0 and 255, not arbitrary Unicode characters.

That is most certainly not true.
You code up UTF-16BE as Hex strings.

Here is a snippet of the (tagged-pdfLaTeX) source coding from 
the main example that I showed in my  TUG2011 talk. 
The URL for the video of the talk is given in several of my previous emails:

\SMC attr{/ActualText<FEFFD835DC4F>\TPDFaloud{1D44F}} noendtext 254 {mi}%
  b%
_{\noEMC%
   \TPDFsub 
\SMC attr{/ActualText<FEFFD835DC58>\TPDFaloud{1D458}} noendtext 255 {mi}%
  k%
\EMC 
  }^{\EMC 
\SMC attr{/ActualText( )} noendtext 256 {Span}%
  \pdffakespace
\EMC 
  }%
\TPDFpopbrack 
\SMC attr{/ActualText<FEFF0029>\TPDFaloud{0029}} noendtext 257 {mo}%
  \Bigr)%


Inside the resulting PDF, this content looks like:

 1 0 0 1 4.902 2.463 cm
 /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )>>
 BDC
 BT
 /F11 9.9626 Tf
  [(b)]TJ
 ET
 EMC
 1 0 0 1 4.276 4.114 cm
 /Span <</MCID 11 /ActualText( )>>
 BDC
 BT
 /F103 1 Tf
  [( )]TJ
 ET
 EMC
 1 0 0 1 0 -6.577 cm
 /mi <</MCID 12 /ActualText<FEFFD835DC58>/Alt(  sub k ,  )>>
 BDC
 BT
 /F10 6.9738 Tf
  [(k)]TJ
 ET
 EMC
 1 0 0 1 4.901 2.463 cm
 /mo <</MCID 13 /Alt(  close bracket:,   , )>>
 BDC


The full PDF passes all of Adobe's validation tests for
correct PDF syntax, Accessible Content, and PDF/A-1b compliance.

More particularly:
 
  /mi <</MCID 10 /ActualText<FEFFD835DC4F>/Alt(  , b ,  )>>
  BDC
  BT
  /F11 9.9626 Tf
   [(b)]TJ
  ET
  EMC

expresses a math-italic 'b' as :

 1.  the glyph in the position of letter 'b' (in CMMI10  font);

 2.  to be spoken aloud as   , b ,   where commas indicate a slight pause

 3.  to Copy/Paste as the surrogate pair  U+D835 U+DC4F,
  equivalent to the Plane-1 math-italic character 'b' (U+1D44F).

The /MCID key is necessary for tagged PDF, but the /Alt and /ActualText
should work independently of full tagging.
The '/mi' is immaterial; it could equally well be  '/Span'. 
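
For anyone wanting to produce such hex strings themselves, a minimal Python sketch follows. The helper name actualtext_hex is made up for this example; the only thing it relies on is Python's standard "utf-16-be" codec, which emits no byte-order mark of its own, so FEFF is prepended by hand.

```python
def actualtext_hex(text):
    """Return `text` as an upper-case UTF-16BE hex string with a leading FEFF."""
    return "FEFF" + text.encode("utf-16-be").hex().upper()


# U+1D44F lies outside the BMP, so UTF-16 encodes it as a surrogate pair.
print(actualtext_hex("\U0001D44F"))  # FEFFD835DC4F
print(actualtext_hex(")"))           # FEFF0029
```

The first result is the string used for the math-italic 'b' above; the second is the closing bracket.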


 /ActualText is demonstrated on German hyphenated words such as Zucker,
 which is hyphenated as Zuk- ker. I have tried to put /ActualText in
 manually via a special; I could see it in the PDF file, but it did not
 work.

Yes, because it is quite important to position the tagging pieces
correctly within the PDF content stream. It has to balance correctly
with BT ... ET  and the BDC ... EMC  operator pairs, and there may
be other subtle requirements.

Certainly it cannot be done with just a single \special .
There needs to be stuff both before and after the content
that causes actual glyphs to be displayed.
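
A quick way to see whether a hand-made content stream is at least balanced is a small checker like the Python sketch below. It is a rough illustration, not a PDF parser: the function name is invented, operators are found by splitting on whitespace (so an operator name inside a string literal would be miscounted), and it checks nothing beyond the nesting of BDC/BMC ... EMC and BT ... ET.

```python
def pairs_are_balanced(stream):
    """Return True if marked-content and text-object operators nest properly."""
    openers = {"BDC": "marked", "BMC": "marked", "BT": "text"}
    closers = {"EMC": "marked", "ET": "text"}
    stack = []
    for token in stream.split():
        if token in openers:
            stack.append(openers[token])
        elif token in closers:
            # Each closer must match the most recently opened construct.
            if not stack or stack.pop() != closers[token]:
                return False
    return not stack


good = "/Span <</ActualText( )>> BDC BT /F103 1 Tf [( )]TJ ET EMC"
bad = "BT /Span <</ActualText( )>> BDC /F103 1 Tf [( )]TJ ET EMC"
print(pairs_are_balanced(good))  # True
print(pairs_are_balanced(bad))   # False
```

The second stream is the kind of thing a single \special in the middle of a text object tends to produce: BDC opens inside BT ... ET but EMC comes after ET.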


Just using \pdfliteral  is not sufficient with pdfTeX; we needed
a special modification that allowed the  /mi ...BDC 
and  EMC to fit snugly around the  BT ... ET .

There could be a similar problem with XeTeX's 
 \special{pdf:literal ... }  
(or whatever is the syntax).
This is the issue that I was trying to discuss with JK in 2009 or 2010.


 
 When converting white space to a space character, some [complex]
 heuristics are needed, while proper conversion of glyphs to characters
 of Indic scripts requires just a few strict rules. Ligatures such as TRA
 have to appear in the toUnicode map, otherwise their meaning will be
 unclear. If you see the I-matra, go to the last consonant in the
 sequence and put the I-matra character after it. If you see the RA glyph
 at the right edge of a syllable, go back to the leftmost consonant in
 the group and prepend RA+VIRAMA there. This is all that has to be done
 with Devanagari. Other Indic scripts contain two-part vowels but the
 rules will be similarly simple. We should not be forced to double the
 size of the PDF file. AR and other PDF rendering programs should learn
 these simple rules and use them when extracting text.

If you can provide the  UTF-16BE Hex representation of these,
I can create a PDF using it as the /ActualText  replacement for 
some arbitrary string of letters.

This will test whether this is a viable approach for Devanagari.
If so, then it is a matter of working out how to expand this
for a full solution.
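
As a sketch of what such a test might look like, the Python below wraps some arbitrary shown text in a /Span with an /ActualText for "kitab" (KA+I-matra+TA+A-matra(long)+BA in character order). The function name, the font name /F11 and the placeholder letters (xyz) are assumptions for illustration only; whether a given reader honours the result is exactly what the test would have to show.

```python
def span_with_actualtext(actual_text, shown_ops):
    """Wrap content-stream operators in /Span ... BDC / EMC with /ActualText."""
    hex_string = "FEFF" + actual_text.encode("utf-16-be").hex().upper()
    return "/Span <</ActualText <{}>>> BDC\n{}\nEMC".format(hex_string, shown_ops)


# Logical order for "kitab": U+0915 U+093F U+0924 U+093E U+092C.
kitab = "\u0915\u093f\u0924\u093e\u092c"

# What is actually drawn is immaterial here; (xyz) stands in for the glyphs.
fragment = span_with_actualtext(kitab, "BT\n/F11 9.9626 Tf\n[(xyz)]TJ\nET")
print(fragment)
```

The first line of the output carries the hex string FEFF0915093F0924093E092C, and the BDC ... EMC pair sits around the BT ... ET as described above.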


 
 There is a macro package that can do this with pdfTeX, and it is
 a vital part of my Tagged PDF work for mathematics.
 Also, I have an example where the CJK.sty package is extended
 to tag Chinese characters built from multiple glyphs so that
 Copy/Paste works correctly (modulo PDF reader quirks).
 
 Not sure about XeTeX.
 
 I once tried to talk with Jonathan Kew about what would be needed
 to implement this properly, but he got totally the wrong idea
 concerning glyphs and characters, and what was needed to be done
 internally and what by macros. The conversation went nowhere.

 -- 
 Zdeněk Wagner


Cheers,

[XeTeX] (OT) Re: TeX in the modern World. (goes OT) Was: Re: Whitespace in input

2011-11-18 Thread Keith J. Schultz

Hi Arthur,

No problem. You have my permission.

I was just judging from his comments. No offense meant.

As for me, I am almost 50 and have been around computers since the 80s.
My first was an Apple IIe; at the university we used a mainframe.

regards
Keith.
P.S. Want a signed version?


Am 18.11.2011 um 14:57 schrieb Arthur Reutenauer:

 On Fri, Nov 18, 2011 at 10:16:31AM +0100, Keith J. Schultz wrote in
 reply to Ross Moore:
  You are probably a little young to know this, but TeX's original output 
 format was a dvi file.
 
  I think I'll have this one framed and sent to Ross for his next
 birthday.
 
   Arthur
 
 




--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex
