Re: [XeTeX] weird behaviour with LetterSpace

2010-06-01 Thread Jonathan Kew
On 31 May 2010, at 22:13, Pablo Rodríguez wrote:

 Hi there,
 
 I have just accidentally discovered that LetterSpace behaves differently 
 depending on whether or not the whole paragraph is set with this feature.
 
 The minimal example:
 
 \documentclass[12pt]{article}
 \usepackage{fontspec}
 \setmainfont{Theano Didot}
 \begin{document}
 χαλεπὰ \addfontfeature{LetterSpace=12}τὰ καλά
 
 χαλεπὰ τὰ καλά
 
 χαλεπὰ \addfontfeature{LetterSpace=0}τὰ καλά
 
 Beauty \addfontfeature{LetterSpace=12}is difficult
 
 Beauty is difficult
 
 Beauty \addfontfeature{LetterSpace=0}is difficult
 \end{document}
 
 If you copy the resulting text (from 
 http://www.ousia.tk/wrong-letterspace.pdf), you will see that only the second 
 line is properly typeset, or at least, there are no blank spaces between 
 letters.
 
 I guess this might be a probable cause for wrong hyphenation when using 
 LetterSpace. (BTW, loading polyglossia makes no difference.)
 
 Have I hit a bug in LetterSpace? Do you know any way to avoid this?

The PDF looks correct to me; where LetterSpace=12 is in effect, the letters are 
more widely spaced, and where LetterSpace=0, they're not. I don't see a bug 
here. Or am I missing something?

If you're specifically concerned about what happens when you use a viewer to 
select and copy the text from this PDF into an editor... well... that's a 
chancy operation. It worked fine for me with Acrobat (no extra spaces), but 
other viewers may give different results. Basically, this is a poorly-defined 
operation. As TeX does not use space characters between words, there is no 
clear indication in the PDF data of where the word boundaries should be, and so 
the viewer has to guess based on the glyph positions. That works most of the 
time for simple running text, but modifying the letter spacing carries a pretty 
high risk of confusing it.
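
To make the guessing concrete, here is a minimal Python sketch of a gap-based
word-boundary heuristic. It is only an illustration under assumed numbers, not
the algorithm used by Acrobat, evince, or any other viewer: the function names,
the glyph width of 6 units, the word gap of 3 units, and the 0.25 threshold
are all hypothetical.

```python
# Illustrative heuristic only; not taken from any real PDF viewer.

def extract_text(glyphs, gap_ratio=0.25):
    """Rebuild text from (char, x_start, x_end) glyph boxes on one line.

    A space is emitted wherever the horizontal gap between two neighbouring
    glyphs exceeds gap_ratio times the average glyph width.
    """
    if not glyphs:
        return ""
    avg_width = sum(x1 - x0 for _, x0, x1 in glyphs) / len(glyphs)
    out = [glyphs[0][0]]
    for (_, _, prev_end), (ch, start, _) in zip(glyphs, glyphs[1:]):
        if start - prev_end > gap_ratio * avg_width:
            out.append(" ")
        out.append(ch)
    return "".join(out)


def layout(text, glyph_width=6.0, word_gap=3.0, letter_space=0.0):
    """Place glyphs left to right; spaces in `text` become gaps, not glyphs,
    mimicking the way TeX output carries no space characters."""
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += word_gap
            continue
        glyphs.append((ch, x, x + glyph_width))
        x += glyph_width + letter_space
    return glyphs


normal = extract_text(layout("is difficult"))
spaced = extract_text(layout("is difficult", letter_space=2.0))
print(normal)
print(spaced)
```

With no extra spacing the heuristic recovers "is difficult". With 2 units of
letterspacing, every gap inside a word also crosses the threshold, so the
result is "i s d i f f i c u l t".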

As I see it, PDF was not really designed to be an interchange medium for text; 
it's designed to convey the graphical appearance of the page. Extracting the 
underlying text from the glyphs on the page is an afterthought that has never 
been 100% reliable. Added features such as /ActualText can help, but xetex does 
not currently support the automatic generation of /ActualText in the PDF output 
-- and I'd be reluctant to add it, considering how much it would bloat the 
output.

Basically, if you want to get the text reliably, you shouldn't be starting from 
the PDF! :)

JK




--
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex


Re: [XeTeX] weird behaviour with LetterSpace

2010-06-01 Thread Pablo Rodríguez

On 06/01/2010 10:25 AM, Jonathan Kew wrote:

 On 31 May 2010, at 22:13, Pablo Rodríguez wrote:

  [...]
  If you copy the resulting text (from 
  http://www.ousia.tk/wrong-letterspace.pdf), you will see that only the second 
  line is properly typeset, or at least, there are no blank spaces between 
  letters.

  I guess this might be a probable cause for wrong hyphenation when using 
  LetterSpace. (BTW, loading polyglossia makes no difference.)

  Have I hit a bug in LetterSpace? Do you know any way to avoid this?

 The PDF looks correct to me; where LetterSpace=12 is in effect, the letters are 
 more widely spaced, and where LetterSpace=0, they're not. I don't see a bug 
 here. Or am I missing something?


Thanks for your reply, Jonathan.

My real concern is not LetterSpace as such, but hyphenation when 
LetterSpace is in use (as you can see at http://www.ousia.tk/grammatike.pdf).

I thought the issue described above might be causing the wrong 
hyphenation, but I was mistaken.



 If you're specifically concerned about what happens when you use a viewer to select and 
 copy the text from this PDF into an editor... well... that's a chancy operation. It 
 worked fine for me with Acrobat (no extra spaces), but other viewers may give different 
 results. Basically, this is a poorly-defined operation. As TeX does not use space 
 characters between words, there is no clear indication in the PDF data of where the 
 word boundaries should be, and so the viewer has to guess based on the glyph positions. 
 That works most of the time for simple running text, but modifying the letter spacing 
 carries a pretty high risk of confusing it.


BTW, acroread-9.3 in Ubuntu-10.04 copies the following text (the same 
text that evince 2.30 copies):


χαλεπὰ τ ὰ κ α λ ά
χ α λ ε π ὰ τ ὰ κ α λ ά
χ α λ ε π ὰ τὰ καλά
Beauty i s d i ffi c u l t
B e a u t y i s d i ffi c u l t
B e a u t y is difficult

The broader problem with LetterSpace is not text extraction as such, but 
the ability to search the PDF for a given text.
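
For text that has already been copied out of the PDF, one possible workaround
is to compare strings with all whitespace removed. The Python sketch below is
not something proposed in this thread, and it does not change how a viewer's
own search works; the helper names are hypothetical, and the sample string
assumes the viewer emitted the "ffi" ligature as the single character U+FB03.

```python
import unicodedata


def squash(text):
    """Expand ligatures (NFKC), fold case, and drop all whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return "".join(folded.split())


def found_in(query, extracted):
    """True if query occurs in extracted once spacing differences are ignored."""
    return squash(query) in squash(extracted)


# Assumed sample: a letterspaced line as a viewer might copy it.
copied = "B e a u t y i s d i \ufb03 c u l t"
print(found_in("is difficult", copied))
print(found_in("Beauty is difficult", copied))
```

Both calls report a match. The price is that word boundaries are ignored
altogether, so a query can also match across two adjacent words.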


Thanks for your help,


Pablo

