2016-03-17 19:02 GMT+01:00 Pierpaolo Bernardi <olopie...@gmail.com>:
> On Thu, Mar 17, 2016 at 6:37 PM, Leonardo Boiko <leobo...@namakajiri.net> wrote:
> > The PDF *displays* correctly. But try copying the string 'ti' from
> > the text to another application outside of your PDF viewer, and you'll
> > see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don
> > Osborn said.
>
> Ah. OK. Anyway this is not a Unicode problem. PDF knows nothing about
> Unicode. It uses the encoding of the fonts used.
That's correct; however, the PDF specs also contain guidelines for naming the glyphs in fonts in such a way that the encoding can be deciphered. This is needed, for example, in applications such as PDF forms where user input is expected.

When those PDFs are generated from rich text, the fonts used may be TrueType (which carry no glyph names at all, only mappings from sequences of code points to glyphs), OpenType, or PostScript. When OpenType fonts contain PostScript glyphs, the glyph names may be completely arbitrary; it does not even matter whether the font was mapped to Unicode or used a legacy or proprietary encoding.

If you see a "Ɵ" when copy-pasting from the PDF, it's because the font used to produce it did not follow these guidelines (or did not specify any glyph names at all, in which case the extractor falls back to a sort of OCR algorithm that attempts to decipher the glyph: the "ti" ligature is visually very close to "Ɵ", and an OCR has great difficulty distinguishing them unless it also uses dictionary lookups and hints from the script of the surrounding characters to improve the guess).

Note that PDFs (or DjVu files) are not required to contain any text at all: they may simply embed scanned, compressed bitmap images. If you want to see how wrong an OCR can be, look at how it fails, with lots of errors, in projects such as the transcription efforts on Wikibooks that work from scanned bitmaps of old books. OCR is just a helper; there is still a lot of work to correct what it has guessed and re-encode the correct text. Even though humans are smarter than OCR, this is a lot of work to perform manually: encoding the text of a single scanned old book still takes an experienced editor one or two months, and many errors remain for someone else to review later.

Most PDFs were simply not created with the idea that their rendered text would be decoded later.
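To make the glyph-naming guidelines concrete, here is a minimal sketch (my own illustration, not code from the PDF spec) of the Adobe Glyph List-style heuristic a text extractor can apply: known names map through a table, "uniXXXX"/"uXXXXXX" names encode code points directly, underscores mark ligature components, and anything else is undecipherable. The tiny AGL_SUBSET table here is an assumption standing in for the real Adobe Glyph List.

```python
import re

# Tiny illustrative stand-in for the real Adobe Glyph List table.
AGL_SUBSET = {"t": "t", "i": "i", "fi": "\ufb01"}

def name_component_to_text(comp):
    # "uni" followed by one or more groups of 4 uppercase hex digits.
    if re.fullmatch(r"uni(?:[0-9A-F]{4})+", comp):
        hexes = comp[3:]
        return "".join(chr(int(hexes[i:i + 4], 16))
                       for i in range(0, len(hexes), 4))
    # "u" followed by 4 to 6 hex digits: a single code point.
    if re.fullmatch(r"u[0-9A-F]{4,6}", comp):
        return chr(int(comp[1:], 16))
    # Otherwise the name must be in the glyph list, or it is arbitrary.
    return AGL_SUBSET.get(comp)

def glyph_name_to_text(name):
    # Ligature glyphs are named with components joined by underscores,
    # e.g. "t_i" for a "ti" ligature.
    parts = [name_component_to_text(c) for c in name.split("_")]
    if any(p is None for p in parts):
        return None  # arbitrary name: the viewer is reduced to guessing
    return "".join(parts)

print(glyph_name_to_text("t_i"))          # → ti
print(glyph_name_to_text("uni00740069"))  # → ti
print(glyph_name_to_text("g123"))         # → None
```

A font whose "ti" ligature is named "t_i" or "uni00740069" copy-pastes correctly; one whose glyph is named "g123" forces the extractor into the shape-guessing fallback that confuses "ti" with "Ɵ".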
In fact they were intended to be read or printed "as is", with their styles, colors, and font decorations everywhere, or text laid over photos. They were even designed to be non-modifiable and then used for archival.

Some PDF tools will also strip additional metadata from the PDF, such as the original fonts used. Instead, these PDFs locally embed pseudo-fonts containing sets of glyphs drawn from various fonts (in mixed styles), in random order, or sorted by frequency of use in the document, or by order of occurrence in the original text. These embedded fonts are generated on the fly to contain only the glyphs the document needs. When they are generated, a compression step drops many things from the original font, including metadata such as the original PostScript glyph names.
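As a rough sketch of why such subsetted text is unrecoverable: a hypothetical subsetting embedder (illustrative only, not any particular tool's algorithm) might assign each distinct glyph the next free code in order of first occurrence. The stored codes then resemble no standard encoding, and once the reverse table (the ToUnicode mapping) is discarded, nothing ties them back to characters.

```python
# Hypothetical subset encoder: each distinct glyph gets the next free
# code in order of first occurrence in the document.
def subset_encode(text):
    code_of = {}   # glyph -> newly assigned code
    encoded = []
    for ch in text:
        if ch not in code_of:
            code_of[ch] = len(code_of) + 1  # assigned on first use
        encoded.append(code_of[ch])
    return encoded, code_of

encoded, table = subset_encode("hello")
print(encoded)  # → [1, 2, 3, 3, 4]
# The PDF stores only these codes plus the glyph shapes; without the
# reverse of `table` (a ToUnicode CMap), [1, 2, 3, 3, 4] cannot be
# mapped back to "hello" except by OCR-style shape guessing.
```

The same input always renders identically, which is all the PDF promises; decoding it back to text was never part of the contract.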