I agree that it's a real-world problem -- PDFs really should be searchable -- but I do not see that it's a Unicode issue. Unicode defines the basic building blocks of LATIN SMALL LETTER T and LATIN SMALL LETTER I; that's its job. Unicode is not responsible for font construction or creating PDF software. Furthermore, even if Unicode did want to do something about it, I can't imagine what that could be -- aside perhaps from using its bully pulpit to urge PDF creators and font creators to do their jobs better.

The fact that some PDF apps do not search or copy/paste text correctly when glyphs for unencoded characters are given Private Use Area (PUA) values has been known for many years. In the case of Calibri, I looked at the font (the version installed on my Win7 system) and found that the 'ti' ligature is named t_i, which follows good glyph-naming practice, and that it does not have a PUA assignment. Given this, any well-constructed PDF app should be able to decode the ligature correctly as the two characters "t" and "i".
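As a rough illustration of why a name like t_i is decodable, here is a minimal Python sketch of the usual glyph-name convention (the one described in the Adobe Glyph List specification): drop any suffix after the first period, split the rest on underscores, and map each component to text. The function names and the tiny AGL_SUBSET table are assumptions made up for this example; a real implementation would load the full glyph list, and nothing here is taken from any particular PDF app.

```python
import re

# Hypothetical stand-in for the Adobe Glyph List; the real list has
# thousands of entries. Only a few names are included for illustration.
AGL_SUBSET = {"f": "f", "i": "i", "l": "l", "t": "t", "space": " "}


def component_to_text(component):
    """Map one underscore-separated glyph-name component to text."""
    if component in AGL_SUBSET:
        return AGL_SUBSET[component]
    # "uni" followed by one or more groups of four uppercase hex digits.
    m = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", component)
    if m:
        digits = m.group(1)
        return "".join(
            chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4)
        )
    # "u" followed by four to six uppercase hex digits (one code point).
    m = re.fullmatch(r"u([0-9A-F]{4,6})", component)
    if m:
        code_point = int(m.group(1), 16)
        if code_point <= 0x10FFFF:
            return chr(code_point)
    # Unknown component: contributes no text.
    return ""


def glyph_name_to_text(name):
    """Decode a glyph name such as 't_i' or 'f_f_i.liga' to its text."""
    base = name.split(".", 1)[0]  # drop a suffix like ".liga" or ".alt"
    return "".join(component_to_text(c) for c in base.split("_"))


print(glyph_name_to_text("t_i"))
```

Under these assumptions a ligature glyph named t_i decodes to the two characters "t" and "i" with no PUA code point involved, which is what search and copy/paste need.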

David

On 5/6/2016 11:49 AM, Steve Swales wrote:
This discussion seems to have fizzled out, but I’m concerned that
there’s a real world problem here which is at least partially the
concern of the consortium, so let me stir the pot and see if there’s
still any meat left.

On the current release of MacOS (including the developer beta, for
your reference, Peter), if you use Calibri font, for example, in any
app (e.g. notes), to write words with “ti” (like
internationalization), then press “Print” and “Open PDF in Preview”,
you get a PDF document with the joined “ti”.  Subsequently cutting and
pasting produces mojibake, and searching the document for words
with “ti” doesn’t work, as previously noted.

I suppose we can look on this as purely a font handling/MacOS bug, but
I’m wondering if we should be providing accommodations or conveniences
in Unicode for it to work as desired.

-steve
