Thanks Andrew. Looking at the issue of the ToUnicode mapping you mention: in the 1-many mapping of ligatures (for fonts that have them), why do the "many" not simply consist of the characters that were ligated? Maybe that's too simple (my understanding of the process is clearly inadequate).

The "string of random ASCII characters" (per Leonardo) used in the Identity H system for hanzi raises other questions: (1) How are the ASCII characters interpreted as a 1-many sequence representing a hanzi, rather than just a series of 1-1 mappings of themselves? (2) Why not just use the Unicode code point?

The details may or may not be relevant to the list topic, but as a user of documents in PDF format, I fail to see the benefit of such obscure mappings. And as a creator of PDFs ("save as") looking at others' PDFs I've just encountered with these mappings, I wonder how the font and mapping results turned out as they did. It is certain that the creators of the documents didn't intend results that would not be searchable as normal text, but it seems possible that a particular font choice with these ligatures unwittingly produced these results. If the latter, the software should at the very least show a caveat about such mappings when generating PDFs.

Maybe it's unrealistic to expect a simple implementation of Unicode in PDFs (a topic we've discussed before, though I admit to not fully grasping it). I recall once getting some wild results copy/pasting from an N'Ko PDF, and ended up having to obtain the .docx original to get text for insertion in a blog posting. But while it's not surprising to encounter issues with complex non-Latin scripts in PDFs, I'd come to expect predictability when dealing with most Latin text.

Don



On 3/17/2016 7:34 PM, Andrew Cunningham wrote:

There are a few things going on.

In the first instance, it may be the font itself that is the source of the problem.

My understanding is that PDF files contain a sequence of glyphs. A PDF file will contain a ToUnicode mapping between glyphs and code points. This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides support for ligatures and variation sequences.
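For illustration, such a 1-many ToUnicode entry is just a glyph ID mapped to a sequence of UTF-16BE code units; a minimal sketch (the glyph ID here is hypothetical):

```
% Hypothetical ToUnicode CMap fragment: glyph <0123> maps to the
% two-character sequence "ti" (U+0074 U+0069 in UTF-16BE), so text
% extraction recovers "ti" rather than a private-use character.
1 beginbfchar
<0123> <00740069>
endbfchar
```

When the PDF generator emits an entry like this for the ligature glyph, copy/paste and search come out as plain "ti"; when it instead maps the glyph to a private-use code point, or omits the mapping, you get results like those described in this thread.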

I assume it uses the data in the font's cmap table. If the ligature isn't mapped then you will have problems. I guess the problem could be either the font or the font subsetting and embedding performed when the PDF is generated.

Although, it is worth noting that in OpenType fonts not all glyphs will have mappings in the cmap table.

The remedy is to extensively tag the PDF and add ActualText attributes to the tags.

But the PDF specs leave it up to the developer to decide what happens when there is both a visible text layer and ActualText. So even in an ideal PDF, results will vary from software to software when copying text or searching a PDF.

At least that's my current understanding.

Andrew

On 18 Mar 2016 7:47 am, "Don Osborn" <d...@bisharat.net> wrote:

    Thanks all for the feedback.

    Doug, it may well be my clipboard (running Windows 7 on this
    particular laptop). I get the same results pasting into Word and
    EmEditor.

    So, when I did a web search on "internaƟonal," as previously
    mentioned, and came up with a lot of results (mostly PDFs), were
    those also a consequence of many not fully Unicode-compliant
    conversions by others?

    A web search on what you came up with - "Interna􀆟onal" - yielded
    many more (82k+) results, again mostly PDFs, with terms like
    "interna onal" (such as what Steve noted) and "interna<onal" and
    perhaps others (given the nature of, or how Google interprets, the
    private use character?).

    Searching within the PDF document already mentioned,
    "international" comes up with nothing (which is a major fail as
    far as usability). Searching the PDF in a Firefox browser window,
    only "internaƟonal" finds the occurrences of what displays as
    "international." However after downloading the document and
    searching it in Acrobat, only a search for "interna􀆟onal" will
    find what displays as "international."

    A separate web search on "Eīects" came up with 300+ results,
    including some GoogleBooks which in the texts display "effects"
    (as far as I checked). So this is not limited to Adobe?

    Jörg, With regard to "Identity H," a quick search gives the
    impression that this encoding has had a fairly wide and not so
    happy impact, even if on the surface level it may have facilitated
    display in a particular style of font in ways that no one
    complains about.

    Altogether a mess, from my limited encounter with it. There must
    have been a good reason for or saving grace of this solution?

    Don

    On 3/17/2016 2:17 PM, Steve Swales wrote:

        Yes, it seems like your mileage varies with the PDF
        viewer/interpreter/converter.  Text copied from Preview on the
        Mac replaces the ti ligature with a space.  Certainly not a
        Unicode problem, per se, but an interesting problem nevertheless.

        -steve

            On Mar 17, 2016, at 11:11 AM, Doug Ewell <d...@ewellic.org> wrote:

            Don Osborn wrote:

                Odd result when copy/pasting text from a PDF: For some
                reason "ti" in
                the (English) text of the document at
                
http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
                is coded as "Ɵ". Looking more closely at the original
                text, it does
                appear that the glyph is a "ti" ligature (which afaik
                is not coded as
                such in Unicode).

            When I copy and paste the PDF text in question into
            BabelPad, I get:

                Interna􀆟onal Order and the Distribu􀆟on of Iden􀆟ty
                in 1950 (By
                invita􀆟on only)

            The "ti" ligatures are implemented as U+10019F, a Plane 16
            private-use
            character.

            Truncating this character to 16 bits, which is a Bad
            Thing™, yields
            U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it
            looks like either
            Don's clipboard or the editor he pasted it into is not fully
            Unicode-compliant.
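The truncation Doug describes can be sketched in a few lines of Python:

```python
import unicodedata

# The PDF maps the "ti" ligature glyph to U+10019F, a Plane 16
# private-use code point. A non-Unicode-compliant clipboard or editor
# that keeps only the low 16 bits turns it into U+019F.
pua = 0x10019F
truncated = pua & 0xFFFF

print(f"U+{truncated:04X}")              # U+019F
print(unicodedata.name(chr(truncated)))  # LATIN CAPITAL LETTER O WITH MIDDLE TILDE
```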

            Don's point about using alternative characters to
            implement ligatures,
            thereby messing up web searches, remains valid.

            --
            Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




