On 15/09/15 01:23, Jonathan Kew wrote:
> On 14/9/15 16:40, Rob Hawkins wrote:
>> Thank you all for these great replies. I find the stuff about the
>> Unicode encoding order really interesting. And I too wish we could find
>> more information about the as-yet unmapped Asian scripts.
>>
>> I was mistaken about the output of PDF.js. I thought I had viewed the
>> HTML source and seen good data, how exciting! Yet now that I
>> double-check, I see it is just the viewer that is correct, and the
>> source text is garbled just like pdftotext etc.
>>
>> I'm bummed there is no magic solution here as I thought I had found, but
>> glad to see people are still interested in this. If I find out how to
>> implement these languages, I will try.
>
> I think what you're looking for is the ActualText feature in PDF. If
> this is present, a viewer or text-extraction tool can use it to provide
> the correct text, instead of trying to reconstruct the text from the
> stream of glyphs in the PDF -- which, while it often works OK for
> European languages and similar "simple" writing systems, is pretty much
> doomed to failure for complex South/Southeast Asian scripts, etc.
>
> But this is dependent on the PDF-generating tool or workflow including
> the correct ActualText attributes in the first place. In my (very
> limited) experience, this is pretty rare.
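
For readers unfamiliar with ActualText, here is a minimal sketch of what such an entry looks like in a page content stream: a marked-content span whose property list carries the replacement text as a UTF-16BE string with a byte-order mark. The function name `actual_text_span` and the glyph operators in the usage note are hypothetical and only for illustration; this is not how poppler or cairo implement it.

```python
def actual_text_span(text: str, content_ops: bytes) -> bytes:
    """Wrap drawing operators in a /Span marked-content sequence.

    The ActualText value is written as a PDF hex string holding
    UTF-16BE text prefixed with the FEFF byte-order mark.
    """
    encoded = b"\xfe\xff" + text.encode("utf-16-be")
    hex_string = encoded.hex().upper().encode("ascii")
    return (
        b"/Span <</ActualText <" + hex_string + b">>> BDC\n"
        + content_ops
        + b"\nEMC\n"
    )
```

Calling `actual_text_span("\u1000\u1031", b"BT /F1 12 Tf <00120034> Tj ET")` wraps the (made-up) glyph-showing operators so that an extractor honouring ActualText can return the Burmese syllable in its coded order, regardless of the order the glyphs are drawn in.
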

Poppler has supported ActualText when extracting text since 2008. I added
this to poppler when I added ActualText generation to cairo. Application
support for this appears to be rare. I'm not aware of any cairo
application that uses the cairo_show_text_glyphs() API for generating
ActualText entries.

>
> JK
>
>> Alternatively, can we band
>> together to destroy PDFs everywhere? If we work in concert it may be
>> possible. =)
>>
>> Thanks again,
>>
>> Rob
>>
>> On Mon, Sep 14, 2015 at 9:22 PM, suzuki toshiya
>> <mpsuz...@hiroshima-u.ac.jp> wrote:
>>
>>     Dear Rob,
>>
>>     Poppler extracts the text from a PDF via the series of glyphs.
>>     Therefore, for scripts whose characters Unicode encodes in
>>     visible order, the first step of the text extraction is
>>     possible.
>>
>>     However, some Asian scripts, especially Brahmic-based scripts,
>>     have very complicated layout rules, so the encoding order
>>     in Unicode text is phonetic and different from the visible
>>     order (e.g. coded characters are in consonant-then-vowel order,
>>     but the displayed characters are in vowel-then-consonant order).
>>
>>     In such cases, the character series extracted via the glyph
>>     series is not good coded text.
>>
>>     I'm not sure which script you assume for Indonesian (Latin?
>>     Javanese? Balinese?), but, among the Thai, Burmese and Khmer
>>     scripts, only Thai is coded in visible order. The other scripts
>>     have the vowel-then-consonant encoding issue, so it is not easy
>>     for Poppler to extract the text as correct "Unicode" text.
>>     Therefore, the result you have (Thai is OK, others are not)
>>     sounds reasonable.
>>
>>     I'm unfamiliar with the bleeding-edge technology in the latest
>>     PDF about how to deal with such complex scripts (I guess PDF
>>     developers are willing to support such), but the PDFs made
>>     by old PDF production software may have a similar problem.
>>
>>     I wish some Adobe experts would comment on the situation in the
>>     latest PDF for complex scripts :-)
>>
>>     Regards,
>>     mpsuzuki
>>
>>     Rob Hawkins wrote:
>>     > Greetings all,
>>     >
>>     > Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
>>     > Vietnamese? I didn't see a language pack for any except Thai, and that one
>>     > doesn't produce properly formatted characters for my source files. They're
>>     > missing the vowel marks. The other languages fail completely on my setup.
>>     > I've tried on OS X and Ubuntu 12.
>>     >
>>     > My source files are here:
>>     > https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
>>     >
>>     > Chinese seems to work fine.
>>     >
>>     > I found out that PDF.js will produce good output, though I already have
>>     > code based on pdftohtml output and would rather not switch if not
>>     > necessary. I wonder if there is something wrong with my setup.
>>     >
>>     > Thanks for any help even if it's just a "nope, that's not possible" kind of
>>     > reply =)
>>     >
>>     > Rob
>>     >
>>     > _______________________________________________
>>     > poppler mailing list
>>     > poppler@lists.freedesktop.org
>>     > http://lists.freedesktop.org/mailman/listinfo/poppler
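
The visible-order versus phonetic-order problem described in the quoted message from suzuki toshiya can be illustrated with a toy sketch. It assumes glyphs have already been mapped back to single code points, and it handles only the simplest case: a pre-base vowel sign drawn immediately to the left of one consonant. The names `PREBASE_VOWELS` and `visual_to_logical` are hypothetical, and real Burmese or Khmer text (medials, stacked consonants, split vowels) needs far more than this, which is why glyph-stream extraction tends to fail for these scripts.

```python
# Vowel signs that are drawn to the LEFT of the consonant but are
# encoded AFTER it in Unicode (a small, deliberately incomplete set).
PREBASE_VOWELS = {
    "\u1031",  # MYANMAR VOWEL SIGN E
    "\u17c1",  # KHMER VOWEL SIGN E
    "\u17c2",  # KHMER VOWEL SIGN AE
    "\u17c3",  # KHMER VOWEL SIGN AI
}


def visual_to_logical(glyph_order: str) -> str:
    """Swap each pre-base vowel sign with the character that follows it.

    Input is text in drawing (left-to-right visual) order; output is a
    naive guess at the Unicode coded order.
    """
    out = []
    i = 0
    n = len(glyph_order)
    while i < n:
        ch = glyph_order[i]
        if ch in PREBASE_VOWELS and i + 1 < n:
            # Emit the following consonant first, then the vowel sign.
            out.append(glyph_order[i + 1])
            out.append(ch)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Thai needs no such step: SARA E (U+0E40) is encoded before the consonant, matching the drawing order, so Thai input passes through this function unchanged. That is consistent with the observation in the thread that Thai extracts mostly correctly while Burmese and Khmer do not.
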