Re: [poppler] Working with Asian languages

2015-09-14 Thread Jason Crain
On 2015-09-13 20:06, Rob Hawkins wrote: Greetings all, Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and Vietnamese? I didn't see a language pack for any except Thai, and that one doesn't produce properly formatted characters for my source files. They're missing the vowel

Re: [poppler] Working with Asian languages

2015-09-14 Thread suzuki toshiya
Dear Rob, Poppler extracts text from a PDF as a series of glyphs. Therefore, for scripts whose characters Unicode encodes in visual order, the first step of text extraction is possible. However, some Asian scripts, especially Brahmic-based scripts, have very complicated layout
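
To illustrate the point about encoding order, here is a minimal sketch in Python. It is not Poppler code, and the names `PREPOSED_VOWELS` and `visual_to_logical` are hypothetical. It assumes an extractor that emits code points in the left-to-right order of the glyphs on the page. Thai is encoded in Unicode in visual order, so that output is already correct. Khmer and Myanmar are encoded in logical order, so a vowel sign drawn to the left of its consonant has to be moved after it. Real text also involves subscript consonants, medials and multi-part vowels, which this sketch ignores.

```python
# Hypothetical illustration only; this is not how Poppler is implemented.

# Vowel signs that are drawn to the LEFT of their consonant but stored
# AFTER it in Unicode (logical order).
PREPOSED_VOWELS = {
    "\u17C1",  # KHMER VOWEL SIGN E
    "\u17C2",  # KHMER VOWEL SIGN AE
    "\u17C3",  # KHMER VOWEL SIGN AI
    "\u1031",  # MYANMAR VOWEL SIGN E
}


def visual_to_logical(glyph_order: str) -> str:
    """Move each left-drawn vowel sign behind the character that follows it.

    Simplifying assumption: the vowel attaches to exactly one following
    character, with no subscript consonants or medials in between.
    """
    out = []
    pending = None  # a pre-posed vowel waiting for its consonant
    for ch in glyph_order:
        if ch in PREPOSED_VOWELS:
            if pending is not None:
                out.append(pending)  # two in a row: emit the first unchanged
            pending = ch
        else:
            out.append(ch)
            if pending is not None:
                out.append(pending)
                pending = None
    if pending is not None:
        out.append(pending)  # trailing vowel with no consonant after it
    return "".join(out)


if __name__ == "__main__":
    khmer_visual = "\u17C1\u1780"   # glyph order on the page: sign E, then KA
    thai_visual = "\u0E40\u0E01"    # SARA E, then KO KAI (already valid Thai)

    for text in (khmer_visual, thai_visual):
        fixed = visual_to_logical(text)
        print([f"U+{ord(c):04X}" for c in text], "->",
              [f"U+{ord(c):04X}" for c in fixed])
```

The Thai string passes through unchanged because its leading vowels are not in the set, which matches the observation that visual-order scripts survive glyph-order extraction while logical-order scripts need a reordering step.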

Re: [poppler] Working with Asian languages

2015-09-14 Thread Rob Hawkins
Thank you all for these great replies. I find the stuff about the unicode encoding order really interesting. And I too wish we could find more information about the as-yet unmapped Asian scripts. I was mistaken about the output of PDF.js. I thought I had viewed the HTML source and seen good

Re: [poppler] Working with Asian languages

2015-09-14 Thread Adrian Johnson
On 15/09/15 01:23, Jonathan Kew wrote:
> On 14/9/15 16:40, Rob Hawkins wrote:
>> Thank you all for these great replies. I find the stuff about the
>> unicode encoding order really interesting. And I too wish we could find
>> more information about the as-yet unmapped Asian scripts.
>>
>> I was

Re: [poppler] Working with Asian languages

2015-09-14 Thread Jonathan Kew
On 14/9/15 16:40, Rob Hawkins wrote: Thank you all for these great replies. I find the stuff about the unicode encoding order really interesting. And I too wish we could find more information about the as-yet unmapped Asian scripts. I was mistaken about the output of PDF.js. I thought I had

[poppler] Working with Asian languages

2015-09-13 Thread Rob Hawkins
Greetings all, Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and Vietnamese? I didn't see a language pack for any except Thai, and that one doesn't produce properly formatted characters for my source files. They're missing the vowel marks. The other languages fail