Re: [poppler] Working with Asian languages

2015-09-14 Thread Jason Crain

On 2015-09-13 20:06, Rob Hawkins wrote:
> Greetings all,
>
> Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
> Vietnamese?  I didn't see a language pack for any except Thai, and
> that one doesn't produce properly formatted characters for my source
> files.  They're missing the vowel marks.  The other languages fail
> completely on my setup.  I've tried on OS X and Ubuntu 12.
>
> My source files are here:
> https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf
>
> Chinese seems to work fine.
>
> I found out that PDF.js will produce good output, though I already
> have code based on pdftohtml output and would rather not switch if not
> necessary.  I wonder if there is something wrong with my setup.
>
> Thanks for any help even if it's just a "nope, that's not possible"
> kind of reply =)
>
> Rob


pdftohtml can work with those languages, but it depends on the ability to
extract the plain text from the document.  The couple of PDFs I've
looked at have problems with text extraction.  Possibly poppler
could do a better job, but since several applications I tried have problems
extracting text from those documents, it's probably just a problem with
the documents themselves.

I assume that PDF.js just works in a different way and doesn't require
the extracted text to be correct.
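
One rough way to check whether a particular PDF has this kind of extraction
problem is to run pdftotext on it and count which scripts actually show up in
the output.  The sketch below assumes pdftotext is on the PATH;
extract_text and script_histogram are hypothetical helper names for
illustration, not anything poppler ships.

```python
import subprocess
import unicodedata
from collections import Counter


def extract_text(pdf_path: str) -> str:
    """Run poppler's pdftotext on pdf_path and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def script_histogram(text: str) -> Counter:
    """Count non-space characters by the first word of their Unicode name.

    For most letters that first word is the script (THAI, KHMER, MYANMAR,
    LATIN, ...).  Characters without a name, such as private-use code
    points, are counted as UNNAMED.
    """
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "")
        counts[name.split(" ")[0] if name else "UNNAMED"] += 1
    return counts
```

Calling script_histogram(extract_text("some.pdf")) on a Khmer document and
getting mostly LATIN or UNNAMED counts would suggest the PDF's fonts carry no
usable mapping back to Unicode.  It says nothing about whether the characters
that do come out are in the right order.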
___
poppler mailing list
poppler@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/poppler


Re: [poppler] Working with Asian languages

2015-09-14 Thread suzuki toshiya
Dear Rob,

Poppler extracts the text from a PDF via the series of glyphs.
Therefore, for scripts whose characters Unicode encodes in
visible order, the first step of the text extraction is
possible.

However, some Asian scripts, especially Brahmic-based scripts,
have very complicated layout rules, so the encoding order
in Unicode text is phonetic and differs from the visible
order (e.g. the coded characters are in consonant-then-vowel order,
but the displayed characters are in vowel-then-consonant order).

In such cases, the character series extracted via the glyph series
is not correctly ordered coded text.
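
To make the ordering difference concrete, here is a minimal sketch that
moves a vowel sign drawn to the left of a consonant back to the position
Unicode expects.  The table is a small illustrative subset of pre-base
vowel signs, and visual_to_logical is a hypothetical helper, not Poppler's
algorithm.  It ignores consonant clusters, medials and split vowels, so it
is far from a real reordering implementation.

```python
# Vowel signs that are drawn to the LEFT of the consonant but are encoded
# AFTER it in Unicode (illustrative subset only).
PREBASE_VOWELS = {
    "\u1031",  # MYANMAR VOWEL SIGN E
    "\u17C1",  # KHMER VOWEL SIGN E
    "\u17C2",  # KHMER VOWEL SIGN AE
    "\u17C3",  # KHMER VOWEL SIGN AI
}


def visual_to_logical(glyph_order: str) -> str:
    """Move each pre-base vowel to just after the next character.

    Input is a string in left-to-right glyph order, as a naive extractor
    would produce it.  Output is the same characters in encoding order,
    under the simplifying assumption of one consonant per syllable.
    """
    out = []
    pending = []  # pre-base vowels seen but not yet placed
    for ch in glyph_order:
        if ch in PREBASE_VOWELS:
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # trailing vowels with nothing to attach to
    return "".join(out)
```

For Khmer, glyph order U+17C1 U+1780 (vowel sign E, then KA) becomes
U+1780 U+17C1.  Thai U+0E40 U+0E01 (SARA E, then KO KAI) is left alone,
because Thai is already encoded in visible order.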

I'm not sure which script you assume for Indonesian (Latin?
Javanese? Balinese?), but among the Thai, Burmese and Khmer scripts,
only the Thai script is coded in visible order. The other scripts
have the vowel-then-consonant encoding issue, so it is not easy
for Poppler to extract the text as correct "Unicode" text.
Therefore, the result you have (Thai is OK, others are not)
sounds reasonable.

I'm unfamiliar with the bleeding-edge technology in the latest
PDF for dealing with such complex scripts (I guess PDF
developers are willing to support them), but PDFs made
by old PDF production software may have a similar problem.

I wish some Adobe experts would comment on the situation in the
latest PDF for complex scripts :-)

Regards,
mpsuzuki



Re: [poppler] Working with Asian languages

2015-09-14 Thread Rob Hawkins
Thank you all for these great replies.  I find the stuff about the unicode
encoding order really interesting.  And I too wish we could find more
information about the as-yet unmapped Asian scripts.

I was mistaken about the output of PDF.js.  I thought I had viewed the HTML
source and seen good data, how exciting!  Yet now that I double-check, I
see it is just the viewer that is correct, and the source text is garbled
just like pdftotext etc.

I'm bummed there is no magic solution here as I thought I had found, but
glad to see people are still interested in this.  If I find out how to
implement these languages, I will try.  Alternatively, can we band together
to destroy PDFs everywhere?  If we work in concert it may be possible. =)

Thanks again,

Rob



Re: [poppler] Working with Asian languages

2015-09-14 Thread Adrian Johnson
On 15/09/15 01:23, Jonathan Kew wrote:
> I think what you're looking for is the ActualText feature in PDF. If
> this is present, a viewer or text-extraction tool can use it to provide
> the correct text, instead of trying to reconstruct the text from the
> stream of glyphs in the PDF -- which, while it often works OK for
> European languages and similar "simple" writing systems, is pretty much
> doomed to failure for complex South/Southeast Asian scripts, etc.
> 
> But this is dependent on the PDF-generating tool or workflow including
> the correct ActualText attributes in the first place. In my (very
> limited) experience, this is pretty rare.

Poppler has supported ActualText when extracting text since 2008. I
added this to poppler when I added ActualText generation to cairo.
Application support for this appears to be rare.  I'm not aware of any
cairo application that uses the cairo_show_text_glyphs() API for
generating ActualText entries.



Re: [poppler] Working with Asian languages

2015-09-14 Thread Jonathan Kew

I think what you're looking for is the ActualText feature in PDF. If 
this is present, a viewer or text-extraction tool can use it to provide 
the correct text, instead of trying to reconstruct the text from the 
stream of glyphs in the PDF -- which, while it often works OK for 
European languages and similar "simple" writing systems, is pretty much 
doomed to failure for complex South/Southeast Asian scripts, etc.


But this is dependent on the PDF-generating tool or workflow including 
the correct ActualText attributes in the first place. In my (very 
limited) experience, this is pretty rare.
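
For reference, ActualText is attached to a marked-content sequence in the
page's content stream, roughly like
"/Span <</ActualText <FEFF178017C1>>> BDC ... EMC".  The sketch below pulls
such entries out of an already-decompressed content stream.  It is only an
illustration: actual_texts is a hypothetical helper, it handles hex strings
only (not literal "(...)" strings), and a real tool should use a proper PDF
parser rather than a regular expression.

```python
import re
from typing import List

# "/ActualText" followed by a PDF hex string such as <FEFF178017C1>.
HEX_ACTUAL_TEXT = re.compile(rb"/ActualText\s*<([0-9A-Fa-f\s]*)>")


def actual_texts(content_stream: bytes) -> List[str]:
    """Return the decoded ActualText hex strings found in content_stream."""
    results = []
    for match in HEX_ACTUAL_TEXT.finditer(content_stream):
        hex_digits = re.sub(rb"\s+", b"", match.group(1))
        if len(hex_digits) % 2:
            hex_digits += b"0"  # an odd final hex digit is padded with 0
        raw = bytes.fromhex(hex_digits.decode("ascii"))
        if raw.startswith(b"\xfe\xff"):
            # A leading FEFF byte-order mark marks UTF-16BE text.
            results.append(raw[2:].decode("utf-16-be", errors="replace"))
        else:
            # Without a BOM the string is PDFDocEncoding; Latin-1 is used
            # here as an approximation only.
            results.append(raw.decode("latin-1"))
    return results
```

Given a stream containing the fragment above, this returns the two-character
Khmer string U+1780 U+17C1 in encoding order, regardless of the order in
which the glyphs inside the BDC ... EMC span are drawn.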


JK



[poppler] Working with Asian languages

2015-09-13 Thread Rob Hawkins
Greetings all,

Can pdftohtml produce output for Burmese, Khmer, Indonesian, Thai and
Vietnamese?  I didn't see a language pack for any except Thai, and that one
doesn't produce properly formatted characters for my source files.  They're
missing the vowel marks.  The other languages fail completely on my setup.
I've tried on OS X and Ubuntu 12.

My source files are here:
https://github.com/robhawkins/drive-taiwan/tree/master/input/pdf

Chinese seems to work fine.

I found out that PDF.js will produce good output, though I already have
code based on pdftohtml output and would rather not switch if not
necessary.  I wonder if there is something wrong with my setup.

Thanks for any help even if it's just a "nope, that's not possible" kind of
reply =)

Rob