Re: [Podofo-users] reading polish characters using PoDoFo

Etienne Robin Wed, 01 Feb 2017 01:26:40 -0800

Hi,

I think that the problem in the test.pdf sample comes from the ConvertToUnicode 
implementation.


The sample has an embedded font and a ToUnicode table, so 
PdfCMapEncoding::ConvertToUnicode is called.
In v0.9.4, its implementation simply calls the base class method 
PdfEncoding::ConvertToUnicode.

The PdfEncoding::ConvertToUnicode implementation assumes that the encoded 
string has 2 bytes per character code. That is not the case here: the encoding 
is a single-byte encoding.
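
To illustrate the mismatch, here is a minimal standalone sketch. It is not 
PoDoFo code: ToUnicodeMap and decodeFixedWidth are hypothetical names, and 
emitting U+0000 for an unmapped code is only my assumption about how such a 
decoder might behave. It shows that reading a single-byte string two bytes at 
a time produces codes that are not in the table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a ToUnicode table: character code -> code point.
using ToUnicodeMap = std::map<std::uint32_t, char32_t>;

// Decode `encoded` by consuming a fixed number of bytes per character code.
// Assumption for this sketch: codes missing from the table become U+0000.
std::u32string decodeFixedWidth(const std::string& encoded,
                                const ToUnicodeMap& table,
                                std::size_t bytesPerCode)
{
    assert(bytesPerCode >= 1 && bytesPerCode <= 4);
    std::u32string out;
    for (std::size_t i = 0; i + bytesPerCode <= encoded.size(); i += bytesPerCode) {
        // Build the character code from big-endian bytes.
        std::uint32_t code = 0;
        for (std::size_t k = 0; k < bytesPerCode; ++k)
            code = (code << 8) | static_cast<unsigned char>(encoded[i + k]);

        const auto it = table.find(code);
        out.push_back(it != table.end() ? it->second : U'\0');
    }
    return out;
}
```

With a table that maps the single-byte codes 0x01 and 0x02, the two-byte 
string 01 02 decodes to two characters when bytesPerCode is 1. With 
bytesPerCode set to 2, the bytes are read as the single code 0x0102, which the 
table does not contain.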

Adobe's document “5014.CIDFont_spec.pdf”,
http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf
describes CMap tables. Section 5.2 / Codespace says:


"The CMap file fully describes the potential set of valid input character code 
values. Input codes may consist of one, two, three, or more hexadecimal bytes, 
expressed between < > brackets, Ranges need not be contiguous, but cannot 
overlap. The codespace definition unambiguously specifies which input codes 
consist of one byte, which consist of two, and so forth."

So a single-byte encoding seems to me to be valid, but 
PdfEncoding::ConvertToUnicode does not support it: the returned strings 
contain only 0 values or are empty.
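
As a rough sketch of what the quoted paragraph implies, the code length could 
be taken from the codespace ranges instead of being fixed at 2. This is not 
PoDoFo code: CodespaceRange, matchCodeLength and splitCodes are hypothetical 
names, and skipping a byte that matches no range is my own simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One codespace range, e.g. <00> <FF> or <8140> <9FFC>.
// low and high have the same length, which is the code length in bytes.
struct CodespaceRange {
    std::vector<std::uint8_t> low;
    std::vector<std::uint8_t> high;
};

// Length in bytes of the code starting at `pos`, or 0 if no range matches.
// A code matches when every byte lies between the corresponding bytes of
// low and high.
std::size_t matchCodeLength(const std::string& s, std::size_t pos,
                            const std::vector<CodespaceRange>& ranges)
{
    for (const auto& r : ranges) {
        const std::size_t n = r.low.size();
        assert(n >= 1 && n <= 4 && n == r.high.size());
        if (pos + n > s.size())
            continue;
        bool inside = true;
        for (std::size_t k = 0; k < n && inside; ++k) {
            const auto b = static_cast<std::uint8_t>(s[pos + k]);
            inside = (r.low[k] <= b && b <= r.high[k]);
        }
        if (inside)
            return n;
    }
    return 0;
}

// Split an encoded string into character codes of possibly mixed length.
std::vector<std::uint32_t> splitCodes(const std::string& s,
                                      const std::vector<CodespaceRange>& ranges)
{
    std::vector<std::uint32_t> codes;
    std::size_t pos = 0;
    while (pos < s.size()) {
        const std::size_t n = matchCodeLength(s, pos, ranges);
        if (n == 0) {   // simplification: skip a byte that fits no range
            ++pos;
            continue;
        }
        std::uint32_t code = 0;
        for (std::size_t k = 0; k < n; ++k)
            code = (code << 8) | static_cast<std::uint8_t>(s[pos + k]);
        codes.push_back(code);
        pos += n;
    }
    return codes;
}
```

With the single range <00> <FF>, every byte becomes its own code, which is the 
case in the sample. The resulting codes would then be looked up in the 
ToUnicode table.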

I’m still new to the PDF format and PoDoFo; can someone else confirm this?

Best regards,

Etienne


On 31 Jan 2017, at 19:46, zyx <z...@litepdf.cz> wrote:

On Tue, 2017-01-31 at 11:59 +0100, zyx wrote:
I can give my pdf file here if needed.

As long as you'll not expose any private data with it.

Hi,
as there had been private data involved, Fryderyk sent the file to me
privately and I see that with the svn trunk at revision 1818 the
podofotxtextract doesn't crash here, it also prints several lines where
the text is supposed to be, but it doesn't decode the text to unicode
properly. All the converted to-unicode texts are zero-length.

I can reproduce the same output (no text printed) with the attached
file from LibreOffice.

PoDoFo 0.9.4 (revision 1764) fails in the same way, thus it's no
regression from the previous version.

PoDoFo 0.9.3 (revision 1650) fails with an error "Found text but do not
have a current font", thus there is a little improvement in 0.9.4,
though still not working.

Dom, what is your opinion with respect to the 0.9.5 release?

Bye,
zyx

--
http://www.litePDF.cz
i...@litepdf.cz
<test.pdf>
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
