Pdf with Times New Roman and Cyrillic glyphs = weird characters extracted

Anton Stoychev Thu, 27 Mar 2014 12:04:31 -0700

The problematic PDF is this:
http://www.parliament.bg/pub/StenD/iv260712.pdf


The first time I opened it in Adobe Reader the entries in the first column
showed as garbled glyphs like ȺɅȿɄɋȺɇȾɔɊɊɍɆȿɇɈȼɇȿɇɄɈ.

Then I installed the Times New Roman font family on my Fedora machine and
restarted Adobe Reader. This fixed it, and I was able to see the correct names, like
"АЛЕКСАНДЪР РУМЕНОВ НЕНКОВ".

These are persons' names in Cyrillic.

I'm using PDFBox along with tabula-extractor
(https://github.com/jazzido/tabula-extractor) to extract table data, but it
seems that even with Times New Roman installed on my machine, the names are
still garbled:

ȺɅȿɄɋȺɇȾɔɊȾɂɆɂɌɊɈȼɉȺɍɇɈȼ,740,ɄȻ,-,+,+,0,0,+,+,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-
ȺɅȿɄɋȺɇȾɔɊɏɊɂɋɌɈȼɆȿɌɈȾɂȿȼ,917,Ⱦɉɋ,0,0,0,=,=,+,+,-,+,+,-,-,-,-,+,+,+,+,+,+,-,+,+,-,-,-
ȺɅȿɄɋɂȼȺɋɂɅȿȼȺɅȿɄɋɂȿȼ,919,ɄȻ,-,+,+,-,-,+,+,0,0,+,-,-,-,-,+,+,-,+,0,+,-,+,+,-,-,-
ȺɅɂɈɋɆȺɇɂȻɊȺɂɆɂɆȺɆɈȼ,336,Ⱦɉɋ,0,+,+,-,-,+,+,-,+,+,-,-,-,-,+,0,0,0,0,0,0,0,0,0,0,-
ȺɇȾɈɇɉȿɌɊɈȼȺɇȾɈɇɈȼ,856,ȽȿɊȻ,+,=,-,+,+,=,-,+,=,0,0,0,0,0,0,-,+,-,-,-,0,-,-,+,+,0
ȺɇɌɈɇɄɈɇɋɌȺɇɌɂɇɈȼɄɍɌȿȼ,343,ɄȻ,0,0,0,-,-,+,+,-,+,+,0,-,-,-,+,+,-,+,+,+,-,+,0,0,-,-
ȺɇɌɈɇɂɃɃɈɊȾȺɇɈȼɃɈɊȾȺɇɈȼ,604,ȽȿɊȻ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ȺɌȺɇȺɋɁȺɎɂɊɈȼɁȺɎɂɊɈȼ,744,ɄȻ,-,+,+,-,-,+,+,0,+,+,-,-,-,-,+,+,-,+,0,+,-,+,+,-,-,-
ȺɌȺɇȺɋɂȼȺɇɈȼɌȺɒɄɈȼ,857,ȽȿɊȻ,+,=,=,+,+,-,=,0,0,0,0,0,0,0,0,-,+,-,-,0,0,0,0,0,0,0
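
Comparing the garbled output with the names Adobe Reader shows, every
character seems to be shifted by the same amount: 'Ⱥ' (U+023A) appears where
'А' (U+0410) is expected, 'Ʌ' (U+0245) where 'Л' (U+041B) is expected, and so
on, a constant offset of 0x1D6. As a stopgap, the extracted text could be
post-processed along these lines. This is only a sketch based on the sample
above; the class name CyrillicShift and the method fix are hypothetical, this
is not a PDFBox feature, and covering lowercase letters is an assumption since
the sample contains only uppercase.

```java
final class CyrillicShift {
    // Offset observed in the sample: U+023A ('Ⱥ') shows up where U+0410 ('А') is expected.
    private static final int OFFSET = 0x0410 - 0x023A; // 0x1D6

    // Garbled range that corresponds to U+0410..U+044F (А..я).
    // Only uppercase letters occur in the sample; lowercase is an assumption.
    private static final int FIRST = 0x023A;
    private static final int LAST = 0x044F - OFFSET; // 0x0279

    // Shifts every code point inside the garbled range; leaves digits,
    // commas, '+', '-', '=' and everything else untouched.
    static String fix(String garbled) {
        StringBuilder out = new StringBuilder(garbled.length());
        garbled.codePoints().forEach(cp ->
                out.appendCodePoint(cp >= FIRST && cp <= LAST ? cp + OFFSET : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        // Start of the first extracted row from the sample above.
        System.out.println(fix("ȺɅȿɄɋȺɇȾɔɊ,740,ɄȻ"));
    }
}
```

This only hides the symptom. It would also corrupt any text that legitimately
uses characters in U+023A..U+0279, so it is usable only on documents known to
have this problem.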

Is this something to do with the glyphlist_ext described at
http://pdfbox.apache.org/cookbook/textextraction.html#external-glyph-list ?

I tried
PDFont font = PDTrueTypeFont.loadTTF(document, "Times New Roman.ttf");

It didn't do anything.

Am I doing something wrong? How can I fix this?

Best Regards,

Anton
