Bug#406568: pdftotext: repeated chars

Derek B. Noonburg Tue, 13 Feb 2007 17:57:37 -0800

On 12 Jan, Dan Jacobson wrote:
> Package: xpdf-utils
> Version: 3.01-9
> X-debbugs-cc: [EMAIL PROTECTED]
> Severity: normal
> File: /usr/bin/pdftotext
> 
> What's the deal with the repeated characters?
> 
> wget http://www.x-net.idv.tw/download/swlfreq/CHNB06.pdf
> pdftotext -layout -enc Big5 CHNB06.pdf
> Error: No paper information available - using defaults
> iconv -f big5 CHNB06.txt|grep 遭
> Ｂ．Ｂ．Ｃ．　（　英中廣播漢台　）＊　遭大陸遭遭遭遭　Ｂ０６
> Ｖ．Ｏ．Ａ．　　　（　美中中台　）＊　遭大陸遭遭遭遭　Ｂ０６
> Ｒ．ＦＲＥＥ　　　ＡＳＩＡ　（　自自亞洲漢伊　）　＊遭大陸遭遭遭遭 Ｂ０６
> 註：ＲＴＩ遭遭中中大陸遭遭遭遭，　請請請接請請時停請請文冬！
> 
> And why the wide characters when they look narrow in xpdf?
> 
> Phew, shook off the wides with
> perl -C -pwMText::Unidecode -e 's/\P{Han}+/unidecode($&)/eg'
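For readers without Text::Unidecode, here is a minimal Python sketch of a
narrower workaround. It is not equivalent to the Perl one-liner above, which
transliterates every non-Han run. It only folds the fullwidth ASCII variants
(U+FF01 to U+FF5E) and the ideographic space (U+3000) back to plain ASCII.
The function name fold_fullwidth is made up for this illustration.

```python
def fold_fullwidth(text):
    """Fold fullwidth ASCII variants and the ideographic space to ASCII.

    Hypothetical helper for illustration; all other characters,
    including Han ideographs, are passed through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            # Fullwidth forms sit at a fixed offset of 0xFEE0 from ASCII.
            out.append(chr(cp - 0xFEE0))
        elif cp == 0x3000:
            # Ideographic space becomes an ordinary space.
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(fold_fullwidth("Ｂ．Ｂ．Ｃ．　Ｂ０６"))
```

unicodedata.normalize("NFKC", text) from the standard library would also fold
these characters, but it rewrites other compatibility characters as well, so
the explicit range check may be the safer choice when only the wide Roman
text is unwanted.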


The fonts in that PDF file have ToUnicode maps, which means Xpdf is
using those for conversion.

I took a quick look and several glyphs map to U+906D (遭, which is the
repeated character).  I suspect that whatever software generated this
PDF file ("Neevia docuPrinter LT") created incorrect ToUnicode maps.
That would also explain the wide Roman characters.
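One way to check for this kind of problem is to look for Unicode values that
several different glyph codes share in a ToUnicode CMap. The sketch below is
only an illustration under stated assumptions: it expects the CMap stream to
have been extracted and decompressed already (with whatever PDF tool is at
hand), it handles only bfchar sections and ignores bfrange, and the sample
CMap text is invented for the example rather than taken from CHNB06.pdf. The
name duplicate_targets is hypothetical.

```python
import re
from collections import defaultdict

# Invented ToUnicode fragment for illustration only;
# it is NOT extracted from CHNB06.pdf.
SAMPLE_CMAP = """
4 beginbfchar
<0001> <906D>
<0002> <5927>
<0003> <906D>
<0004> <906D>
endbfchar
"""


def duplicate_targets(cmap_text):
    """Return {unicode_hex: [glyph codes]} for targets used more than once.

    Only bfchar sections are examined; bfrange sections are ignored.
    """
    targets = defaultdict(list)
    # Each bfchar section holds pairs of <source code> <UTF-16BE value>.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(
            r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block
        ):
            targets[dst.upper()].append(src.upper())
    return {dst: srcs for dst, srcs in targets.items() if len(srcs) > 1}


if __name__ == "__main__":
    for dst, srcs in duplicate_targets(SAMPLE_CMAP).items():
        print(f"<{dst}> is the target of glyph codes {', '.join(srcs)}")
```

A shared target is not proof of a bad map by itself, since a font may
legitimately carry several glyphs for one character. It is suspicious when
the glyphs visibly differ on the page, as they appear to here.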

- Derek


