On 12 Jan, Dan Jacobson wrote:
> Package: xpdf-utils
> Version: 3.01-9
> X-debbugs-cc: [EMAIL PROTECTED]
> Severity: normal
> File: /usr/bin/pdftotext
> 
> What's the deal with the repeated characters?
> 
> wget http://www.x-net.idv.tw/download/swlfreq/CHNB06.pdf
> pdftotext -layout -enc Big5 CHNB06.pdf
> Error: No paper information available - using defaults
> iconv -f big5 CHNB06.txt|grep 遭
> B.B.C. ( 英中廣播漢台 )* 遭大陸遭遭遭遭 B06
> V.O.A.   ( 美中中台 )* 遭大陸遭遭遭遭 B06
> R.FREE   ASIA ( 自自亞洲漢伊 ) *遭大陸遭遭遭遭 B06
> 註:RTI遭遭中中大陸遭遭遭遭, 請請請接請請時停請請文冬!
> 
> And why the wide characters when they look narrow in xpdf?
> 
> Phew, shook off the wides with
> perl -C -pwMText::Unidecode -e 's/\P{Han}+/unidecode($&)/eg'

The fonts in that PDF file have ToUnicode maps, so Xpdf (and therefore
pdftotext) uses those to convert character codes to Unicode.
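
For illustration only: a ToUnicode map is a CMap stream whose bfchar
entries pair a font character code with a UTF-16BE value, both in hex.
A minimal Python sketch of reading such entries follows.  The sample
CMap text and the name parse_bfchar are invented for this example; they
are not taken from CHNB06.pdf or from Xpdf's source, and bfrange
sections are ignored.

```python
import re

# Hypothetical ToUnicode fragment; the character codes are invented.
SAMPLE_CMAP = """
2 beginbfchar
<0041> <906D>
<0042> <5927>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {char_code: unicode_string} for every bfchar entry."""
    mapping = {}
    # Each bfchar section sits between 'beginbfchar' and 'endbfchar'.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        # Entries look like <source code> <destination UTF-16BE>.
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

if __name__ == "__main__":
    for code, text in parse_bfchar(SAMPLE_CMAP).items():
        print(f"code 0x{code:04X} -> U+{ord(text):04X}")
```

A text extractor that trusts these entries will emit whatever the map
says, whether or not it matches the glyph drawn on the page.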

I took a quick look, and several distinct glyphs map to U+906D (遭, the
repeated character).  I suspect that the software that generated this
PDF file ("Neevia docuPrinter LT") wrote incorrect ToUnicode maps.
That would also explain the wide Roman characters.
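
As a rough sketch of how one might check this, the Python below takes a
code-to-Unicode mapping and lists the Unicode values claimed by more
than one character code.  It also folds fullwidth forms back to ASCII,
on the assumption (not confirmed for this file) that the wide Roman
characters are U+FF01..U+FF5E.  The function names and the sample
mapping are hypothetical.

```python
import unicodedata
from collections import defaultdict

def shared_targets(mapping):
    """Return {unicode_string: sorted codes} for targets used by 2+ codes."""
    by_target = defaultdict(list)
    for code, text in mapping.items():
        by_target[text].append(code)
    return {t: sorted(c) for t, c in by_target.items() if len(c) > 1}

def fold_fullwidth(text):
    """Replace fullwidth forms (U+FF01..U+FF5E) with ASCII; keep the rest."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if "\uff01" <= ch <= "\uff5e" else ch
        for ch in text
    )

# Invented mapping: three distinct glyph codes all claim to be U+906D.
suspect = {0x10: "\u906d", 0x11: "\u906d", 0x12: "\u906d", 0x13: "\u5927"}

if __name__ == "__main__":
    for text, codes in shared_targets(suspect).items():
        print(f"U+{ord(text):04X} is the target of {len(codes)} codes")
    print(fold_fullwidth("\uff22\uff22\uff23 \u906d"))
```

Several codes sharing one target is not always an error, since a font
can hold variant glyphs for the same character, so a result like this
is a hint to inspect the map rather than proof that it is wrong.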

- Derek


