Bugs in extraction of text from Arabic PDFs

yigal dayan Sun, 04 Apr 2010 03:38:09 -0700

Hello,

I want to report three bugs in text extraction. One is general and two are 
specific to Arabic PDFs.


(1) Performance issue
  Relevant to : pdfBox 1.1.0, pdfBox 1.0.0

The method normalizeDiac in org.apache.pdfbox.util.ICU4JImpl builds its result 
by appending characters to a String one at a time, so every append copies the 
whole string again. It should accumulate into a StringBuilder instead.
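
To illustrate the kind of change I mean, here is a minimal sketch. It is not 
the actual ICU4JImpl source: the class name is made up, the per-character 
normalization step is an assumption, and it uses java.text.Normalizer from the 
JDK rather than ICU4J so that it stands alone.

```java
import java.text.Normalizer;

// Illustrative only: not the actual ICU4JImpl source.
final class NormalizeDiacSketch {

    // Appends into one growable buffer instead of creating a new String
    // for every character of the input.
    static String normalizeDiac(String str) {
        StringBuilder out = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            int type = Character.getType(c);
            if (type == Character.NON_SPACING_MARK
                    || type == Character.MODIFIER_SYMBOL
                    || type == Character.MODIFIER_LETTER) {
                // Assumed per-character step: compatibility-normalize the mark.
                out.append(Normalizer.normalize(String.valueOf(c),
                        Normalizer.Form.NFKC).trim());
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With String concatenation the loop does work proportional to the square of the 
line length; with StringBuilder it is roughly linear.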



(2) Incorrect ordering of compound Arabic glyphs
  Relevant to : pdfBox 1.1.0, pdfBox 1.0.0

Some Arabic PDFs contain compound glyphs for stylistic reasons. Such glyphs 
encode two letters: FI, SI, LI, LJ, LM, etc.

Before a line gets sent to the bidirectional algorithm, all characters have 
been sorted into visual order, except for these pairs. Each pair is handled as 
one unit, so its two letters keep their original (logical) order. The bidi 
algorithm then straightens out most characters, but reverses the glyph pairs.

To fix this, the output of font.encode() should be examined and
reversed on the spot if it contains pairs of Arabic characters. Possibly you 
need to add a stub method to PDFStreamEngine that PDFTextStripper can override 
(in sort mode only).
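
A rough sketch of the reversal step, under the assumption that the string 
decoded for one glyph is available at that point. The class and method names 
are hypothetical and do not exist in PDFBox; the script check uses 
Character.UnicodeScript from the JDK.

```java
// Hypothetical helper: not part of PDFBox.
final class ArabicCompoundSketch {

    // If the text decoded from a single glyph holds two or more code points
    // and all of them are Arabic, return it reversed so that it matches the
    // visual order of the rest of the line; otherwise return it unchanged.
    static String reverseIfArabicCompound(String decoded) {
        if (decoded == null
                || decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded;
        }
        for (int i = 0; i < decoded.length(); ) {
            int cp = decoded.codePointAt(i);
            if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.ARABIC) {
                return decoded;
            }
            i += Character.charCount(cp);
        }
        // StringBuilder.reverse() keeps surrogate pairs intact.
        return new StringBuilder(decoded).reverse().toString();
    }
}
```

This simple version would also reverse a base letter and a following 
diacritic, so a real fix may need to treat combining marks separately.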



(3) Corruption of Arabic output due to Japanese bug fix
  Relevant to : pdfBox 1.1.0

The recent Japanese bug fix in org.apache.pdfbox.pdmodel.font.PDFont
defines a set of encoding names that are given special CJK treatment. This set 
is too broad. For example, it stipulates that the 'Identity-H' encoding should 
be processed as JIS.

We have Arabic PDFs where compound Arabic glyphs use the 'Identity-H' encoding. 
In pdfBox 1.0.0 they produced Arabic output, but now they produce garbage, 
because the Arabic Unicode data is sent to the CJK converter.
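
One possible way to narrow the set is sketched below. This is not the code in 
PDFont: the class and method names are hypothetical, and the list is only an 
example built from a few of the predefined Japanese CMap names in the PDF 
specification. The point is that 'Identity-H' and 'Identity-V' say nothing 
about the script and so should not be in the set.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: not the actual PDFont code.
final class CjkEncodingSketch {

    // Only encodings that are explicitly Japanese get the JIS conversion.
    // Example entries only; the real list would need to be reviewed.
    private static final Set<String> JIS_ENCODINGS =
            Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(
                    "90ms-RKSJ-H", "90ms-RKSJ-V",
                    "90msp-RKSJ-H", "90msp-RKSJ-V",
                    "EUC-H", "EUC-V")));

    static boolean needsJisConversion(String encodingName) {
        return encodingName != null && JIS_ENCODINGS.contains(encodingName);
    }
}
```

With a check like this, an Arabic font that uses 'Identity-H' would bypass the 
CJK converter and its Unicode data would pass through unchanged.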


Thanks,
Yigal Dayan
