Hello,
I would like to report three bugs in text extraction. One is general and two
are specific to Arabic PDFs.
(1) Performance issue
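Relevant to: PDFBox 1.1.0, PDFBox 1.0.0
The method normalizeDiac in org.apache.pdfbox.util.ICU4JImpl builds its result
by appending characters to a String one at a time. Because String is immutable,
each append allocates a new String and copies the old contents. It should use
StringBuilder instead.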
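To illustrate the difference, here is a minimal sketch. It is not the actual
body of normalizeDiac; the class and method names (CharJoinSketch, joinWithString,
joinWithBuilder) are made up for this example.

```java
// Hypothetical illustration only; not code from ICU4JImpl.
final class CharJoinSketch {

    // Slow pattern: every += allocates a new String and copies the old one,
    // so the total work grows quadratically with the input length.
    static String joinWithString(char[] chars) {
        String result = "";
        for (char c : chars) {
            result += c;
        }
        return result;
    }

    // Suggested pattern: StringBuilder appends into one growable buffer,
    // and the final String is created once at the end.
    static String joinWithBuilder(char[] chars) {
        StringBuilder sb = new StringBuilder(chars.length);
        for (char c : chars) {
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Both methods return the same text; only the amount of copying differs.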
(2) Incorrect ordering of compound Arabic glyphs
Relevant to: PDFBox 1.1.0, PDFBox 1.0.0
Some Arabic PDFs contain compound glyphs for stylistic reasons. Such glyphs
encode two letters: FI, SI, LI, LJ, LM, etc.
By the time a line is sent to the bidirectional (bidi) algorithm, all
characters have been sorted into visual order, except for these pairs. Each
pair is handled as one unit, so its two letters keep their original (logical)
order. The bidi algorithm then straightens out most characters, but reverses
the two letters within each glyph pair.
To fix this, the output of font.encode() should be examined and reversed on
the spot if it contains a pair of Arabic characters. This may require adding a
stub method to PDFStreamEngine that PDFTextStripper can override (in sort mode
only).
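A rough sketch of the reversal step follows. It assumes the decoded text of a
single glyph is available as a String; the class and method names
(ArabicPairSketch, reverseIfArabicCompound) are hypothetical and are not part
of PDFStreamEngine or PDFTextStripper.

```java
// Hypothetical helper; where it would be called from is left open.
final class ArabicPairSketch {

    // True if the character lies in one of the Arabic Unicode blocks.
    static boolean isArabic(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.ARABIC
            || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
            || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // If one glyph decoded to two or more Arabic characters, reverse them so
    // that they are in visual order like the rest of the line. Anything else
    // is returned unchanged.
    static String reverseIfArabicCompound(String decoded) {
        if (decoded == null || decoded.length() < 2) {
            return decoded;
        }
        for (int i = 0; i < decoded.length(); i++) {
            if (!isArabic(decoded.charAt(i))) {
                return decoded;
            }
        }
        return new StringBuilder(decoded).reverse().toString();
    }
}
```

Single characters and non-Arabic ligatures such as "fi" pass through
untouched, so the check should be safe to apply to every glyph in sort mode.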
(3) Corruption of Arabic output due to Japanese bug fix
Relevant to: PDFBox 1.1.0
The recent Japanese bug fix in org.apache.pdfbox.pdmodel.font.PDFont
defines a set of encoding names that are given special CJK treatment. This set
is too broad. For example, it stipulates that the 'Identity-H' encoding should
be processed as JIS.
We have Arabic PDFs where compound Arabic glyphs use the 'Identity-H' encoding.
In PDFBox 1.0.0 these files produced Arabic text, but in 1.1.0 they produce
garbage, because the Arabic Unicode data is sent to the CJK converter.
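One possible direction is to restrict the CJK treatment to encoding names that
are actually CJK-specific. The sketch below is only an illustration: the class
and method names (CjkEncodingSketch, needsCjkConversion) are hypothetical, and
the listed names are a few sample predefined CJK CMap names, not the set
currently defined in PDFont.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical check; the real set would have to be chosen by the maintainers.
final class CjkEncodingSketch {

    // Illustrative sample of Japanese-specific CMap names.
    // 'Identity-H' and 'Identity-V' are deliberately left out because they
    // say nothing about the script of the text.
    private static final Set<String> CJK_ENCODINGS =
        Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(
            "90ms-RKSJ-H", "90ms-RKSJ-V", "EUC-H", "EUC-V")));

    // True only for encodings that should go through the CJK converter.
    static boolean needsCjkConversion(String encodingName) {
        return encodingName != null && CJK_ENCODINGS.contains(encodingName);
    }
}
```

With a check like this, text using 'Identity-H' would bypass the CJK converter
and be handled as it was in 1.0.0.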
Thanks,
Yigal Dayan