[
https://issues.apache.org/jira/browse/PDFBOX-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16384220#comment-16384220
]
Julien Férard commented on PDFBOX-4138:
---------------------------------------
I don't have any test file. I was just trying to understand the class (this is
not easy!), with the idea to subclass it. The code is a bit complex, but I
think the idea on beads is clear.
h3. The context
The charactersByArticle contains textLists (lists of text pos). It's size is 2
times the number of beads + 1, and it's content is:
- textList before bead 0
- textList in bead 0
- ...
- textList in bead n
- textList after bead n
(see
[https://github.com/apache/pdfbox/blob/0e07344c0e3a932f0ca346f7cac4700882c67b5d/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L151])
h3. The portion of code around the line 844
When examining a text pos (among other things), the portion of code tries to
find to which textList it belongs. It does this (pseudo code):
{code}
i=-1
for bead in beads:
if bead contains text pos:
i = bead index * 2 + 1
break
if i == -1: # bead not found
for bead in beads:
if text pos "before" bead: # there are 3 levels of before in the code
i = bead index * 2
break
if i == -1: # before bead not found
i = charactersByArticle.get(size-1) # after last bead
# use charactersByArticle.get(i)
{code}
(Actually, this is done in one loop in the code.)
The question is: what is "before"? The answer in the code (*excluding the
suspicious ||*) is: there are 3 levels of "before": "on the left and above" a
bead is the best, but if no bead suits, then "on the left" is ok, and finally
"above" is the worst. It doesn't make sense to have "on the left OR above"
better than "on the left".
h3. Aside
Another way to see that the code is wrong is the following: we enter the two
blocks at the lines 849-857 only if
{{notFoundButFirstLeftAndAboveArticleDivisionIndex != -1}}, because : 1. {{else
if}} => we need to have {{(x >= rect.getLowerLeftX() && y >=
rect.getUpperRightY()) || notFoundButFirstLeftAndAboveArticleDivisionIndex !=
-1)}} 2. the tests check either {{x < rect.getLowerLeftX()}} or {{y <
rect.getUpperRightY()}}, which are both false if {{x >= rect.getLowerLeftX() &&
y >= rect.getUpperRightY()}}.
Thus, {{notFoundButFirstLeftArticleDivisionIndex}} and
{{notFoundButFirstAboveArticleDivisionIndex}} may be != -1 only if
{{notFoundButFirstLeftAndAboveArticleDivisionIndex}} is already != -1. And
conversely, if {{notFoundButFirstLeftAndAboveArticleDivisionIndex == -1}}, then
{{notFoundButFirstLeftArticleDivisionIndex == -1}} and
{{notFoundButFirstAboveArticleDivisionIndex == -1}}.
Why those tests at lines 879-886? We already know that
{{notFoundButFirstLeftArticleDivisionIndex}} and
{{notFoundButFirstAboveArticleDivisionIndex == -1}} because
{{notFoundButFirstLeftAndAboveArticleDivisionIndex == -1}}! A part of the code
is unreachable...
This doesn't mean that the result will be better if the code is more
consistent. I don't know...
> PDFTextStripper: error in a comparison
> --------------------------------------
>
> Key: PDFBOX-4138
> URL: https://issues.apache.org/jira/browse/PDFBOX-4138
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.8
> Reporter: Julien Férard
> Priority: Minor
>
> This is very simple. Maybe I'm wrong, but in PdfTextStripper, l. 844
>
> [https://github.com/apache/pdfbox/blob/0e07344c0e3a932f0ca346f7cac4700882c67b5d/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L844]
> * You want to check if the pos is on the left *and* above the rectangle
> (this is better than just on the left or just above);
> * The name of the variable contains "LeftAndAbove".
> ...and the code contains a `||` (or).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]