On 06.04.19 at 17:19, Tilman Hausherr wrote:
I looked at about 10 files... all are rotated. I suspect this is a result of PDFBOX-4480: previously, some rotated words came out as one piece. But this doesn't matter; the overall extraction of rotated pages would still look bad.
PDFBOX-4480 is one reason, but it isn't the only one. The unsorted results without those changes are different from those of 2.0.14.


For example, the file you mention extracted this in 2.0.14:

...
R
E
R
M
H
IV
-1
infection
hum
an(B
8)
[G
oulder97c]
...

So it had "infection" but the rest was still worthless. The same file extracts nicely with the "rotationMagic" option of ExtractText.
I agree with Tilman: because the unsorted results are worthless, one can't say that one is better or worse than the other. Only the sorted results are useful, and those are equal. That said, IMHO this is not a regression.


Tilman

On 06.04.2019 at 15:50, Tim Allison wrote:
http://162.242.228.174/reports/reports_pdfbox_2.0.15-SNAPSHOT.tgz

This compares 2.0.15-SNAPSHOT with 2.0.13 (I think)...IIRC, though,
there were no content differences btwn 2.0.13 and 2.0.14.  I did not
apply angle detection.

No new exceptions; 2 fixed exceptions.  We're getting higher page
counts in a few documents, because we overrode processPages() to
process.  Some changes in content, but overall, better, I think, based
on contents/common_token_comparisons_by_mime.xlsx.

To see where content appears to degrade, open
contents/content_diffs_(no|with)_exceptions, and sort column M
('NUM_COMMON_TOKENS_DIFF_IN_B') in ascending order.  Also, look at
columns R (TOP_10_UNIQUE_TOKEN_DIFFS_A) and S
(TOP_10_UNIQUE_TOKEN_DIFFS_B)...these columns show the top 10 most
frequent tokens that are unique to A or unique to B; from this, it
looks like there is a regression in, e.g. govdocs1/038/038519.pdf,
but, generally (hand waving), it appears that there were word
segmentation problems in both A and B as I look at the results.
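
To make the idea behind columns R and S concrete, here is a minimal sketch of how "top 10 most frequent tokens unique to A or unique to B" could be computed for two extracted texts. It is not the code that generated the report; the function names, the regex tokenizer, and the sample strings are assumptions chosen for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on runs of non-word characters.

    This is a deliberately simple tokenizer (an assumption for this sketch).
    """
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def top_unique_tokens(text_a, text_b, n=10):
    """Return (unique_to_a, unique_to_b) as lists of (token, count) pairs.

    Each list holds at most n tokens that occur in one text but never in the
    other, ordered by descending count, with ties broken alphabetically.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))

    def rank(own, other):
        # Keep only tokens absent from the other extraction, then rank them.
        unique = [(tok, cnt) for tok, cnt in own.items() if tok not in other]
        return sorted(unique, key=lambda pair: (-pair[1], pair[0]))[:n]

    return rank(counts_a, counts_b), rank(counts_b, counts_a)


# Hypothetical extractions: B splits a word that A keeps whole.
unique_a, unique_b = top_unique_tokens(
    "infection human human",
    "infection hum an hum an",
)
print(unique_a)  # [('human', 2)]
print(unique_b)  # [('an', 2), ('hum', 2)]
```

Word fragments such as "hum" and "an" showing up as unique to one side is the kind of signal that points to word segmentation problems rather than missing content.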

Cheers,

              Tim

On Fri, Apr 5, 2019 at 10:53 AM Tim Allison <talli...@apache.org> wrote:
+1 I should have regression results by tomorrow

On Fri, Apr 5, 2019 at 2:15 AM Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
+1

On 05.04.2019 at 06:31, Andreas Lehmkuehler <andr...@lehmi.de> wrote:

Hi,

Looks like it's time for the next release. How about cutting 2.0.15 next Monday?

WDYT?

Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

