On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
Content

1)  To get a _general_ sense of overall content extraction, see 
"content/common_token_comparisons_by_mime.xlsx".  This suggests that we've lost 248k "common 
words"[1], which out of 2.6 billion isn't much.  However, we also lost 18 million common words 
going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 1.15-SNAPSHOT)...so I'd hoped the fix to PDFBOX-3717 
would have led to an improvement.

2)  If you want to compare content whether or not there was a parse exception, see 
"content/content_diffs_with_exceptions.xlsx".

3) If you only want to see content diffs where both extracts did not have an exception, 
see "content/content_diffs_ignore_exceptions.xlsx".

To make quick sense of the content_diffs_files, sort 
"NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files 
lost the most common tokens.
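
As a rough illustration of that sort, here is a minimal Python sketch. It assumes 
the spreadsheet rows have already been loaded as dictionaries; the "FILE" column 
name and the three example rows are made up for illustration, and only 
NUM_COMMON_TOKENS_DIFF_IN_B comes from the report itself.

```python
# Hypothetical rows; the file names and values are invented for illustration.
rows = [
    {"FILE": "example_a.pdf", "NUM_COMMON_TOKENS_DIFF_IN_B": -120},
    {"FILE": "example_b.pdf", "NUM_COMMON_TOKENS_DIFF_IN_B": 35},
    {"FILE": "example_c.pdf", "NUM_COMMON_TOKENS_DIFF_IN_B": -4800},
]


def biggest_losers(rows, column="NUM_COMMON_TOKENS_DIFF_IN_B"):
    """Return rows sorted ascending, so the largest losses come first."""
    return sorted(rows, key=lambda row: row[column])


for row in biggest_losers(rows):
    print(row["FILE"], row["NUM_COMMON_TOKENS_DIFF_IN_B"])
```

The most negative values sort to the top, which is where the files that lost the 
most common tokens in extract B show up.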

To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which 
compare the number of unique tokens and of tokens in common, respectively...a low 
number means little similarity, while a number close to 1.0 means that the 
unigrams are nearly identical.
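
For anyone who wants to see roughly what these two scores measure, here is a small 
Python sketch. The Dice coefficient over unique tokens is the standard definition; 
the overlap formula (shared token counts over total token counts) is my assumption 
of a reasonable count-based variant and may not match tika-eval's exact 
calculation. Treating two empty extracts as identical is also an assumption.

```python
from collections import Counter


def dice_coefficient(tokens_a, tokens_b):
    """Standard Dice coefficient over the sets of unique tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty extracts count as identical
    return 2.0 * len(set_a & set_b) / (len(set_a) + len(set_b))


def token_overlap(tokens_a, tokens_b):
    """Count-based overlap (assumed formula, not necessarily tika-eval's)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 1.0  # assumption: two empty extracts count as identical
    # Counter & Counter keeps the minimum count of each shared token.
    shared = sum((counts_a & counts_b).values())
    return 2.0 * shared / total


old_tokens = "the quick brown fox".split()
new_tokens = "the quick brown dog".split()
print(dice_coefficient(old_tokens, new_tokens))
print(token_overlap(old_tokens, new_tokens))
```

Both functions return a value between 0.0 (nothing in common) and 1.0 (same 
unigrams), which matches how the columns are meant to be read.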


From a quick look, many of the files with fewer common words are in the "likely_broken" 
and/or "truncated" subdirectories...  Some exceptions to this rule include the following, 
but there are more...and overall, there is a fair amount of loss from 2.0.3.

govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56

Thanks for the test... three of these four have been fixed; the cause was yet another problem recognizing the end of inline images. All were created by "Leadtools". The fourth (202097.pdf) is tracked in issue PDFBOX-3785.

Most issues are probably related to truncated files. Some of these do not even display with Adobe Reader.

Tilman




[1] For this version of tika-eval, I expanded Tilman's initial recommendation 
of common words for English a bit.  I took the top 20k most common words (4 
characters or more, except for CJK) for a large number of Wikipedia dumps.  I 
removed common html markup words (body, form, table) so that failure to strip 
html doesn't incorrectly boost scores.

  We apply language id and then use the common words for that language.  For 
example, for 
truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW

* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 
tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there 
were 320 common words from the English list of common words.
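
To make the counting step concrete, here is a minimal Python sketch. It assumes 
language id has already produced a language code, and it uses tiny made-up word 
lists in place of the real 20k-word lists; the names COMMON_WORDS and 
count_common_tokens and the regex tokenizer are illustrative only, not 
tika-eval's implementation.

```python
import re

# Tiny hypothetical word lists; the real lists hold about 20k words per language.
COMMON_WORDS = {
    "en": {"that", "with", "have", "this", "from"},
    "fr": {"dans", "pour", "avec", "cette", "sont"},
}


def count_common_tokens(text, lang, common_words=COMMON_WORDS):
    """Count tokens of `text` that appear in the common-word list for `lang`."""
    tokens = re.findall(r"\w+", text.lower())
    words = common_words.get(lang, set())  # unknown language: nothing matches
    return sum(1 for token in tokens if token in words)


sample = "This report comes from a study with data"
print(count_common_tokens(sample, "en"))
print(count_common_tokens(sample, "fr"))
```

Because the list is chosen by the detected language, a change in language id 
between two extracts (as in the French/English example above) changes which list 
the tokens are scored against.
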
-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 08.05.2017 at 15:06, Allison, Timothy B. wrote:
Happy to.  Will kick off now?
Yes

Tilman

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, May 6, 2017 10:02 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:
On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
any objections?
I'm targeting the 15th or 16th
Tim, could you please run your tests when time allows?

Thanks

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For
additional commands, e-mail: dev-h...@pdfbox.apache.org


