Re: 2.0.6 release ?
On 20.05.2017 at 16:17, Tilman Hausherr wrote:
> On 12.05.2017 at 15:23, Allison, Timothy B. wrote:
>> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz
>>
>> Looks good to me on a very cursory look.
>
> IMO there are two files that could be investigated:
>
> 5WULWDW54DAQ4ORVJSACEE2KCXQ7PQLL - exception (was mentioned before)
> APXBMVTEZIJCL7VYUN3KFSXLNETDMIKC - the first page is empty, but wasn't in the previous version.
>
> Tilman

Please create 2 tickets, so that those can't get lost. I forgot to do so for the first one :-(

Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
Re: 2.0.6 release ?
On 12.05.2017 at 15:23, Allison, Timothy B. wrote:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz
>
> Looks good to me on a very cursory look.

IMO there are two files that could be investigated:

5WULWDW54DAQ4ORVJSACEE2KCXQ7PQLL - exception (was mentioned before)
APXBMVTEZIJCL7VYUN3KFSXLNETDMIKC - the first page is empty, but wasn't in the previous version.

Tilman
RE: 2.0.6 release ?
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz Looks good to me on a very cursory look.
RE: 2.0.6 release ?
> It isn't that secret as Tim posted it somewhere in this thread :)

I've added throttling to httpd (I think) so we should be ok, and yes, the address is out in the open now.

Let me know if I should kick off another run.

Thank you, all!
Re: 2.0.6 release ?
On 10.05.2017 at 17:12, Tilman Hausherr wrote:
> Thanks for the test... the sum is still negative, but if we'd ignore the truncated files I bet we'd be positive.
>
> I have downloaded a few of the regressions but won't create issues this time, as yesterday's turned out to be duplicates. I'll wait for Andreas' next commit and will create issues only if these aren't solved. I guess the new exceptions aren't related.

I've already created an issue for the first one, PDFBOX-3788. I haven't had a chance to look at the second file. I just tested my fix for the first one and it still fails.

> @Andreas - ping me if you didn't keep the "secret" URL.

It isn't that secret as Tim posted it somewhere in this thread ...

> Some misc thoughts...
>
> 039800.pdf: "refinery's" is a different token than refinery. Shouldn't "refinery's" be three tokens? I mention this because refinery is probably in a dictionary.
>
> Some differences are because of a different treatment of the space in bad fonts. Some were improved, and some now look like this: "C I T I E S W I T H O U T D R U G S". There is an open issue about these. It is tricky because if we treat these like 1 word, we'd also lose spaces where we don't want to.
>
> commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z I can't find. I used
> http://XXX.XXX.XXX.XXX/docs/commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z
>
> Tilman
>
> On 10.05.2017 at 11:42, Allison, Timothy B. wrote:
>> Haven't had a chance to look. Reports are here:
>> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz
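To find extracts affected by the letter-spacing problem mentioned above ("C I T I E S W I T H O U T D R U G S"), a simple detector could be run over the extracted text. The following is only a sketch of such a heuristic: the class name `LetterSpacingCheck`, the method `looksLetterSpaced` and the `minRun` threshold are hypothetical, and this is not how PDFBox decides where to put spaces.

```java
// Hypothetical diagnostic helper, not part of PDFBox or tika-eval.
final class LetterSpacingCheck
{
    /**
     * Returns true if the text contains at least minRun consecutive
     * whitespace-separated tokens that are each a single letter.
     */
    static boolean looksLetterSpaced(String text, int minRun)
    {
        int run = 0;
        for (String token : text.trim().split("\\s+"))
        {
            if (token.length() == 1 && Character.isLetter(token.charAt(0)))
            {
                run++;
                if (run >= minRun)
                {
                    return true;
                }
            }
            else
            {
                // any longer token (or punctuation) interrupts the run
                run = 0;
            }
        }
        return false;
    }

    public static void main(String[] args)
    {
        System.out.println(looksLetterSpaced("C I T I E S W I T H O U T D R U G S", 5));
        System.out.println(looksLetterSpaced("This is a test", 5));
    }
}
```

This only flags candidates for inspection. It does not say where the real word boundaries are, which is the tricky part described above.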
RE: 2.0.6 release ?
> "Allison, Timothy B." wrote on 10 May 2017 at 11:42:
>
> Haven't had a chance to look. Reports are here:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz

Thanks again for running the report.

I had a quick look and there are 2 new exceptions. It seems to be a regression. I'm going to dig deeper later when I'm back home.

Here are 2 sample PDFs, one for each exception:

commoncrawl2/YV/YVFDWHF767TEYTT7IVFSLUIJTDF3YP57
commoncrawl2/5W/5WULWDW54DAQ4ORVJSACEE2KCXQ7PQLL

Andreas
RE: 2.0.6 release ?
Haven't had a chance to look. Reports are here: http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz
RE: 2.0.6 release ?
I won't have results immediately. :)

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 4:13 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 22:03, Allison, Timothy B. wrote:
> UGH. I'm so wrong. I accidentally had a 2.0.4.jar in my app/target...
>
> Off we go?

Yes! However it's 10pm here, so I won't be able to react to the results immediately.

Tilman
Re: 2.0.6 release ?
On 09.05.2017 at 22:03, Allison, Timothy B. wrote:
> UGH. I'm so wrong. I accidentally had a 2.0.4.jar in my app/target...
>
> Off we go?

Yes! However it's 10pm here, so I won't be able to react to the results immediately.

Tilman
RE: 2.0.6 release ?
UGH. I'm so wrong. I accidentally had a 2.0.4.jar in my app/target...

Off we go?

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not?

Tilman
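A stale jar like the 2.0.4.jar mentioned above can be spotted by asking the JVM where a class was actually loaded from. The sketch below uses only standard JDK calls; the class name `JarOrigin` and its two helper methods are hypothetical, and in a real check one would pass a class from the library under test (for example a PDFBox class) instead of `JarOrigin.class`.

```java
import java.security.CodeSource;

// Hypothetical helper for checking which jar (and which version) is on the classpath.
final class JarOrigin
{
    /** Returns the location a class was loaded from, or "unknown" if the JVM does not expose it. */
    static String locationOf(Class<?> type)
    {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null)
        {
            return "unknown";
        }
        return source.getLocation().toString();
    }

    /** Returns the Implementation-Version from the jar manifest, or "unknown" if it is absent. */
    static String versionOf(Class<?> type)
    {
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return (version == null) ? "unknown" : version;
    }

    public static void main(String[] args)
    {
        // Replace JarOrigin.class with a class from the library being evaluated.
        System.out.println(locationOf(JarOrigin.class));
        System.out.println(versionOf(JarOrigin.class));
    }
}
```

Whether the version string is available depends on the jar's manifest, so the location is usually the more reliable of the two values.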
RE: 2.0.6 release ?
With lots of empty pages...

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 3:57 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

Doh. AR can't open it. Sorry. Chrome appears to be able to open it.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 3:56 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K throws NPE and opens without complaint in AR.

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not?

Tilman
Re: 2.0.6 release ?
On 09.05.2017 at 21:34, Allison, Timothy B. wrote:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)? Has this been fixed, or would that cause unintended problems?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not?

Tilman
RE: 2.0.6 release ?
Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)? Has this been fixed, or would that cause unintended problems?

    /**
     * Returns true if the node is a page tree node (i.e. and intermediate).
     */
    private boolean isPageTreeNode(COSDictionary node )
    {
        // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
        // to check for the presence of Kids too
        return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
               node.containsKey(COSName.KIDS);
    }

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:20 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>> I've fixed all remaining regression tickets (in the end it was exactly 1)
> Great! Thank you!
>
> Let me know when I should kick off another eval.

Yes, please do.

Thanks

Tilman
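For illustration, here is a minimal sketch of the null guard asked about above. It does not use the PDFBox classes: a `Map<String, Object>` stands in for `COSDictionary`, and the string keys "Type", "Kids" and "Pages" stand in for `COSName.TYPE`, `COSName.KIDS` and `COSName.PAGES`. These stand-ins and the class name `PageTreeNodeCheck` are assumptions made so the example runs on its own. Whether a null node can legitimately reach this method is exactly the open question in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the relevant part of PDPageTree;
// Map<String, Object> plays the role of COSDictionary.
final class PageTreeNodeCheck
{
    /**
     * Returns true if the node is an intermediate page tree node.
     * A null node is reported as "not a page tree node" instead of causing an NPE.
     */
    static boolean isPageTreeNode(Map<String, Object> node)
    {
        if (node == null)
        {
            return false;
        }
        // Some files don't have Pages set as the Type, so the presence of Kids counts too.
        return "Pages".equals(node.get("Type")) || node.containsKey("Kids");
    }

    public static void main(String[] args)
    {
        Map<String, Object> pages = new HashMap<>();
        pages.put("Type", "Pages");
        Map<String, Object> leaf = new HashMap<>();
        leaf.put("Type", "Page");

        System.out.println(isPageTreeNode(null));
        System.out.println(isPageTreeNode(pages));
        System.out.println(isPageTreeNode(leaf));
    }
}
```

Returning false silently hides the broken entry from callers, so a real change would probably also need to decide whether such a node should be logged or skipped further up.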
Re: 2.0.6 release ?
On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>> I've fixed all remaining regression tickets (in the end it was exactly 1)
> Great! Thank you!
>
> Let me know when I should kick off another eval.

Yes, please do.

Thanks

Tilman
RE: 2.0.6 release ?
> I've fixed all remaining regression tickets (in the end it was exactly 1)

Great! Thank you!

Let me know when I should kick off another eval.
RE: 2.0.6 release ?
Added a page count comparison report under "content/":
http://162.242.228.174/reports/reports_pdfbox_2_0_6c.tar.gz

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 2:39 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

http://162.242.228.174/reports/reports_pdfbox_2_0_6b.tar.gz

Added CONTAINER_LENGTH to reports that have a file path. This is the length in bytes of the container file (as opposed to the embedded file).

Thank you!
RE: 2.0.6 release ?
http://162.242.228.174/reports/reports_pdfbox_2_0_6b.tar.gz

Added CONTAINER_LENGTH to reports that have a file path. This is the length in bytes of the container file (as opposed to the embedded file).

Thank you!

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation

Can you do me another favor? Have a column with the size in any table that is about individual files. I think it was there in the past, but I may be wrong. Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman
Re: 2.0.6 release ?
On 09.05.2017 at 19:52, Tilman Hausherr wrote:
> On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
>> Some exceptions to this rule include the following, but there are more... and overall, there is a fair amount of loss from 2.0.3.
>>
>> govdocs1/202/202097.pdf
>> govdocs1/358/358043.pdf
>> commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
>> commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56
>
> Thanks for the test... three of these four have been fixed; this was yet another case of trouble recognizing the end of inline images. All were created by "Leadtools". The fourth (202097.pdf) is in issue PDFBOX-3785.
>
> Most issues are probably related to truncated files. Some of these do not even display with Adobe Reader.
>
> Tilman

I've fixed all remaining regression tickets (in the end it was exactly 1).

@Tim Thanks for running the comparison
@Tilman Thanks for analyzing

Andreas
Re: 2.0.6 release ?
On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Some exceptions to this rule include the following, but there are more... and overall, there is a fair amount of loss from 2.0.3.
>
> govdocs1/202/202097.pdf
> govdocs1/358/358043.pdf
> commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
> commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56

Thanks for the test... three of these four have been fixed; this was yet another case of trouble recognizing the end of inline images. All were created by "Leadtools". The fourth (202097.pdf) is in issue PDFBOX-3785.

Most issues are probably related to truncated files. Some of these do not even display with Adobe Reader.

Tilman
RE: 2.0.6 release ?
Yes. Will do. Meetings beckon, so it will take a few hours. :(

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation

Can you do me another favor? Have a column with the size in any table that is about individual files. I think it was there in the past, but I may be wrong. Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman
Re: 2.0.6 release ?
On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation

Can you do me another favor? Have a column with the size in any table that is about individual files. I think it was there in the past, but I may be wrong. Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman
RE: 2.0.6 release ?
Content 1) To get a _general_ sense of overall content extract, see "content/ common_token_comparisons_by_mime.xlsx" This suggests that we've lost 248k "common words"[1], which out of 2.6 billion isn't much. However, we also lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an improvement. 2) If you want to compare content whether or not one there was a parse exception, see "content/content_diffs_with_exceptions.xlsx" 3) If you only want to see content diffs where both extracts did not have an exception, see "content/content_diffs_ignore_exceptions.xlsx". To make quick sense of the content_diffs_files, sort "NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files lost the most common tokens. To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which compare the number of unique tokens/tokens in common...a low number means little similarity, while a number close to 1.0 means that the unigrams are nearly identical. From a quick look, many of the files with fewer common words are in the "likely_broken" and or "truncated" subdirectories... Some exceptions to this rule include the following, but there are more...and overall, there is a fair amount of loss from 2.0.3. govdocs1/202/202097.pdf govdocs1/358/358043.pdf commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6 commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56 [1] For this version of tika-eval, I expanded Tilman's initial recommendation of common words for English a bit. I took the top 20k most common words (4 characters or more, except for CJK) for a large number of Wikipedia dumps. I removed common html markup words (body, form, table) so that failure to strip html doesn't incorrectly boost scores. We apply language id and then use the common words for that language. 
For example, for truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW
* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there were 320 tokens from the English list of common words.

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 08.05.2017 at 15:06, Allison, Timothy B. wrote:
> Happy to. Will kick off now?

Yes

Tilman

> -----Original Message-----
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:
>> On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
>
> Thanks
>
> Tilman
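To make the DICE_COEFFICIENT and OVERLAP columns mentioned in the report notes more concrete, here is a minimal Python sketch. It is not the tika-eval implementation: the whitespace tokenizer and the exact formulas (Dice over unique tokens, overlap over token counts) are assumptions based only on the description "number of unique tokens/tokens in common", and the function names are made up for illustration.

```python
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization, not tika-eval's analyzer.
    return text.lower().split()


def dice_coefficient(tokens_a, tokens_b):
    # Dice over the sets of unique tokens: 2 * |A & B| / (|A| + |B|).
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # two empty extracts are treated as identical
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def overlap(tokens_a, tokens_b):
    # Same idea, but counting repeated tokens (multiset intersection).
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    total = sum(count_a.values()) + sum(count_b.values())
    if total == 0:
        return 1.0
    shared = sum((count_a & count_b).values())
    return 2 * shared / total


# Hypothetical extracts: one clean, one with letters split apart by a bad font.
extract_a = tokenize("CITIES WITHOUT DRUGS")
extract_b = tokenize("C I T I E S WITHOUT DRUGS")
print(dice_coefficient(extract_a, extract_b))
print(overlap(extract_a, extract_b))
```

Under these assumptions, a value near 1.0 means the two extracts share almost all of their unigrams, while letter-spaced output like the second extract pulls both scores down.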
Re: 2.0.6 release ?
On 08.05.2017 at 15:06, Allison, Timothy B. wrote:
> Happy to. Will kick off now?

Yes

Tilman

> -----Original Message-----
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:
>> On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
>
> Thanks
>
> Tilman
RE: 2.0.6 release ?
Happy to. Will kick off now?

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, May 6, 2017 10:02 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:
> On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
>> Hi,
>>
>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
>> any objections?
> I'm targeting the 15th or 16th

Tim, could you please run your tests when time allows?

Thanks

Tilman
Re: 2.0.6 release ?
On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:
> On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
>> Hi,
>>
>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
>> any objections?
> I'm targeting the 15th or 16th

Tim, could you please run your tests when time allows?

Thanks

Tilman
Re: 2.0.6 release ?
On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
> Hi,
>
> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
> any objections?
>
> Andreas

I've added 2.0.7 as a new version to JIRA.

Andreas
Re: 2.0.6 release ?
On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
> Hi,
>
> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
> any objections?
>
> Andreas

I'm targeting the 15th or 16th.

Andreas
Re: 2.0.6 release ?
+1. I can work on some tickets over the weekend.

> On 02.05.2017 at 12:42, Andreas Lehmkühler <andr...@lehmi.de> wrote:
>
> Hi,
>
> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any
> objections?
>
> Andreas
Re: 2.0.6 release ?
On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:
> Hi,
>
> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
> any objections?

I'm always "+1" for new releases.

Tilman
2.0.6 release ?
Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any objections?

Andreas