Re: 2.0.6 release ?

2017-05-20 Thread Andreas Lehmkuehler

On 20.05.2017 at 16:17, Tilman Hausherr wrote:
> On 12.05.2017 at 15:23, Allison, Timothy B. wrote:
>> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz
>>
>> Looks good to me on a very cursory look.
>
> IMO there are two files that could be investigated:
>
> 5WULWDW54DAQ4ORVJSACEE2KCXQ7PQLL - exception (was mentioned before)
>
> APXBMVTEZIJCL7VYUN3KFSXLNETDMIKC - the first page is empty, but wasn't in the
> previous version.

Please create 2 tickets so that those don't get lost. I forgot to do so for the
first one :-(

Andreas

>
> Tilman



-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org







Re: 2.0.6 release ?

2017-05-20 Thread Tilman Hausherr

On 12.05.2017 at 15:23, Allison, Timothy B. wrote:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz
>
> Looks good to me on a very cursory look.


IMO there are two files that could be investigated:

5WULWDW54DAQ4ORVJSACEE2KCXQ7PQLL - exception (was mentioned before)

APXBMVTEZIJCL7VYUN3KFSXLNETDMIKC - the first page is empty, but wasn't 
in the previous version.


Tilman






RE: 2.0.6 release ?

2017-05-12 Thread Allison, Timothy B.

http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz

Looks good to me on a very cursory look.




RE: 2.0.6 release ?

2017-05-11 Thread Allison, Timothy B.
> It isn't that secret as Tim posted it somewhere in this thread

:)

I've added throttling to httpd (I think) so we should be ok, and yes, the address 
is out in the open now.

Let me know if I should kick off another run.

Thank you, all!





Re: 2.0.6 release ?

2017-05-10 Thread Andreas Lehmkuehler

On 10.05.2017 at 17:12, Tilman Hausherr wrote:
> Thanks for the test... the sum is still negative, but if we'd ignore the
> truncated files I bet we'd be positive.
>
> I have downloaded a few of the regressions but won't create issues this time, as
> yesterday's turned out to be duplicates. I'll wait for Andreas's next commit and
> will create issues only if these aren't solved.

I guess the new exceptions aren't related. I've already created an issue for the
first one, PDFBOX-3788.
I didn't have a chance to look at the second file. I just tested my fix for the
first one and it still fails.

> @Andreas - ping me if you didn't keep the "secret" URL.

It isn't that secret as Tim posted it somewhere in this thread ...

> Some misc thoughts...
>
> 039800.pdf: "refinery's" is a different token than "refinery". Shouldn't
> "refinery's" be three tokens? I mention this because "refinery" is probably in a
> dictionary.
>
> Some differences are because of a different treatment of the space in bad fonts.
> Some were improved, and some now look like this: "C I T I E S W I T H O U T D R U
> G S". There is an open issue about these. It is tricky because if we treat these
> like one word, we'd also lose spaces where we don't want to.
>
> I can't find commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z. I used
> http://XXX.XXX.XXX.XXX/docs/commoncrawl2/5N/5NSKV4CTVY4KT7R2FGY4XJDIK4PRLA4Z
>
> Tilman
>
> On 10.05.2017 at 11:42, Allison, Timothy B. wrote:
>> Haven't had a chance to look. Reports are here:
>> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz




RE: 2.0.6 release ?

2017-05-10 Thread Andreas Lehmkühler

> On 10 May 2017 at 11:42, "Allison, Timothy B." wrote:
> 
> 
> Haven't had a chance to look. Reports are here:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz

Thanks for running the report again.

I had a quick look and there are 2 new exceptions. This seems to be a regression. 
I'm going to dig deeper later when I'm back home.

Here are 2 sample PDFs, one for each exception:
commoncrawl2/YV/YVFDWHF767TEYTT7IVFSLUIJTDF3YP57
commoncrawl2/5W/5WULWDW54DAQ4ORVJSACEE2KCXQ7PQLL

Andreas




RE: 2.0.6 release ?

2017-05-10 Thread Allison, Timothy B.
Haven't had a chance to look. Reports are here:
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz


RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
I won't have results immediately.  :)

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, May 9, 2017 4:13 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 22:03, Allison, Timothy B. wrote:
> UGH.  I'm so wrong.  I accidentally had a 2.0.4.jar in my app/target...
>
> 
>
> Off we go?

Yes! However it's 10pm here, so I won't be able to react to the results 
immediately.

Tilman

>
>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, May 9, 2017 3:49 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> You caught me... I haven't checked these yet.
>
> But I did now, with
> MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
> 3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
> IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
> but they don't throw an NPE anymore now.
>
> Oops... I see I have that check you mention in my code, it has been there for 
> months and I forgot to make an issue. But after removing it, it still works 
> with the three files... so the question is, can this parameter ever be null, 
> or not?
>
> Tilman
>
> On 09.05.2017 at 21:34, Allison, Timothy B. wrote:
>> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 
>> new NPE exceptions)?  Has this been fixed, or would that cause unintended 
>> problems?
>>
>>   /**
>>    * Returns true if the node is a page tree node (i.e. and intermediate).
>>    */
>>   private boolean isPageTreeNode(COSDictionary node)
>>   {
>>       // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
>>       // to check for the presence of Kids too
>>       return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>>              node.containsKey(COSName.KIDS);
>>   }
>>
>> -Original Message-
>> From: Tilman Hausherr [mailto:thaush...@t-online.de]
>> Sent: Tuesday, May 9, 2017 3:20 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0.6 release ?
>>
>> On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>>>> I've fixed all remaining regression tickets (in the end it was 
>>>> exactly 1)
>>> Great!  Thank you!
>>>
>>> Let me know when I should kick off another eval.
>> Yes, please do.
>>
>> Thanks
>>
>> Tilman
>>




Re: 2.0.6 release ?

2017-05-09 Thread Tilman Hausherr

On 09.05.2017 at 22:03, Allison, Timothy B. wrote:
> UGH.  I'm so wrong.  I accidentally had a 2.0.4.jar in my app/target...
>
> Off we go?

Yes! However it's 10pm here, so I won't be able to react to the results 
immediately.


Tilman




-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for 
months and I forgot to make an issue. But after removing it, it still works 
with the three files... so the question is, can this parameter ever be null, or 
not?

Tilman

On 09.05.2017 at 21:34, Allison, Timothy B. wrote:

Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new 
NPE exceptions)?  Has this been fixed, or would that cause unintended problems?

  /**
   * Returns true if the node is a page tree node (i.e. and intermediate).
   */
  private boolean isPageTreeNode(COSDictionary node)
  {
      // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
      // to check for the presence of Kids too
      return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
             node.containsKey(COSName.KIDS);
  }

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:20 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>> I've fixed all remaining regression tickets (in the end it was
>> exactly 1)
> Great!  Thank you!
>
> Let me know when I should kick off another eval.

Yes, please do.

Thanks

Tilman




RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
UGH.  I'm so wrong.  I accidentally had a 2.0.4.jar in my app/target...



Off we go?


-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for 
months and I forgot to make an issue. But after removing it, it still works 
with the three files... so the question is, can this parameter ever be null, or 
not?

Tilman

On 09.05.2017 at 21:34, Allison, Timothy B. wrote:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 
> new NPE exceptions)?  Has this been fixed, or would that cause unintended 
> problems?
>
>  /**
>   * Returns true if the node is a page tree node (i.e. and intermediate).
>   */
>  private boolean isPageTreeNode(COSDictionary node)
>  {
>      // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
>      // to check for the presence of Kids too
>      return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>             node.containsKey(COSName.KIDS);
>  }
>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>>> I've fixed all remaining regression tickets (in the end it was 
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman
>


RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
With lots of empty pages...

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Tuesday, May 9, 2017 3:57 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

Doh.  AR can't open it.  Sorry.  Chrome appears to be able to open it.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 3:56 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K

throws NPE and opens without complaint in AR.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for 
months and I forgot to make an issue. But after removing it, it still works 
with the three files... so the question is, can this parameter ever be null, or 
not?

Tilman

On 09.05.2017 at 21:34, Allison, Timothy B. wrote:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 
> new NPE exceptions)?  Has this been fixed, or would that cause unintended 
> problems?
>
>  /**
>   * Returns true if the node is a page tree node (i.e. and intermediate).
>   */
>  private boolean isPageTreeNode(COSDictionary node)
>  {
>      // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
>      // to check for the presence of Kids too
>      return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>             node.containsKey(COSName.KIDS);
>  }
>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>>> I've fixed all remaining regression tickets (in the end it was 
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman
>


Re: 2.0.6 release ?

2017-05-09 Thread Tilman Hausherr

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been 
there for months and I forgot to make an issue. But after removing it, 
it still works with the three files... so the question is, can this 
parameter ever be null, or not?


Tilman

On 09.05.2017 at 21:34, Allison, Timothy B. wrote:

Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new 
NPE exceptions)?  Has this been fixed, or would that cause unintended problems?

 /**
  * Returns true if the node is a page tree node (i.e. and intermediate).
  */
 private boolean isPageTreeNode(COSDictionary node)
 {
     // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
     // to check for the presence of Kids too
     return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
            node.containsKey(COSName.KIDS);
 }

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:20 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>> I've fixed all remaining regression tickets (in the end it was
>> exactly 1)
> Great!  Thank you!
>
> Let me know when I should kick off another eval.

Yes, please do.

Thanks

Tilman




RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new 
NPE exceptions)?  Has this been fixed, or would that cause unintended problems?

/**
 * Returns true if the node is a page tree node (i.e. and intermediate).
 */
private boolean isPageTreeNode(COSDictionary node)
{
    // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
    // to check for the presence of Kids too
    return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
           node.containsKey(COSName.KIDS);
}
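
For illustration only, here is a minimal sketch of what the proposed null guard could look like. It is not the PDFBox code: a plain `Map` stands in for `COSDictionary` so that the example runs without PDFBox, and the class name `PageTreeNodeCheck` and the string keys "Type" and "Kids" are assumptions made for this sketch. Whether returning false for a null node is the right behaviour is exactly the open question above.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch only: a Map stands in for PDFBox's COSDictionary, and the keys
 * "Type" and "Kids" stand in for COSName.TYPE and COSName.KIDS.
 */
class PageTreeNodeCheck
{
    /** Returns true if the node looks like an intermediate page tree node. */
    static boolean isPageTreeNode(Map<String, Object> node)
    {
        // Proposed guard: treat a missing node as "not a page tree node"
        // instead of letting the calls below throw a NullPointerException.
        if (node == null)
        {
            return false;
        }
        // Same idea as the quoted method: accept Type == Pages, or fall back
        // to the presence of Kids for files that omit the type.
        return "Pages".equals(node.get("Type")) || node.containsKey("Kids");
    }

    public static void main(String[] args)
    {
        Map<String, Object> pages = new HashMap<>();
        pages.put("Type", "Pages");

        Map<String, Object> kidsOnly = new HashMap<>();
        kidsOnly.put("Kids", new Object[0]);

        Map<String, Object> leaf = new HashMap<>();
        leaf.put("Type", "Page");

        System.out.println("pages:    " + isPageTreeNode(pages));
        System.out.println("kidsOnly: " + isPageTreeNode(kidsOnly));
        System.out.println("leaf:     " + isPageTreeNode(leaf));
        System.out.println("null:     " + isPageTreeNode(null));
    }
}
```

A guard like this hides the symptom; it does not explain why a null node reaches the method in the first place, which may still be worth a ticket.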

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, May 9, 2017 3:20 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>> I've fixed all remaining regression tickets (in the end it was 
>> exactly 1)
> Great!  Thank you!
>
> Let me know when I should kick off another eval.


Yes, please do.

Thanks

Tilman





Re: 2.0.6 release ?

2017-05-09 Thread Tilman Hausherr

On 09.05.2017 at 21:01, Allison, Timothy B. wrote:
>> I've fixed all remaining regression tickets (in the end it was exactly 1)
> Great!  Thank you!
>
> Let me know when I should kick off another eval.

Yes, please do.

Thanks

Tilman




RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
>I've fixed all remaining regression tickets (in the end it was exactly 1)

Great!  Thank you!

Let me know when I should kick off another eval.




RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
Added a page count comparison report under "content/":

http://162.242.228.174/reports/reports_pdfbox_2_0_6c.tar.gz

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Tuesday, May 9, 2017 2:39 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

http://162.242.228.174/reports/reports_pdfbox_2_0_6b.tar.gz

Added CONTAINER_LENGTH to reports that have a file path.  This is the length in 
bytes of the container file (as opposed to the embedded file).

Thank you!

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation


Can you do me another favor? Have a column with the size in any table that is 
about individual files. I think it was there in the past, but I may be wrong.

Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman





RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
http://162.242.228.174/reports/reports_pdfbox_2_0_6b.tar.gz

Added CONTAINER_LENGTH to reports that have a file path.  This is the length in 
bytes of the container file (as opposed to the embedded file).

Thank you!

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation


Can you do me another favor? Have a column with the size in any table that is 
about individual files. I think it was there in the past, but I may be wrong.

Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman





Re: 2.0.6 release ?

2017-05-09 Thread Andreas Lehmkuehler


On 09.05.2017 at 19:52, Tilman Hausherr wrote:

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:

Content

1)  To get a _general_ sense of overall content extraction, see 
"content/common_token_comparisons_by_mime.xlsx".  This suggests that we've lost 248k 
"common words"[1], which out of 2.6 billion isn't much.  However, we also lost 
18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 
1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an 
improvement.

2)  If you want to compare content whether or not there was a parse 
exception, see "content/content_diffs_with_exceptions.xlsx".

3)  If you only want to see content diffs where neither extract had an 
exception, see "content/content_diffs_ignore_exceptions.xlsx".


To make quick sense of the content_diffs_files, sort 
"NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files 
lost the most common tokens.


To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, 
which compare the number of unique tokens/tokens in common...a low number 
means little similarity, while a number close to 1.0 means that the unigrams 
are nearly identical.



From a quick look, many of the files with fewer common words are in the 
"likely_broken" and/or "truncated" subdirectories...  Some exceptions to this 
rule include the following, but there are more...and overall, there is a fair 
amount of loss from 2.0.3.


govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56


Thanks for the test... three of these four have been fixed; this was yet another 
case of trouble recognizing the end of inline images. All were created by 
"Leadtools". The fourth (202097.pdf) is in issue PDFBOX-3785.


Most issues are probably related to truncated files. Some of these do not even 
display with Adobe Reader.

I've fixed all remaining regression tickets (in the end it was exactly 1)

@Tim Thanks for running the comparison
@Tilman Thanks for analyzing

Andreas




Tilman





[1] For this version of tika-eval, I expanded Tilman's initial recommendation 
of common words for English a bit.  I took the top 20k most common words (4 
characters or more, except for CJK) for a large number of Wikipedia dumps.  I 
removed common html markup words (body, form, table) so that failure to strip 
html doesn't incorrectly boost scores.


  We apply language id and then use the common words for that language.  For 
example, for 
truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW


* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 
tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there 
were 320 common words from the English list of common words.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 08.05.2017 at 15:06, Allison, Timothy B. wrote:

Happy to.  Will kick off now?

Yes

Tilman


-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, May 6, 2017 10:02 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:

On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:

Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
any objections?

I'm targeting the 15th or 16th

Tim, could you please run your tests when time allows?

Thanks

Tilman




Re: 2.0.6 release ?

2017-05-09 Thread Tilman Hausherr

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:

Content

1)  To get a _general_ sense of overall content extraction, see 
"content/common_token_comparisons_by_mime.xlsx".  This suggests that we've lost 248k 
"common words"[1], which out of 2.6 billion isn't much.  However, we also lost 
18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 
1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an 
improvement.

2)  If you want to compare content whether or not there was a parse 
exception, see "content/content_diffs_with_exceptions.xlsx".

3)  If you only want to see content diffs where neither extract had an 
exception, see "content/content_diffs_ignore_exceptions.xlsx".

To make quick sense of the content_diffs_files, sort 
"NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files 
lost the most common tokens.

To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which 
compare the number of unique tokens/tokens in common...a low number means 
little similarity, while a number close to 1.0 means that the unigrams are 
nearly identical.


From a quick look, many of the files with fewer common words are in the 
"likely_broken" and/or "truncated" subdirectories...  Some exceptions to this 
rule include the following, but there are more...and overall, there is a fair 
amount of loss from 2.0.3.

govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56


Thanks for the test... three of these four have been fixed; this was yet 
another case of trouble recognizing the end of inline images. All were created 
by "Leadtools". The fourth (202097.pdf) is in issue PDFBOX-3785.


Most issues are probably related to truncated files. Some of these do 
not even display with Adobe Reader.


Tilman





[1] For this version of tika-eval, I expanded Tilman's initial recommendation 
of common words for English a bit.  I took the top 20k most common words (4 
characters or more, except for CJK) for a large number of Wikipedia dumps.  I 
removed common html markup words (body, form, table) so that failure to strip 
html doesn't incorrectly boost scores.

  We apply language id and then use the common words for that language.  For 
example, for 
truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW

* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 
tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there 
were 320 common words from the English list of common words.
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 08.05.2017 at 15:06, Allison, Timothy B. wrote:

Happy to.  Will kick off now?

Yes

Tilman


-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, May 6, 2017 10:02 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 04.05.2017 at 18:10, Andreas Lehmkuehler wrote:

On 02.05.2017 at 12:42, Andreas Lehmkühler wrote:

Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
any objections?

I'm targeting the 15th or 16th

Tim, could you please run your tests when time allows?

Thanks

Tilman




RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
Y.  Will do.  Meetings beckon, so it will take a few hours. :(

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation


Can you do me another favor? Have a column with the size in any table that is 
about individual files. I think it was there in the past, but I may be wrong.

Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman





Re: 2.0.6 release ?

2017-05-09 Thread Tilman Hausherr

On 09.05.2017 at 02:43, Allison, Timothy B. wrote:
> Tilman's initial recommendation



Can you do me another favor? Have a column with the size in any table 
that is about individual files. I think it was there in the past, but I 
may be wrong.


Reason: I try to get small files to keep any "examples" for my 
regression tests.


Tilman





RE: 2.0.6 release ?

2017-05-08 Thread Allison, Timothy B.
Content

1)  To get a _general_ sense of overall content extraction, see 
"content/common_token_comparisons_by_mime.xlsx".  This suggests that we've lost 
248k "common words"[1], which out of 2.6 billion isn't much.  However, we also 
lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 
1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an 
improvement.

2)  If you want to compare content whether or not there was a parse 
exception, see "content/content_diffs_with_exceptions.xlsx".

3) If you only want to see content diffs where both extracts did not have an 
exception, see "content/content_diffs_ignore_exceptions.xlsx".

To make quick sense of the content_diffs_files, sort 
"NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files 
lost the most common tokens.
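
A minimal sketch of that sort, for anyone working outside a spreadsheet. It assumes the sheet has been loaded into rows, that NUM_COMMON_TOKENS_DIFF_IN_B is the common-token count in B minus the count in A (so the most negative rows lost the most), and it uses a hypothetical FILE_PATH column with invented values:

```python
# Toy rows standing in for a content_diffs sheet; names and values are invented.
rows = [
    {"FILE_PATH": "example/a.pdf", "NUM_COMMON_TOKENS_DIFF_IN_B": 12},
    {"FILE_PATH": "example/b.pdf", "NUM_COMMON_TOKENS_DIFF_IN_B": -1260},
    {"FILE_PATH": "example/c.pdf", "NUM_COMMON_TOKENS_DIFF_IN_B": -35},
]


def biggest_losers(rows, n=10):
    """Return up to n rows, smallest (most negative) diff first."""
    return sorted(rows, key=lambda r: r["NUM_COMMON_TOKENS_DIFF_IN_B"])[:n]


for row in biggest_losers(rows):
    print(row["FILE_PATH"], row["NUM_COMMON_TOKENS_DIFF_IN_B"])
```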

To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which 
compare the number of unique tokens/tokens in common...a low number means 
little similarity, while a number close to 1.0 means that the unigrams are 
nearly identical.
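
To make the two measures concrete, here is a rough sketch using one common formulation: Dice over unique tokens, and an overlap score that also counts repeated tokens. The function names and the handling of empty input are assumptions for illustration; tika-eval's exact computation may differ.

```python
from collections import Counter


def dice_coefficient(tokens_a, tokens_b):
    """2 * |unique tokens in common| / (|unique A| + |unique B|)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty extracts count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(tokens_a, tokens_b):
    """Like Dice, but over token counts: 2 * sum(min counts) / total tokens."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 1.0
    shared = sum((ca & cb).values())  # Counter & keeps the minimum count per token
    return 2 * shared / total


a = "the quick brown fox jumps over the lazy dog".split()
b = "the quick brown fox".split()
print(dice_coefficient(a, b), overlap(a, b))
```

Both scores are 1.0 when the two token lists match and 0.0 when they share nothing, which is why sorting ascending surfaces the files that changed the most.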


From a quick look, many of the files with fewer common words are in the 
"likely_broken" and/or "truncated" subdirectories...  Some exceptions to this 
rule include the following, but there are more...and overall, there is a fair 
amount of loss from 2.0.3.

govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56

[1] For this version of tika-eval, I expanded Tilman's initial recommendation 
of common words for English a bit.  I took the top 20k most common words (4 
characters or more, except for CJK) for a large number of Wikipedia dumps.  I 
removed common html markup words (body, form, table) so that failure to strip 
html doesn't incorrectly boost scores.
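
As a rough sketch of the common-words idea in [1] (not tika-eval's actual code): build a list from word frequencies, keep words of 4 or more characters, drop html markup words, then count how many tokens of an extract appear in the list. The function names, the tiny frequency table, and the whitespace tokenizer are assumptions for illustration, and the CJK exception to the length rule is left out.

```python
from collections import Counter

HTML_MARKUP_WORDS = {"body", "form", "table"}  # the examples named in [1]


def build_common_words(word_freqs, top_n=20000, min_len=4):
    """Pick the top_n most frequent words that pass the length and markup filters."""
    kept = Counter({w: c for w, c in word_freqs.items()
                    if len(w) >= min_len and w not in HTML_MARKUP_WORDS})
    return {w for w, _ in kept.most_common(top_n)}


def count_common_tokens(text, common_words):
    """Count tokens (repeats included) of the extracted text that are common words."""
    return sum(1 for tok in text.lower().split() if tok in common_words)


# Invented frequencies standing in for a Wikipedia-derived list.
freqs = {"the": 900, "with": 500, "table": 400, "from": 300, "which": 200}
common = build_common_words(freqs, top_n=3)
print(count_common_tokens("Text from the table which came with the file", common))
```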

We apply language id and then use the common words for that language.  For 
example, for 
truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW

* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 
tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there 
were 320 common words from the English list of common words.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
> Happy to.  Will kick off now?

Yes

Tilman

>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
>> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
>
> Thanks
>
> Tilman


-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Re: 2.0.6 release ?

2017-05-08 Thread Tilman Hausherr

Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:

Happy to.  Will kick off now?


Yes

Tilman



-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, May 6, 2017 10:02 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:

Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:

Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now,
any objections?

I'm targeting the 15th or 16th

Tim, could you please run your tests when time allows?

Thanks

Tilman


-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



RE: 2.0.6 release ?

2017-05-08 Thread Allison, Timothy B.
Happy to.  Will kick off now?

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Saturday, May 6, 2017 10:02 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>> Hi,
>>
>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>> any objections?
> I'm targeting the 15th or 16th

Tim, could you please run your tests when time allows?

Thanks

Tilman

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional 
commands, e-mail: dev-h...@pdfbox.apache.org




Re: 2.0.6 release ?

2017-05-06 Thread Tilman Hausherr

Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:

Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:

Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
any objections?

I'm targeting the 15th or 16th


Tim, could you please run your tests when time allows?

Thanks

Tilman

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org




Re: 2.0.6 release ?

2017-05-04 Thread Andreas Lehmkuehler

Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:

Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any 
objections?

I've added 2.0.7 as a new version in JIRA.

Andreas



Andreas


-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Re: 2.0.6 release ?

2017-05-04 Thread Andreas Lehmkuehler

Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:

Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any 
objections?

I'm targeting the 15th or 16th

Andreas



Andreas


-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Re: 2.0.6 release ?

2017-05-02 Thread Maruan Sahyoun
+1, I can work on some tickets over the weekend.


> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler <andr...@lehmi.de>:
> 
> Hi,
> 
> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any 
> objections?
> 
> Andreas


-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Re: 2.0.6 release ?

2017-05-02 Thread Tilman Hausherr

Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:

Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any 
objections?


I'm always "+1" for new releases.

Tilman

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



2.0.6 release ?

2017-05-02 Thread Andreas Lehmkühler
Hi,

I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, any 
objections?

Andreas

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org