RE: PDFBox 2.0.9 release?

2018-03-15 Thread Allison, Timothy B.
> PDFBOX-4153 is solved. How about cutting the release next Monday?
+1 and thank you!

Tim



RE: PDFBox 2.0.9 release?

2018-03-12 Thread Allison, Timothy B.
Reports are available here:

http://162.242.228.174/reports/pdfbox-2.0.9-pre-rc1_reports_2.tar.bz2




RE: PDFBox 2.0.9 release?

2018-03-12 Thread Allison, Timothy B.
> ok => Tim, please start again

Will start now.





RE: PDFBox 2.0.9 release?

2018-03-09 Thread Allison, Timothy B.
I'm happy to run the regression tests again when all final changes for 
2.0.9-RC1 are made.  I'm really excited to be able to include jbig2.  We'll 
start the Tika release process for 1.18 as soon as PDFBox 2.0.9 is available.

Thank you, all!

Cheers,

 Tim


RE: PDFBox 2.0.9 release?

2018-03-09 Thread Allison, Timothy B.
http://162.242.228.174/reports/pdfbox-2.0.9-pre-rc1_reports.tar.bz2

Looks good to me.  Only 3 new exceptions (all on truncated files), more common 
words.  No page diffs, no attachment diffs.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Thursday, March 8, 2018 3:52 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.9 release?

On 08.03.2018 at 21:35, Allison, Timothy B. wrote:
> I've kicked off an initial run with 2.0.9-SNAPSHOT on the regression corpus.  
> While I had some time, I wanted to see if there were any early indicators of 
> problems.
>
> Tilman, I didn't mean to steal this task from you!  We'll probably need 
> another run once there's agreement that 2.0.9-SNAPSHOT is really, truly ready 
> for rc1.

No problem. I've been too busy due to the many excellent patches we got in 
February and March, and now I'm somewhat exhausted. I'll be back in better 
shape on Saturday and will analyse the results.

Tilman


>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Wednesday, March 7, 2018 8:03 AM
> To: dev@pdfbox.apache.org
> Subject: RE: PDFBox 2.0.9 release?
>
> Argh.  Sorry for my delay.  Y. I have time, and I'm happy to help Tilman if 
> he'd prefer to lead the regression testing process again.
>
> Cheers,
>
> Tim
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For 
> additional commands, e-mail: dev-h...@pdfbox.apache.org
>






RE: PDFBox 2.0.9 release?

2018-03-08 Thread Allison, Timothy B.
I've kicked off an initial run with 2.0.9-SNAPSHOT on the regression corpus.  
While I had some time, I wanted to see if there were any early indicators of 
problems.

Tilman, I didn't mean to steal this task from you!  We'll probably need another 
run once there's agreement that 2.0.9-SNAPSHOT is really, truly ready for rc1.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Wednesday, March 7, 2018 8:03 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 2.0.9 release?

Argh.  Sorry for my delay.  Y. I have time, and I'm happy to help Tilman if 
he'd prefer to lead the regression testing process again.

Cheers,

   Tim


RE: PDFBox 2.0.9 release?

2018-03-07 Thread Allison, Timothy B.
Argh.  Sorry for my delay.  Y. I have time, and I'm happy to help Tilman if 
he'd prefer to lead the regression testing process again.

Cheers,

   Tim

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Monday, March 5, 2018 1:28 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.9 release?

[resending, as my first attempt swallowed my second command due to wrong 
formatting]

On 04.03.2018 at 13:33, Tilman Hausherr wrote:
> I have the time and I should do it, because I lost my notes from last 
> time, which had some hints and command lines that go beyond the 
> documentation on the web. These notes were on a USB stick that is 
> attached to my keyboard that is attached to my 2 PCs via a switch, so 
> I could access these notes regardless of which PC is on. That USB stick 
> was recently destroyed (thank you, KINGSTON!) by a static discharge likely 
> related to the dry winter air.
Argh, I know what you mean. I have to fight with the Fedora update process from 
time to time :-(

> However I need a few days to finish the issues I am working on, and 
> the issues targeted for 2.0.9. So Monday next week would be too early.
We are not in a hurry, take your time ...

Andreas

> 
> Tilman
> 
> On 04.03.2018 at 12:50, Andreas Lehmkuehler wrote:
>> Hi,
>>
>> now that we got the JBIG2 ImageIO out of the door it's time to 
>> release a new 2.0.x version of PDFBox.
>>
>> WDYT?
>>
>> @Tim, @Tilman
>> Do you have some time to run the regression test?
>>
>> Andreas
>>



RE: Running tika-eval on the Rackspace vm

2017-11-07 Thread Allison, Timothy B.
Great!  Thank you, Tilman!

I updated the wiki based on your feedback.  Let me know if I should add 
anything else while the experience is fresh.

Best,

 Tim

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Monday, November 6, 2017 3:00 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm

I think I was successful, the report now makes sense, as if Tim had created it 
himself :-) The two issues I just created are related to a comparison between 
2.0.8 and 2.0.4.

So for the next board report, we can now (in addition to the existing
text) say that there is a second committer who can run the tests.

Tilman

On 05.11.2017 at 22:06, Tilman Hausherr wrote:
> I've come closer to finding out what's happening. I found out that 
> tika-app was running with PDFBox 2.0.7 all the time regardless of what 
> pdfbox version is in the pom.xml.
>
> Apparently, building tika-app uses tika-parsers from the repository 
> (instead of building tika-parsers again), which needs 2.0.7.
> Explicitly building tika-parsers before building tika-app helps.
>
> This is new to me; in PDFBox, if one builds the app, all dependencies 
> are built as well.
>
> Tilman
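
A minimal sketch of the build order described in the quoted message, assuming 
a Tika 1.x source checkout with the tika-parsers and tika-app modules and 
Maven on the PATH. The DRY_RUN switch and the run and build_tika_app helpers 
are illustrative names, not part of Tika; -pl and -am are standard Maven 
options.

```shell
#!/bin/sh
# Build tika-parsers before tika-app so that tika-app picks up the locally
# built parsers (and the PDFBox version they were built against) instead of
# a tika-parsers artifact that is already in the repository.
# DRY_RUN=1 (the default in this sketch) only prints the commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_tika_app() {
    # Step 1: install the freshly built tika-parsers (and the modules it
    # depends on) into the local repository.
    run mvn -pl tika-parsers -am install -DskipTests || return 1
    # Step 2: build tika-app against that freshly installed artifact.
    run mvn -pl tika-app install -DskipTests
}

build_tika_app
```

Run it from the top of the checkout with DRY_RUN=0 to execute the two Maven 
commands rather than print them.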
>
> On 04.11.2017 at 14:48, Tilman Hausherr wrote:
>> So it's done:
>> /work/eval/pdfbox_2_0_4_Vs_2_0_8-SNAPSHOT_reports_03112017
>>
>> I wonder why the differences are so few, especially in meta where I 
>> KNOW that there are differences, due to the handling of empty strings 
>> with BOM. Maybe it is because I skipped the "A" phase and used 
>> existing data from a 2.0.4 run that I found, or because I use a 
>> current tika trunk and not the existing binary that was on the server.
>>
>> I'm thinking of creating a new "A" with 2.0.4 with current tika trunk 
>> and then comparing it with the "B" I did.
>>
>> Tilman
>>
>>
>> On 03.11.2017 at 22:14, Tilman Hausherr wrote:
>>> On 03.11.2017 at 21:38, Allison, Timothy B. wrote:
>>>> I'm not sure what you mean by...sorry
>>>>> - "H" is missing, which is identical to "C"
>>>
>>>
>>> I just meant the steps in https://wiki.apache.org/tika/TikaEvalOnVM
>>>
>>> In segment 3, "execute: nohup ./appBatchExecutor.sh &" is missing. 
>>> Of course it is obvious that it has to be done, but I am a 
>>> perfectionist. I'd like to have this documentation for the "me" in a 
>>> few months when I have forgotten what I did in the last few days. Or for 
>>> the next person.
>>>
>>> Thanks for the fixes you did. I wonder why writing to /tmp didn't 
>>> work - it did work from the command line. I've started the command 
>>> again, I'm not sure when I will report about it. I'm a bit exhausted 
>>> from non-software activities :-(
>>>
>>> Tilman
>>>
>>>
>>>
>>
>>



RE: Running tika-eval on the Rackspace vm

2017-11-03 Thread Allison, Timothy B.
Tilman,
  Thank you for the toe-stubbing.  I'm sorry that it wasn't easier...

I created a new user with collab permissions and ran through the process.

You are right about the privileges on the tmp directory... POI needs a tmp 
directory to write xlsx.  I created a tmp directory in /work/eval and added a 
direction to set tmp dir via -Djava.io.tmpdir=tmp
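
For reference, a sketch of how the report step might be run with its own temp 
directory. The jar name and the -db value are taken from the commands 
elsewhere in this thread; the prepare_tmpdir helper is hypothetical, and the 
sketch assumes it is run from /work/eval.

```shell
#!/bin/sh
# POI creates temporary files while writing xlsx output, so the JVM's temp
# directory must be writable by the user who runs the report.

prepare_tmpdir() {
    # Create the directory if needed and confirm that we can write to it.
    dir=$1
    mkdir -p "$dir" || return 1
    [ -w "$dir" ] || { echo "not writable: $dir" >&2; return 1; }
}

JAR=tika-eval-1.17-SNAPSHOT.jar

# Only launch the report if the temp dir is usable and the jar is present.
if prepare_tmpdir tmp && [ -f "$JAR" ]; then
    # java.io.tmpdir is the standard JVM system property for the temp dir.
    java -Djava.io.tmpdir=tmp -jar "$JAR" Report -db pdfboxAvsB
fi
```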

I'm not sure what you mean by...sorry
>- "H" is missing, which is identical to "C"

I updated the permissions on appBatchExecutor.sh

I also added a recommendation to umask g+rw before starting. 

Let me know if I need to fix anything else or if I missed something you've 
already identified. ☹

Thank you, again.

Best,

Tim

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Thursday, November 2, 2017 5:47 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm

I'm almost done... then I got this when doing the last step:


[tilman@cloud-server-02 eval]$ java -jar tika-eval-1.17-SNAPSHOT.jar Report -db pdfboxAvsB
0    [main] INFO  org.apache.tika.eval.reports.Report  - Writing report: All Mimes In A to mimes/all_mimes_A.xlsx
Exception in thread "main" java.io.IOException: Permission denied
     at java.io.UnixFileSystem.createFileExclusively(Native Method)
     at java.io.File.createTempFile(File.java:2024)
     at org.apache.poi.util.DefaultTempFileCreationStrategy.createTempFile(DefaultTempFileCreationStrategy.java:110)
     at org.apache.poi.util.TempFile.createTempFile(TempFile.java:66)
     at org.apache.poi.xssf.streaming.SXSSFWorkbook.write(SXSSFWorkbook.java:924)
     at org.apache.tika.eval.reports.Report.dumpXLSX(Report.java:85)
     at org.apache.tika.eval.reports.Report.writeReport(Report.java:64)
     at org.apache.tika.eval.reports.ResultsReporter.execute(ResultsReporter.java:305)
     at org.apache.tika.eval.reports.ResultsReporter.main(ResultsReporter.java:266)
     at org.apache.tika.eval.TikaEvalCLI.handleReport(TikaEvalCLI.java:264)
     at org.apache.tika.eval.TikaEvalCLI.execute(TikaEvalCLI.java:52)
     at org.apache.tika.eval.TikaEvalCLI.main(TikaEvalCLI.java:273)


I changed the source, and now I got the path, it is 
/work/eval/reports/mimes/all_mimes_A.xlsx . The file exists and it is empty.

I tried with a 1.16 version and the same happened.

Then I thought, maybe the file with the permission problem isn't the target at 
all; could this be some temp file / temp directory where I don't have 
permission?

smaller improvements for the documentation:

- appBatchExecutor.sh should have 775 permission or the documentation should 
have "nohup sh ./appBatchExecutor.sh &"

- "H" is missing, which is identical to "C"

- mention that "pdfboxAvsB" db files are to be removed before starting? 
I had accidentally aborted a run and couldn't restart.
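
The last point can be sketched as a small pre-flight step. It assumes that 
the -db pdfboxAvsB option leaves H2 files named pdfboxAvsB.mv.db and 
pdfboxAvsB.trace.db behind; that naming is an assumption and is not confirmed 
anywhere in this thread, so check what is actually in the directory first. 
The move_stale_db helper is hypothetical.

```shell
#!/bin/sh
# Move database files left over from an aborted Compare run out of the way
# so that the next run starts from a clean state. Files are moved, not
# deleted, so nothing is lost if the assumption about the names is wrong.

move_stale_db() {
    db=$1                       # database name as passed to -db
    backup="$db.stale.$$"       # backup directory, unique per process
    for f in "$db.mv.db" "$db.trace.db"; do   # assumed H2 file names
        if [ -e "$f" ]; then
            mkdir -p "$backup" && mv "$f" "$backup/" || return 1
            echo "moved $f to $backup/"
        fi
    done
}

move_stale_db pdfboxAvsB
```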


Tilman

memo for me:


java -jar tika-eval-1.17-SNAPSHOT.jar Compare \
  -extractsA /data4/batch_runs/pdfbox_2_0_4 \
  -extractsB /data4/batch_runs/pdfbox_2_0_9-SNAPSHOT1 \
  -db pdfboxAvsB

java -jar tika-eval-1.17-SNAPSHOT.jar Report -db pdfboxAvsB



RE: Running tika-eval on the Rackspace vm

2017-11-01 Thread Allison, Timothy B.
Sorry. Fixed.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, October 31, 2017 6:08 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm

On 31.10.2017 at 20:53, Allison, Timothy B. wrote:
>> It's not possible to rename / remove the files / directories mentioned in 
>> part 1 due to not having the permissions.
> Gah.  Sorry.  Tilman, I added you to "collab" and chgrp to collab on /work 
> /data2/docs /data3/batch_runs and /data4/batch_runs.

But the directories themselves don't have "w" rights for the group, so I can't 
benefit from my membership... (unless I missed something; I haven't done much 
*nix since the '90s). For example, I can't rename 
/work/batch-apps/tika_working/logs to /work/batch-apps/tika_working/___logs .
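
A small sketch of the point about group write rights, run in a scratch 
directory so that it is safe to try. The perms helper and the directory names 
under the scratch directory are illustrative; on the VM, the owner of the real 
directories would have to run the chmod.

```shell
#!/bin/sh
# Group membership only helps if the group permission bits allow writing.
# Renaming an entry needs write (and search) permission on its parent
# directory, not on the entry itself.

perms() {
    # Print the ten-character mode string of a file or directory.
    ls -ld "$1" | cut -c1-10
}

demo=$(mktemp -d)
mkdir -p "$demo/tika_working/logs"

chmod 755 "$demo/tika_working"      # group can read and enter, but not write
perms "$demo/tika_working"          # prints drwxr-xr-x

chmod g+w "$demo/tika_working"      # grant the group write access
perms "$demo/tika_working"          # prints drwxrwxr-x
```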

Tilman


>
>> The directory is named batch-apps, not batch_apps.
> Fixed.  Thank you.
>
>> Re the "A" version - is this the "good" version, so I could simply  download 
>> tika-app and put it there? Or just build tika with a specific  PDFBox 
>> version?
> If the current version of tika-app has the right version of PDFBox for your 
> "before" examples, then y, you can just download tika-app.jar.  We release 
> less frequently than PDFBox, so it's possible that you'll want to build from 
> scratch with the most recent previous release of PDFBox.
>
> In my mind, A is the "before/baseline" version and B is the 
> SNAPSHOT/RC version.  So, hopefully, B is the "good" one. 😊
>
> Let me know what other problems you encounter.
>
> Cheers,
>
>   Tim
>
>
>




RE: Running tika-eval on the Rackspace vm

2017-10-31 Thread Allison, Timothy B.
> It's not possible to rename / remove the files / directories mentioned in 
> part 1 due to not having the permissions.

Gah.  Sorry.  Tilman, I added you to "collab" and chgrp to collab on /work 
/data2/docs /data3/batch_runs and /data4/batch_runs.

> The directory is named batch-apps, not batch_apps.
Fixed.  Thank you.

> Re the "A" version - is this the "good" version, so I could simply  download 
> tika-app and put it there? Or just build tika with a specific  PDFBox version?

If the current version of tika-app has the right version of PDFBox for your 
"before" examples, then y, you can just download tika-app.jar.  We release less 
frequently than PDFBox, so it's possible that you'll want to build from scratch 
with the most recent previous release of PDFBox.

In my mind, A is the "before/baseline" version and B is the SNAPSHOT/RC 
version.  So, hopefully, B is the "good" one. 😊

Let me know what other problems you encounter.

Cheers,

 Tim




RE: Running tika-eval on the Rackspace vm

2017-10-31 Thread Allison, Timothy B.
Will fix both.  Thank you!

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Monday, October 30, 2017 4:21 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm

It's not possible to rename / remove the files / directories mentioned in part 
1 due to not having the permissions.

Tilman

On 30.10.2017 at 14:14, Tilman Hausherr wrote:
> I almost had some time today, so I had a look at 
> https://wiki.apache.org/tika/TikaEvalOnVM
>
> The directory is named batch-apps, not batch_apps.
>
> Re the "A" version - is this the "good" version, so I could simply 
> download tika-app and put it there? Or just build tika with a specific 
> PDFBox version?
>
> Tilman
>
> On 23.10.2017 at 20:54, Allison, Timothy B. wrote:
>> All,
>>
>> If anyone would like to join the fun in running tika-eval on the 
>> Rackspace vm, I posted this:
>> https://wiki.apache.org/tika/TikaEvalOnVM .  You’ll need access to 
>> the vm, of course, but I’m happy to grant that to anyone who wants to 
>> chip in and help with regression tests.  There are some areas for 
>> improvements in the process and documentation. 😊
>>
>> Cheers,
>>
>>  Tim
>>
>> P.S. For those who used the vm earlier and found it wonky, it was 
>> indeed wonky because I had failed to add a swap file.  With that 
>> change in place, the vm works quite well.
>>
>>
>>
>
>





RE: [VOTE] Release Apache PDFBox 2.0.8

2017-10-30 Thread Allison, Timothy B.
+1

Thank you, Andreas, Tilman, and team!

Cheers,

Tim

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Monday, October 30, 2017 3:57 PM
To: dev@pdfbox.apache.org
Subject: Re: [VOTE] Release Apache PDFBox 2.0.8

On 30.10.2017 at 19:47, Andreas Lehmkuehler wrote:
> Hi,
>
> a candidate for the PDFBox 2.0.8 release is available at:
>
>     https://dist.apache.org/repos/dist/dev/pdfbox/2.0.8/
>
> The release candidate is a zip archive of the sources in:
>
>     http://svn.apache.org/repos/asf/pdfbox/tags/2.0.8/
>
> The SHA1 checksum of the archive is
> 5c0607144dde1b7af3dd428cafbd2c9c29617ab3.
>
> Please vote on releasing this package as Apache PDFBox 2.0.8.
> The vote is open for the next 72 hours and passes if a majority of at 
> least three +1 PDFBox PMC votes are cast.
>
>     [ ] +1 Release this package as Apache PDFBox 2.0.8
>     [ ] -1 Do not release this package because...
>
>
> Here is my +1

+1

Tilman
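
The SHA1 checksum quoted above can be checked locally before voting. A 
minimal sketch, assuming GNU coreutils' sha1sum is available; the verify_sha1 
helper and the archive file name in the usage comment are assumptions, not 
taken from the vote mail.

```shell
#!/bin/sh
# Compare a file's SHA-1 digest with an expected value.

verify_sha1() {
    file=$1
    expected=$2
    # sha1sum prints "<digest>  <file>"; keep only the digest.
    actual=$(sha1sum "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Hypothetical usage; the archive file name is an assumption:
# verify_sha1 pdfbox-2.0.8-src.zip 5c0607144dde1b7af3dd428cafbd2c9c29617ab3
```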






RE: 2.0.8?

2017-10-27 Thread Allison, Timothy B.
Results:

http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take5.tar.gz


Haven't had a chance to review, nor have I had a chance to add the extra 
columns I promised. ☹


-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Thursday, October 26, 2017 1:26 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 26.10.2017 at 19:12, Andreas Lehmkuehler wrote:
> Thanks Tim, looked promising.
>
> I'm planning to cut my second attempt next Monday, if no one objects.

+1

Tilman

>
> @Tim I don't expect any new regressions, but if you have some cycles, 
> you might kick off another run.
>
> Andreas
>
> On 23.10.2017 at 20:11, Allison, Timothy B. wrote:
>> Reports here:
>> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take4.tar.gz
>>
>> I haven't looked yet.
>>
>> -Original Message-
>> From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
>> Sent: Sunday, October 22, 2017 4:15 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0.8?
>>
>> @Tim I've fixed the last open regression in 2.0.8, Tilman's test run 
>> hasn't shown any regressions. Please re-run your tests to see if we 
>> can proceed with 2.0.8, I'd really like to push it out.
>>
>> TIA again,
>> Andreas
>>
>>
>> On 08.10.2017 at 16:11, Andreas Lehmkuehler wrote:
>>> On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>>>>
>>>>> And yes, we need another regression run if possible
>>>>
>>>> Sounds good.  Will do once I hear that we're good to go. Thank you!
>>> We are good now.
>>>
>>> @Tim: Could you please re-run your test to see how good we are?
>>>
>>> TIA,
>>> Andreas
>>>
>>>>



RE: 2.0.8?

2017-10-26 Thread Allison, Timothy B.
+1

Will do.

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Thursday, October 26, 2017 1:12 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

Thanks Tim, looked promising.

I'm planning to cut my second attempt next Monday, if no one objects.

@Tim I don't expect any new regressions, but if you have some cycles, you might 
kick off another run.

Andreas

On 23.10.2017 at 20:11, Allison, Timothy B. wrote:
> Reports here:
> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take4.tar.gz
> 
> I haven't looked yet.
> 
> -Original Message-
> From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
> Sent: Sunday, October 22, 2017 4:15 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.8?
> 
> @Tim I've fixed the last open regression in 2.0.8, Tilman's test run 
> hasn't shown any regressions. Please re-run your tests to see if 
> we can proceed with 2.0.8, I'd really like to push it out.
> 
> TIA again,
> Andreas
> 
> 
> On 08.10.2017 at 16:11, Andreas Lehmkuehler wrote:
>> On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>>>
>>>> And yes, we need another regression run if possible
>>>
>>> Sounds good.  Will do once I hear that we're good to go.  Thank you!
>> We are good now.
>>
>> @Tim: Could you please re-run your test to see how good we are?
>>
>> TIA,
>> Andreas
>>
>>>
>>> 



Running tika-eval on the Rackspace vm

2017-10-23 Thread Allison, Timothy B.
All,

If anyone would like to join the fun in running tika-eval on the Rackspace vm, 
I posted this: https://wiki.apache.org/tika/TikaEvalOnVM .  You’ll need access 
to the vm, of course, but I’m happy to grant that to anyone who wants to chip 
in and help with regression tests.  There are some areas for improvements in 
the process and documentation. 😊

Cheers,

Tim

P.S. For those who used the vm earlier and found it wonky, it was indeed wonky 
because I had failed to add a swap file.  With that change in place, the vm 
works quite well.





RE: 2.0.8?

2017-10-23 Thread Allison, Timothy B.
Reports here:
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take4.tar.gz 

I haven't looked yet.

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Sunday, October 22, 2017 4:15 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

@Tim I've fixed the last open regression in 2.0.8, Tilman's test run hasn't 
shown any regressions. Please re-run your tests to see if we can proceed 
with 2.0.8, I'd really like to push it out.

TIA again,
Andreas


On 08.10.2017 at 16:11, Andreas Lehmkuehler wrote:
> On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>>
>>> And yes, we need another regression run if possible
>>
>> Sounds good.  Will do once I hear that we're good to go.  Thank you!
> We are good now.
> 
> @Tim: Could you please re-run your test to see how good we are?
> 
> TIA,
> Andreas
> 
>>




RE: 2.0.8?

2017-10-23 Thread Allison, Timothy B.
Kicked off process.

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Sunday, October 22, 2017 4:15 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

@Tim I've fixed the last open regression in 2.0.8, Tilman's test run hasn't 
shown any regressions. Please re-run your tests to see if we can proceed 
with 2.0.8, I'd really like to push it out.

TIA again,
Andreas


On 08.10.2017 at 16:11, Andreas Lehmkuehler wrote:
> On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>>
>>> And yes, we need another regression run if possible
>>
>> Sounds good.  Will do once I hear that we're good to go.  Thank you!
> We are good now.
> 
> @Tim: Could you please re-run your test to see how good we are?
> 
> TIA,
> Andreas
> 
>>




RE: 2.0.8?

2017-10-10 Thread Allison, Timothy B.

> However, PDFBox 2.0.8-SNAPSHOT has more 0s, 1s, 2s and 3s...
>
> The TOP_10_MORE_IN_B column in the contents report shows that there are 15 
> more 0's, 15 more 1's, 11 more 2's, etc.
>
> 0: 15 | 1: 15 | 2: 11 | 20: 5 | 3: 2 | 4: 2

> Yeah, but where do they come from? Not from the pure text extraction. In the 
> json files, I see that there are many "0:", "1:" in the new file. I wonder if 
> this is about AcroForm fields? Can be seen e.g. near b12c96nfdate36.

Sorry, right, AcroForm.  We're now getting some children we weren't before.

2.0.8-SNAPSHOT:
@@b12c96nfdate362: 
0:   
1:   
2: 20  

b12c96nfdate362: 20

2.0.7:
@@b12c96nfdate362: 
b12c96nfdate362: 20


RE: 2.0.8?

2017-10-10 Thread Allison, Timothy B.
If we're talking about the same file...same number of pages, attachments and 
common words.

However, PDFBox 2.0.8-SNAPSHOT has more 0s, 1s, 2s and 3s...

The TOP_10_MORE_IN_B column in the contents report shows that there are 15 more 
0's, 15 more 1's, 11 more 2's, etc.

0: 15 | 1: 15 | 2: 11 | 20: 5 | 3: 2 | 4: 2
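
As an illustration of what a "more in B" column expresses, the sketch below 
counts whitespace-separated tokens in two text files and lists the tokens that 
occur more often in B than in A. It is not tika-eval's implementation, which 
has its own tokenization; the more_in_b function and the file names are made 
up for the example.

```shell
#!/bin/sh
# List the (up to ten) tokens that occur more often in file B than in
# file A, formatted as "token: difference | token: difference".

more_in_b() {
    a=$1
    b=$2
    awk '
        # Tokens from the first file (A) count as -1, tokens from B as +1.
        FILENAME == ARGV[1] { for (i = 1; i <= NF; i++) diff[$i]--; next }
                            { for (i = 1; i <= NF; i++) diff[$i]++ }
        END {
            for (t in diff) if (diff[t] > 0) print diff[t], t
        }
    ' "$a" "$b" |
    sort -k1,1nr -k2,2 |    # largest difference first, ties ordered by token
    head -n 10 |
    awk '{ printf "%s%s: %s", (NR > 1 ? " | " : ""), $2, $1 }
         END { if (NR) print "" }'
}

demo=$(mktemp -d)
printf '20\n' > "$demo/a.txt"              # stand-in for the older extract
printf '0 1 2 20\n0 1\n' > "$demo/b.txt"   # stand-in for the newer extract
more_in_b "$demo/a.txt" "$demo/b.txt"      # prints 0: 2 | 1: 2 | 2: 1
```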



-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, October 10, 2017 11:47 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 09.10.2017 at 22:26, Allison, Timothy B. wrote:
> Thank you, Andreas, for fixing the slow parse on the corrupt file so quickly!
>
> Reports are here:
> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz
>

Tim, can you please find out what we lost with 254348.pdf? It's not in the text 
extraction, so I assume it's some meta data but I don't see where.

Tilman





RE: 2.0.8?

2017-10-10 Thread Allison, Timothy B.
Sorry.  I just saw this.  I ln'd the json extracts so that you can pull them 
easily:

http://162.242.228.174/extracts/pdfbox_2_0_7/
http://162.242.228.174/extracts/pdfbox_2_0_8-SNAPSHOT/

I'll take a look at 254348.pdf.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, October 10, 2017 11:47 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 09.10.2017 at 22:26, Allison, Timothy B. wrote:
> Thank you, Andreas, for fixing the slow parse on the corrupt file so quickly!
>
> Reports are here:
> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz
>

Tim, can you please find out what we lost with 254348.pdf? It's not in the text 
extraction, so I assume it's some meta data but I don't see where.

Tilman





RE: 2.0.8?

2017-10-09 Thread Allison, Timothy B.
Apologies, but I haven't gotten around to adding the exception columns in the 
content comparison tables, including the "page count diffs" table.

I also haven't had a chance to read/make sense of the reports yet, but I wanted 
to share asap.

Best,

Tim

-Original Message-
From: Allison, Timothy B. 
Sent: Monday, October 9, 2017 4:26 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?

Thank you, Andreas, for fixing the slow parse on the corrupt file so quickly!

Reports are here:
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, October 9, 2017 8:02 AM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?

Starting process now.

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, October 8, 2017 10:12 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
> 
>> And yes, we need another regression run if possible
> 
> Sounds good.  Will do once I hear that we're good to go.  Thank you!
We are good now.

@Tim: Could you please re-run your test to see how good we are?

TIA,
Andreas

> 




RE: 2.0.8?

2017-10-09 Thread Allison, Timothy B.
Thank you, Andreas, for fixing the slow parse on the corrupt file so quickly!

Reports are here:
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz


-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, October 9, 2017 8:02 AM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?

Starting process now.

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, October 8, 2017 10:12 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
> 
>> And yes, we need another regression run if possible
> 
> Sounds good.  Will do once I hear that we're good to go.  Thank you!
We are good now.

@Tim: Could you please re-run your test to see how good we are?

TIA,
Andreas

> 




RE: 2.0.8?

2017-10-09 Thread Allison, Timothy B.
Starting process now.

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Sunday, October 8, 2017 10:12 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
> 
>> And yes, we need another regression run if possible
> 
> Sounds good.  Will do once I hear that we're good to go.  Thank you!
We are good now.

@Tim: Could you please re-run your test to see how good we are?

TIA,
Andreas

> 




RE: 2.0.8?

2017-10-03 Thread Allison, Timothy B.

>And yes, we need another regression run if possible

Sounds good.  Will do once I hear that we're good to go.  Thank you!




RE: 2.0.8?

2017-10-03 Thread Allison, Timothy B.
>>Let me know when we're ready for another round.
>I've already started ...


RC2?  No need for another regression run?

Thank you again!


Re: 2.0.8?

2017-10-03 Thread Allison, Timothy B.
All,

  Again, my apologies for post-useful/late results!  Ugh...  

  Thank you, Andreas and Tilman!

  Let me know when we're ready for another round.

  Cheers,

   Tim
-Original Message-
From: Andreas Lehmkühler (JIRA) [mailto:j...@apache.org] 
Sent: Tuesday, October 3, 2017 8:23 AM
To: dev@pdfbox.apache.org
Subject: [jira] [Resolved] (PDFBOX-3949) NPE in bfSearchForObjStreams


 [ 
https://issues.apache.org/jira/browse/PDFBOX-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-3949.

   Resolution: Fixed
Fix Version/s: 3.0.0, 2.0.8

I've optimized the brute force search for object streams.

Thanks [~talli...@mitre.org] and [~tilman] for the finding

> NPE in bfSearchForObjStreams
> 
>
> Key: PDFBOX-3949
> URL: https://issues.apache.org/jira/browse/PDFBOX-3949
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.8
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.8, 3.0.0
>
> Attachments: MKFYUGZWS3OPXLLVU2Z4LWCTVA5WNOGF.pdf
>
>
> {code}
> java.lang.NullPointerException: null
> org.apache.pdfbox.pdfparser.COSParser.bfSearchForObjStreams(COSParser.java:1738)
> org.apache.pdfbox.pdfparser.COSParser.bfSearchForObjects(COSParser.java:1529)
> org.apache.pdfbox.pdfparser.COSParser.getBFCOSObjectOffsets(COSParser.java:1445)
> org.apache.pdfbox.pdfparser.COSParser.rebuildTrailer(COSParser.java:1905)
> {code}
> This worked in 2.0.7. The exception happens in 39 files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




RE: 2.0.8?

2017-10-02 Thread Allison, Timothy B.

> Re 308576.pdf: the text extraction has a huge loss, but a manual check shows 
> it is identical. However that file has the NPE from PDActionURI.getURI(), 
> could it be that this results in an abort of text extraction?
Same for 569017.pdf.

Likely.  There are two "per file pair contents" files.  The one ending with 
"_ignore_exceptions.xlsx" means that results are not reported if there was an 
exception caught for one of the files (308576.pdf and 569017.pdf aren't in that 
file).  The other one "*_with_exceptions" includes both.  Based on your 
feedback, I should add 2 boolean cols to "*_with_exceptions.xlsx" for 
exceptionInA and exceptionInB?
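
A minimal sketch of how the two proposed columns could relate the two report files. The class name `FilePairRow`, its fields, and the file name `example.pdf` are hypothetical and are not taken from the actual report schema; the flag values are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical row of a "per file pair contents" report; not the real schema.
final class FilePairRow {
    final String fileName;
    final boolean exceptionInA; // an exception was caught for the file in run A
    final boolean exceptionInB; // an exception was caught for the file in run B

    FilePairRow(String fileName, boolean exceptionInA, boolean exceptionInB) {
        this.fileName = fileName;
        this.exceptionInA = exceptionInA;
        this.exceptionInB = exceptionInB;
    }

    // Rows that would remain in an "ignore exceptions" view:
    // a pair is dropped if either side threw.
    static List<FilePairRow> withoutExceptions(List<FilePairRow> rows) {
        List<FilePairRow> kept = new ArrayList<>();
        for (FilePairRow row : rows) {
            if (!row.exceptionInA && !row.exceptionInB) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FilePairRow> rows = Arrays.asList(
                new FilePairRow("308576.pdf", false, true),    // illustrative flags
                new FilePairRow("example.pdf", false, false)); // hypothetical file
        for (FilePairRow row : withoutExceptions(rows)) {
            System.out.println(row.fileName);
        }
    }
}
```

With such flags, the "ignore exceptions" file is simply a filtered view of the "with exceptions" file, so a reader of the full file can tell at a glance whether a large content loss coincides with an exception.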


RE: 2.0.8?

2017-10-02 Thread Allison, Timothy B.
Sorry all for taking longer than expected!  File under "this information would 
have been useful..." ☹

-Original Message-
From: Allison, Timothy B. 
Sent: Monday, October 2, 2017 3:59 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?

Reports are here:
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take2.tar.gz

Looks like some new NPEs.  I'll take a look at the metadata diffs.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, October 2, 2017 9:24 AM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?

>>>Email originates from a non-MITRE system. Use caution.<<<

Sounds good.  

I kicked off the eval process yesterday, but because of a bug in our 
config-file reader and/or user error in modifying the config file, I wound up 
with 500k pdfs parsed by our EmptyParser...no results.

I restarted the eval process just now. I should have results in 6 hours.



-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, October 1, 2017 6:31 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 25.09.2017 at 18:39, Andreas Lehmkuehler wrote:
> On 25.09.2017 at 12:30, Maruan Sahyoun wrote:
>> Hi,
>>>> Andreas Lehmkuehler wrote on 13 September 2017 at 20:33:
>>>>
>>>>
>>>> Due to the responses I'm planning to cut the release on Monday the 
>>>> 25th
>>>
>>> I'm still working on a solution for PDFBOX-3934 to avoid the 
>>> regression with PDFBOX-3318. Should we postpone the release for a 
>>> couple of days or a week max? Or should I simply revert my changes?
>>
>> I'd go for postponing in order to fix that regression - what about 
>> setting the date to next Monday?
> OK, let's postpone, I'm targeting next Monday. Thanks for your 
> patience ;-)
Just a friendly reminder, I'm going to cut the release in about 30 hours from 
now.

Andreas

> 
> Andreas
>>
>> BR
>> Maruan
>>
>>>
>>> WDYT?
>>>
> 
> 






RE: 2.0.8?

2017-10-02 Thread Allison, Timothy B.
Reports are here:
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take2.tar.gz

Looks like some new NPEs.  I'll take a look at the metadata diffs.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, October 2, 2017 9:24 AM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?

Sounds good.  

I kicked off the eval process yesterday, but because of a bug in our 
config-file reader and/or user error in modifying the config file, I wound up 
with 500k pdfs parsed by our EmptyParser...no results.

I restarted the eval process just now. I should have results in 6 hours.



-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, October 1, 2017 6:31 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 25.09.2017 at 18:39, Andreas Lehmkuehler wrote:
> On 25.09.2017 at 12:30, Maruan Sahyoun wrote:
>> Hi,
>>>> Andreas Lehmkuehler wrote on 13 September 2017 at 20:33:
>>>>
>>>>
>>>> Due to the responses I'm planning to cut the release on Monday the 
>>>> 25th
>>>
>>> I'm still working on a solution for PDFBOX-3934 to avoid the 
>>> regression with PDFBOX-3318. Should we postpone the release for a 
>>> couple of days or a week max? Or should I simply revert my changes?
>>
>> I'd go for postponing in order to fix that regression - what about 
>> setting the date to next Monday?
> OK, let's postpone, I'm targeting next Monday. Thanks for your 
> patience ;-)
Just a friendly reminder, I'm going to cut the release in about 30 hours from 
now.

Andreas

> 
> Andreas
>>
>> BR
>> Maruan
>>
>>>
>>> WDYT?
>>>
> 
> 






RE: 2.0.8?

2017-10-02 Thread Allison, Timothy B.
Sounds good.  

I kicked off the eval process yesterday, but because of a bug in our 
config-file reader and/or user error in modifying the config file, I wound up 
with 500k pdfs parsed by our EmptyParser...no results.

I restarted the eval process just now. I should have results in 6 hours.



-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Sunday, October 1, 2017 6:31 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 25.09.2017 at 18:39, Andreas Lehmkuehler wrote:
> On 25.09.2017 at 12:30, Maruan Sahyoun wrote:
>> Hi,
>>>> Andreas Lehmkuehler wrote on 13 September 2017 at 20:33:
>>>>
>>>>
>>>> Due to the responses I'm planning to cut the release on Monday the 
>>>> 25th
>>>
>>> I'm still working on a solution for PDFBOX-3934 to avoid the 
>>> regression with PDFBOX-3318. Should we postpone the release for a 
>>> couple of days or a week max? Or should I simply revert my changes?
>>
>> I'd go for postponing in order to fix that regression - what about 
>> setting the date to next Monday?
> OK, let's postpone, I'm targeting next Monday. Thanks for your 
> patience ;-)
Just a friendly reminder, I'm going to cut the release in about 30 hours from 
now.

Andreas

> 
> Andreas
>>
>> BR
>> Maruan
>>
>>>
>>> WDYT?
>>>
> 
> 






RE: 2.0.8?

2017-09-25 Thread Allison, Timothy B.
> I'd go for postponing in order to fix that regression - what about setting 
> the date to next Monday?

+1 I’m happy pushing it out later if the fix happens >= Friday and we want to 
run the full regression tests again.

Thank you, Andreas!




RE: 2.0.8?

2017-09-18 Thread Allison, Timothy B.
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8-SNAPSHOT_reports.tar.gz

is now available.  I haven't yet had a chance to look at either...

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, September 18, 2017 12:51 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?

Reports for 2.0.4 vs 2.0.8-SNAPSHOT (r1808067) are available:

http://162.242.228.174/reports/pdfbox_2_0_4_Vs_2_0_8-SNAPSHOT_reports.tar.gz

I'll post 2.0.7 vs 2.0.8-SNAPSHOT in the next few hours.



-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Wednesday, September 13, 2017 2:33 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

Due to the responses I'm planning to cut the release on Monday the 25th

Andreas

On 12.09.2017 at 06:43, Andreas Lehmkuehler wrote:
> Good idea, there are already a lot of solved tickets for 2.0.8
> 
> @all Is there anything pending which should be included?
> 
> How about cutting the release in a week or two from now?
> 
> @Tim please run a test 2.0.7 vs. 2.0.8 if possible
> 
> Andreas
> 
> On 11.09.2017 at 23:24, Allison, Timothy B. wrote:
>>> I hope there aren't any new regressions.
>>
>> Happy to help find them!  :)
>>
>> On a related note, do we have a sense of the schedule for PDFBox 
>> 2.0.8?  I'd like to include it in Tika's last Java 7 release...end of 
>> Sept, middle of Oct., or whenever 2.0.8 is out. :)
>>
>>
>> -Original Message-
>> From: Andreas Lehmkühler (JIRA) [mailto:j...@apache.org]
>> Sent: Monday, September 11, 2017 4:52 PM
>> To: dev@pdfbox.apache.org
>> Subject: [jira] [Comment Edited] (PDFBOX-3928)
>> IllegalArgumentException: root cannot be null with truncated file
>>
>>
>>  [
>> https://issues.apache.org/jira/browse/PDFBOX-3928?page=com.atlassian.
>> jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1
>> 6161965#comment-16161965
>> ]
>>
>> Andreas Lehmkühler edited comment on PDFBOX-3928 at 9/11/17 8:51 PM:
>> -
>>
>> Both cases are tricky (PDFBOX-3798 is truncated within an object and 
>> the attached pdf has a truncated xref table), so that I had to 
>> improve the brute force search one more time.
>> [~tilman] thanks for the finding. I hope there aren't any new regressions.
>>
>>
>> was (Author: lehmi):
>> Both cases are tricky, so that I had to improve the brute force search 
>> one more time.
>> [~tilman] thanks for the finding. I hope there aren't any new regressions.
>>
>>> IllegalArgumentException: root cannot be null with truncated file
>>> -
>>>
>>>  Key: PDFBOX-3928
>>>  URL: 
>>> https://issues.apache.org/jira/browse/PDFBOX-3928
>>>  Project: PDFBox
>>>   Issue Type: Bug
>>>   Components: Parsing
>>>     Affects Versions: 2.0.7
>>>     Reporter: Tilman Hausherr
>>>     Assignee: Andreas Lehmkühler
>>>   Labels: regression
>>>  Fix For: 2.0.8, 3.0.0
>>>
>>>  Attachments: 023505.pdf
>>>
>>>
>>> {code}
>>> java.lang.IllegalArgumentException: root cannot be null
>>>  org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:75)
>>>  org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:129)
>>>  org.apache.pdfbox.pdmodel.PDDocument.getPages(PDDocument.java:1388)
>>>  org.apache.pdfbox.debugger.ui.DocumentEntry.getPageCount(DocumentEntry.java:42)
>>>  org.apache.pdfbox.debugger.ui.PDFTreeModel.getChildCount(PDFTreeModel.java:195)
>>>  java.desktop/java.beans.PropertyChangeSupport.fire(Unknown Source)
>>>  java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unknown Source)
>>>  java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unknown Source)
>>>  org.apache.pdfbox.debugger.PDFDebugger.initTree(PDFDebugger.java:1288)
>>>  org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1235)
>>>  org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:12

RE: 2.0.8?

2017-09-18 Thread Allison, Timothy B.
Reports for 2.0.4 vs 2.0.8-SNAPSHOT (r1808067) are available:

http://162.242.228.174/reports/pdfbox_2_0_4_Vs_2_0_8-SNAPSHOT_reports.tar.gz

I'll post 2.0.7 vs 2.0.8-SNAPSHOT in the next few hours.



-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Wednesday, September 13, 2017 2:33 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

Due to the responses I'm planning to cut the release on Monday the 25th

Andreas

On 12.09.2017 at 06:43, Andreas Lehmkuehler wrote:
> Good idea, there are already a lot of solved tickets for 2.0.8
> 
> @all Is there anything pending which should be included?
> 
> How about cutting the release in a week or two from now?
> 
> @Tim please run a test 2.0.7 vs. 2.0.8 if possible
> 
> Andreas
> 
> On 11.09.2017 at 23:24, Allison, Timothy B. wrote:
>>> I hope there aren't any new regressions.
>>
>> Happy to help find them!  :)
>>
>> On a related note, do we have a sense of the schedule for PDFBox 
>> 2.0.8?  I'd like to include it in Tika's last Java 7 release...end of 
>> Sept, middle of Oct., or whenever 2.0.8 is out. :)
>>
>>
>> -Original Message-
>> From: Andreas Lehmkühler (JIRA) [mailto:j...@apache.org]
>> Sent: Monday, September 11, 2017 4:52 PM
>> To: dev@pdfbox.apache.org
>> Subject: [jira] [Comment Edited] (PDFBOX-3928) 
>> IllegalArgumentException: root cannot be null with truncated file
>>
>>
>>  [
>> https://issues.apache.org/jira/browse/PDFBOX-3928?page=com.atlassian.
>> jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1
>> 6161965#comment-16161965
>> ]
>>
>> Andreas Lehmkühler edited comment on PDFBOX-3928 at 9/11/17 8:51 PM:
>> -
>>
>> Both cases are tricky (PDFBOX-3798 is truncated within an object and 
>> the attached pdf has a truncated xref table), so that I had to 
>> improve the brute force search one more time.
>> [~tilman] thanks for the finding. I hope there aren't any new regressions.
>>
>>
>> was (Author: lehmi):
>> Both cases are tricky, so that I had to improve the brute force search 
>> one more time.
>> [~tilman] thanks for the finding. I hope there aren't any new regressions.
>>
>>> IllegalArgumentException: root cannot be null with truncated file
>>> -
>>>
>>>  Key: PDFBOX-3928
>>>  URL: 
>>> https://issues.apache.org/jira/browse/PDFBOX-3928
>>>  Project: PDFBox
>>>   Issue Type: Bug
>>>   Components: Parsing
>>>     Affects Versions: 2.0.7
>>>     Reporter: Tilman Hausherr
>>>     Assignee: Andreas Lehmkühler
>>>   Labels: regression
>>>  Fix For: 2.0.8, 3.0.0
>>>
>>>  Attachments: 023505.pdf
>>>
>>>
>>> {code}
>>> java.lang.IllegalArgumentException: root cannot be null
>>>  org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:75)
>>>  org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:129)
>>>  org.apache.pdfbox.pdmodel.PDDocument.getPages(PDDocument.java:1388)
>>>  org.apache.pdfbox.debugger.ui.DocumentEntry.getPageCount(DocumentEntry.java:42)
>>>  org.apache.pdfbox.debugger.ui.PDFTreeModel.getChildCount(PDFTreeModel.java:195)
>>>  java.desktop/java.beans.PropertyChangeSupport.fire(Unknown Source)
>>>  java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unknown Source)
>>>  java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unknown Source)
>>>  org.apache.pdfbox.debugger.PDFDebugger.initTree(PDFDebugger.java:1288)
>>>  org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1235)
>>>  org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1218)
>>>  org.apache.pdfbox.debugger.PDFDebugger.main(PDFDebugger.java:1209)
>>>  org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:85)
>>> {code}
>>> This worked in 2.0.6, but no longer in 2.0.7. It happens since [
>>> https://svn.apache.org/r1795705 ] of PDFBOX-3798.
>>
>

RE: 2.0.8?

2017-09-12 Thread Allison, Timothy B.
> because I'm ill but I expect to be my old self later this week.

I'm sorry to hear it!  I hope that you are feeling better soon!

> I'd also like to have a test from version 2.0.4 compared to trunk because 
> 2.0.5 was the version where the tests weren't done, the problems were fixed in 
> 2.0.6 but at that time we tested only 2.0.5 against 2.0.6.

I was just thinking the same thing, but without the specific versions in mind.  
:) Great idea.  Will do over the next week...






2.0.8?

2017-09-11 Thread Allison, Timothy B.
>I hope there aren't any new regressions.

Happy to help find them!  :)

On a related note, do we have a sense of the schedule for PDFBox 2.0.8?  I'd 
like to include it in Tika's last Java 7 release...end of Sept, middle of Oct., 
or whenever 2.0.8 is out. :)


-Original Message-
From: Andreas Lehmkühler (JIRA) [mailto:j...@apache.org] 
Sent: Monday, September 11, 2017 4:52 PM
To: dev@pdfbox.apache.org
Subject: [jira] [Comment Edited] (PDFBOX-3928) IllegalArgumentException: root 
cannot be null with truncated file


[ 
https://issues.apache.org/jira/browse/PDFBOX-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161965#comment-16161965
 ] 

Andreas Lehmkühler edited comment on PDFBOX-3928 at 9/11/17 8:51 PM:
-

Both cases are tricky (PDFBOX-3798 is truncated within an object and the 
attached pdf has a truncated xref table), so that I had to improve the brute 
force search one more time. 
[~tilman] thanks for the finding. I hope there aren't any new regressions.


was (Author: lehmi):
Both cases are tricky, so that I had to improve the brute force search one more 
time. 
[~tilman] thanks for the finding. I hope there aren't any new regressions.

> IllegalArgumentException: root cannot be null with truncated file
> -
>
> Key: PDFBOX-3928
> URL: https://issues.apache.org/jira/browse/PDFBOX-3928
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.7
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.8, 3.0.0
>
> Attachments: 023505.pdf
>
>
> {code}
> java.lang.IllegalArgumentException: root cannot be null
> org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:75)
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:129)
> org.apache.pdfbox.pdmodel.PDDocument.getPages(PDDocument.java:1388)
> org.apache.pdfbox.debugger.ui.DocumentEntry.getPageCount(DocumentEntry.java:42)
> org.apache.pdfbox.debugger.ui.PDFTreeModel.getChildCount(PDFTreeModel.java:195)
> java.desktop/java.beans.PropertyChangeSupport.fire(Unknown Source)
> java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unknown Source)
> java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unknown Source)
> org.apache.pdfbox.debugger.PDFDebugger.initTree(PDFDebugger.java:1288)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1235)
> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1218)
> org.apache.pdfbox.debugger.PDFDebugger.main(PDFDebugger.java:1209)
> org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:85)
> {code}
> This worked in 2.0.6, but no longer in 2.0.7. It happens since [ 
> https://svn.apache.org/r1795705 ] of PDFBOX-3798.






RE: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively

2017-08-15 Thread Allison, Timothy B.
Thank you, Maruan!  I opened PDFBOX-3898 after breaking out the spec...I may be 
misreading it, tho!

-Original Message-
From: Maruan Sahyoun [mailto:sahy...@fileaffairs.de] 
Sent: Tuesday, August 15, 2017 11:58 AM
To: dev@pdfbox.apache.org
Subject: Re: [jira] [Commented] (TIKA-2442) Non-terminal interactive form 
fields not handled recursively

Hi Tim,

> On 15.08.2017 at 17:31, Allison, Timothy B. wrote:
> 
> All,
>  I can't tell if the triggering file is corrupt or how we want to handle it 
> on the PDFBox side.  The problem is that the parent node is a PDTextField -- 
> a PDTerminalField -- so we don't/can't look for children, even though it 
> actually does have pointers in Kids.

I had a quick look with the debugger and the file looks fine. There is nothing 
wrong with a non-terminal field having a field type /FT while its kids (terminal 
fields) do not. In that case the kids inherit the field type.
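
The inheritance described here, together with the fully qualified names from the TIKA-2442 report, can be sketched with a toy field tree. This is an illustration under stated assumptions, not the PDFBox API: the class `FormField` and its members are hypothetical, and the model ignores widget annotations that may also appear in Kids.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of an AcroForm field tree; deliberately NOT the PDFBox API.
final class FormField {
    final String partialName; // the /T entry
    final String fieldType;   // the /FT entry, or null if absent on this node
    final String value;       // the /V entry, or null
    final FormField parent;
    final List<FormField> kids = new ArrayList<>();

    FormField(String partialName, String fieldType, String value, FormField parent) {
        this.partialName = partialName;
        this.fieldType = fieldType;
        this.value = value;
        this.parent = parent;
        if (parent != null) {
            parent.kids.add(this);
        }
    }

    // /FT is inheritable: walk up until some ancestor defines it.
    String inheritedFieldType() {
        for (FormField f = this; f != null; f = f.parent) {
            if (f.fieldType != null) {
                return f.fieldType;
            }
        }
        return null;
    }

    // Fully qualified name: partial names joined with '.' from the root down.
    String fullyQualifiedName() {
        return parent == null
                ? partialName
                : parent.fullyQualifiedName() + "." + partialName;
    }

    // Recurse through non-terminal fields and report only terminal ones.
    static void collectTerminals(FormField field, List<String> out) {
        if (field.kids.isEmpty()) {
            out.add(field.fullyQualifiedName() + ": " + field.value);
            return;
        }
        for (FormField kid : field.kids) {
            collectTerminals(kid, out);
        }
    }

    public static void main(String[] args) {
        FormField parent = new FormField("parent", "Tx", null, null);
        new FormField("child1", null, "child1 value", parent);
        new FormField("child2", null, "child2 value", parent);
        List<String> lines = new ArrayList<>();
        collectTerminals(parent, lines);
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

The point of the sketch is that the presence of /FT on a node says nothing about whether it is terminal; only the absence of child fields does, so a reader that decides "terminal" from the field type alone would stop at `parent`.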

Which version of PDFBox is Tika 1.14 on?

BR
Maruan


> 
> The output from PrintFields is:
> 
> 1 top-level fields were found on the form
> |--parent.parent = ,  
> |type=org.apache.pdfbox.pdmodel.interactive.form.PDTextField
> 
> -Original Message-
> From: Tim Allison (JIRA) [mailto:j...@apache.org]
> Sent: Monday, August 14, 2017 10:36 AM
> To: d...@tika.apache.org
> Subject: [jira] [Commented] (TIKA-2442) Non-terminal interactive form 
> fields not handled recursively
> 
> 
>[ 
> https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jir
> a.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125
> 756#comment-16125756 ]
> 
>> Non-terminal interactive form fields not handled recursively
>> 
>> 
>>Key: TIKA-2442
>>URL: https://issues.apache.org/jira/browse/TIKA-2442
>>Project: Tika
>> Issue Type: Bug
>> Components: parser
>>   Affects Versions: 1.14
>>   Reporter: Christopher Creutzig
>>Attachments: simple-form.pdf
>> 
>> 
>> (I am not sure if this is a Tika or a PDFBox problem; I tried finding 
>> a form extractor in PDFBox, but the app api does not have one. PDFDebugger 
>> does show me the expected tree structure.) The attached PDF has a 
>> non-terminal field named “parent” and two children, “child1” and “child2.” 
>> According to the PDF spec in section 8.6, the fully qualified field names 
>> should be parent.child1 and parent.child2. That is the output given by pdftk:
>>> pdftk simple-form.pdf dump_data_fields
>> ---
>> FieldType: Text
>> FieldName: parent.child1
>> FieldFlags: 0
>> FieldValue: child1 value
>> FieldJustification: Left
>> ---
>> FieldType: Text
>> FieldName: parent.child2
>> FieldFlags: 0
>> FieldValue: child2 value
>> FieldJustification: Left
>> Tika with the ToXMLContentHandler seems to silently ignore the children, 
>> however, returning only a parent with no value.
>> Calling code:
>> import java.io.FileInputStream;
>> import org.apache.tika.detect.DefaultDetector;
>> import org.apache.tika.detect.Detector;
>> import org.apache.tika.metadata.Metadata;
>> import org.apache.tika.parser.AutoDetectParser;
>> import org.apache.tika.parser.ParseContext;
>> import org.apache.tika.parser.Parser;
>> import org.apache.tika.sax.ToXMLContentHandler;
>> class readAsXHTML {
>>   // Parse the file with Tika and return the XHTML output as a string.
>>   public static String readAsXHTML(String filename) throws Exception {
>>     ToXMLContentHandler handler = new ToXMLContentHandler();
>>     Detector detector = new DefaultDetector();
>>     Parser parser = new AutoDetectParser(detector);
>>     ParseContext context = new ParseContext();
>>     Metadata metadata = new Metadata();
>>     // try-with-resources closes the stream even if parsing throws
>>     try (FileInputStream fh = new FileInputStream(filename)) {
>>       parser.parse(fh, handler, metadata, context);
>>       return handler.toString();
>>     }
>>   }
>> }
>> Abbreviated output:
>> 
>> 
>>parent: 
>> 
>> 
>> 
>> Expected:
>> 
>> 
>> 
>>  parent.child1: child1 value
>>  parent.child2: child2 value   
> 
> 
> 
> 





FW: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively

2017-08-15 Thread Allison, Timothy B.
All,
  I can't tell if the triggering file is corrupt or how we want to handle it on 
the PDFBox side.  The problem is that the parent node is a PDTextField -- a 
PDTerminalField -- so we don't/can't look for children, even though it actually 
does have pointers in Kids.

The output from PrintFields is:

1 top-level fields were found on the form
|--parent.parent = ,  
type=org.apache.pdfbox.pdmodel.interactive.form.PDTextField

-Original Message-
From: Tim Allison (JIRA) [mailto:j...@apache.org] 
Sent: Monday, August 14, 2017 10:36 AM
To: d...@tika.apache.org
Subject: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields 
not handled recursively


[ 
https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125756#comment-16125756
 ] 

> Non-terminal interactive form fields not handled recursively
> 
>
> Key: TIKA-2442
> URL: https://issues.apache.org/jira/browse/TIKA-2442
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Christopher Creutzig
> Attachments: simple-form.pdf
>
>
> (I am not sure if this is a Tika or a PDFBox problem; I tried finding 
> a form extractor in PDFBox, but the app api does not have one. PDFDebugger 
> does show me the expected tree structure.) The attached PDF has a 
> non-terminal field named “parent” and two children, “child1” and “child2.” 
> According to the PDF spec in section 8.6, the fully qualified field names 
> should be parent.child1 and parent.child2. That is the output given by pdftk:
> > pdftk simple-form.pdf dump_data_fields
> ---
> FieldType: Text
> FieldName: parent.child1
> FieldFlags: 0
> FieldValue: child1 value
> FieldJustification: Left
> ---
> FieldType: Text
> FieldName: parent.child2
> FieldFlags: 0
> FieldValue: child2 value
> FieldJustification: Left
> Tika with the ToXMLContentHandler seems to silently ignore the children, 
> however, returning only a parent with no value.
> Calling code:
> import java.io.FileInputStream;
> import org.apache.tika.detect.DefaultDetector;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.sax.ToXMLContentHandler;
> class readAsXHTML {
>   // Parse the file with Tika and return the XHTML output as a string.
>   public static String readAsXHTML(String filename) throws Exception {
>     ToXMLContentHandler handler = new ToXMLContentHandler();
>     Detector detector = new DefaultDetector();
>     Parser parser = new AutoDetectParser(detector);
>     ParseContext context = new ParseContext();
>     Metadata metadata = new Metadata();
>     // try-with-resources closes the stream even if parsing throws
>     try (FileInputStream fh = new FileInputStream(filename)) {
>       parser.parse(fh, handler, metadata, context);
>       return handler.toString();
>     }
>   }
> }
> Abbreviated output:
> 
> 
> parent: 
> 
> 
> 
> Expected:
> 
> 
> 
>   parent.child1: child1 value
>   parent.child2: child2 value   






FW: Tika content detection and crawled "remote" content

2017-07-05 Thread Allison, Timothy B.
All,

> If anyone is interested in using the detected MIME types or anything else 
> from Common Crawl - I'm happy to help!  The URL index [4] now contains a new 
> field "mime-detected" which makes it easy to search or grep for confusion 
> pairs.

This is an amazing step forward for sampling PDF files from Common Crawl.  I 
used to rely on the http-headers and/or file suffix, but now we also have 
Tika's judgment on every file in Common Crawl.

We still have to deal with the 1MB truncation (I think), but this is an amazing 
development.  Thank you, Sebastian!

Cheers,

 Tim

-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
Sent: Tuesday, July 4, 2017 6:18 AM
To: u...@tika.apache.org
Subject: Tika content detection and crawled "remote" content

Hi,

recently I've plugged Tika's content detection into Common Crawl's crawler 
(modified Nutch) with the goal of getting a clean and correct MIME type - the HTTP 
Content-Type may contain garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of content types sent by the 
server in the HTTP header and as detected by Tika 1.15 [2].  It shows that 
content types by Tika are definitely clean
(1,400 different content types vs. more than 6,000 content type "strings" from 
HTTP headers).

A look at the "confusions" where Content-Type and Tika differ shows a mixed 
picture: some pairs are plausible, e.g., if Tika changes the type to a more 
precise subtype or detects a MIME type at all:

            Tika-1.15              HTTP-Content-Type
1001968023  application/xhtml+xml  text/html
   2298146  application/rss+xml    text/xml
    617435  application/rss+xml    application/xml
    613525  text/html              unk
    361525  application/xhtml+xml  unk
    297707  application/rdf+xml    application/xml

However, there are a few dubious decisions, esp. the group of web server-side 
scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

         Tika-1.15          HTTP-Content-Type
2047739  text/x-php         text/html
 681629  text/asp           text/html
 193095  text/x-coldfusion  text/html
 172318  text/aspdotnet     text/html
 139033  text/x-jsp         text/html
  38415  text/x-cgi         text/html
  32092  text/x-php         text/xml
  18021  text/x-perl        text/html

Of course, due to misconfigurations some servers may deliver the script files 
unmodified but in general I wouldn't expect that this happens for millions of 
pages.  I've checked some of the affected URLs:

- HTML fragment (no declaration of  or  opening tag)

https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
http://www.privi.com/product-details.asp?cno=C10910011
http://mental-ray.de/Root_alt/Default.asp
http://ekyrs.org/support/index.php?action=profile
http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

- (overlong) comment block at start of HTML which "masks" the HTML declaration
http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24

http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6

https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
https://de.e-stories.org/categories.php?&lan=nl&art=p

- HTML with some scripting fragments ("") present:
http://www.eco-ani-yao.org/shien/

- others are clearly HTML (looks more like a bug, at least, there is no simple 
explanation)
http://www.proedinc.com/customer/content.aspx?redid=9

http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79


Obviously, certain file suffixes (.php, .aspx) should get less weight than the 
Content-Type sent by the responding server.
Now my question: where's the best place to fix this, in the crawler [3] or in 
Tika?
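
To make the proposed weighting concrete, here is a minimal sketch, assuming the 
rule lives in a small helper that sees both the detected type and the HTTP 
Content-Type header.  ContentTypeResolver and resolve() are hypothetical names, 
not Nutch or Tika API, and the list of script types is simply copied from the 
table above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch or Tika. */
final class ContentTypeResolver {

    // Server-side script types from the confusion table; a suffix such as
    // .php or .aspx can push detection towards these.
    private static final Set<String> SCRIPT_TYPES = new HashSet<>(Arrays.asList(
            "text/x-php", "text/asp", "text/aspdotnet", "text/x-coldfusion",
            "text/x-jsp", "text/x-cgi", "text/x-perl"));

    /** Returns the MIME type to record for a fetched page. */
    static String resolve(String detected, String httpContentType) {
        String declared = baseType(httpContentType);
        if (declared.isEmpty()) {
            return detected; // nothing usable from the server, keep the detection
        }
        if (detected == null || SCRIPT_TYPES.contains(detected)) {
            // The response is normally the script's output, so trust the server.
            return declared;
        }
        return detected;
    }

    // Strips parameters such as "; charset=UTF-8" and normalizes case.
    private static String baseType(String header) {
        if (header == null) {
            return "";
        }
        int semi = header.indexOf(';');
        String base = semi >= 0 ? header.substring(0, semi) : header;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

The sketch only shows the rule itself; whether it belongs in the crawler or in 
Tika is exactly the open question.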

If anyone is interested in using the detected MIME types or anything else from 
Common Crawl - I'm happy to help!  The URL index [4] now contains a new field 
"mime-detected", which makes it easy to search or grep for confusion pairs.


Thanks and best,
Sebastian


[1] https://github.com/commoncrawl/nutch/issues/3
[2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
    https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
[3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
[4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/


-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: d

RE: tika-eval

2017-05-22 Thread Allison, Timothy B.
Ha.  I hadn't realized the video was available until this post.  Thank you!

> And here is the talk about it Tim gave at ApacheCon
>
> https://youtu.be/vRPTPMwI53k?list=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp
>
> I've enjoyed it (the video). 

So did I!

Tilman






RE: [VOTE] Release Apache PDFBox 2.0.6

2017-05-12 Thread Allison, Timothy B.
+1 

Thank you!

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Friday, May 12, 2017 12:13 PM
To: dev@pdfbox.apache.org
Subject: [VOTE] Release Apache PDFBox 2.0.6

Hi,

a candidate for the PDFBox 2.0.6 release is available at:

 https://dist.apache.org/repos/dist/dev/pdfbox/2.0.6/

The release candidate is a zip archive of the sources in:

 http://svn.apache.org/repos/asf/pdfbox/tags/2.0.6/

The SHA1 checksum of the archive is cb04fa19058efca6913a45490ac66cf44ecf273a.

Please vote on releasing this package as Apache PDFBox 2.0.6.
The vote is open for the next 72 hours and passes if a majority of at least 
three +1 PDFBox PMC votes are cast.

 [ ] +1 Release this package as Apache PDFBox 2.0.6
 [ ] -1 Do not release this package because...


Here is my +1

BR
Andreas Lehmkühler




RE: 2.0.6 release ?

2017-05-12 Thread Allison, Timothy B.

http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz

Looks good to me on a very cursory look.




RE: 2.0.6 release ?

2017-05-11 Thread Allison, Timothy B.
> It isn't that secret as Tim posted it somewhere in this thread

:)

I've added throttling to httpd (I think) so we should be ok, and y, the address 
is out in the open now.

Let me know if I should kick off another run.

Thank you, all!





RE: 2.0.6 release ?

2017-05-10 Thread Allison, Timothy B.
Haven't had a chance to look. Reports are here:
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz


RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
I won't have results immediately.  :)

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, May 9, 2017 4:13 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 09.05.2017 um 22:03 schrieb Allison, Timothy B.:
> UGH.  I'm so wrong.  I accidentally had a 2.0.4.jar in my app/target...
>
> 
>
> Off we go?

Yes! However it's 10pm here, so I won't be able to react to the results 
immediately.

Tilman

>
>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, May 9, 2017 3:49 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> You caught me... I haven't checked these yet.
>
> But I did now, with
> MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
> 3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
> IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
> but they don't throw an NPE anymore now.
>
> Oops... I see I have that check you mention in my code, it has been there for 
> months and I forgot to make an issue. But after removing it, it still works 
> with the three files... so the question is, can this parameter ever be null, 
> or not?
>
> Tilman
>
> Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.:
>> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 
>> new NPE exceptions)?  Has this been fixed, or would that cause unintended 
>> problems?
>>
>>   /**
>>    * Returns true if the node is a page tree node (i.e. an intermediate node).
>>    */
>>   private boolean isPageTreeNode(COSDictionary node)
>>   {
>>       // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type,
>>       // so we have to check for the presence of Kids too
>>       return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>>              node.containsKey(COSName.KIDS);
>>   }
>>
>> -Original Message-
>> From: Tilman Hausherr [mailto:thaush...@t-online.de]
>> Sent: Tuesday, May 9, 2017 3:20 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0.6 release ?
>>
>> Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>>>> I've fixed all remaining regression tickets (in the end it was 
>>>> exactly 1)
>>> Great!  Thank you!
>>>
>>> Let me know when I should kick off another eval.
>> Yes, please do.
>>
>> Thanks
>>
>> Tilman




RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
UGH.  I'm so wrong.  I accidentally had a 2.0.4.jar in my app/target...



Off we go?


-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for 
months and I forgot to make an issue. But after removing it, it still works 
with the three files... so the question is, can this parameter ever be null, or 
not?

Tilman

Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 
> new NPE exceptions)?  Has this been fixed, or would that cause unintended 
> problems?
>
>  /**
>   * Returns true if the node is a page tree node (i.e. an intermediate node).
>   */
>  private boolean isPageTreeNode(COSDictionary node)
>  {
>      // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type,
>      // so we have to check for the presence of Kids too
>      return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>             node.containsKey(COSName.KIDS);
>  }
>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>>> I've fixed all remaining regression tickets (in the end it was 
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman


RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
With lots of empty pages...

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Tuesday, May 9, 2017 3:57 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

Doh.  AR can't open it.  Sorry.  Chrome appears to be able to open it.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 3:56 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K

throws NPE and opens without complaint in AR.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for 
months and I forgot to make an issue. But after removing it, it still works 
with the three files... so the question is, can this parameter ever be null, or 
not?

Tilman

Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 
> new NPE exceptions)?  Has this been fixed, or would that cause unintended 
> problems?
>
>  /**
>   * Returns true if the node is a page tree node (i.e. an intermediate node).
>   */
>  private boolean isPageTreeNode(COSDictionary node)
>  {
>      // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type,
>      // so we have to check for the presence of Kids too
>      return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>             node.containsKey(COSName.KIDS);
>  }
>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>>> I've fixed all remaining regression tickets (in the end it was 
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman


RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
Doh.  AR can't open it.  Sorry.  Chrome appears to be able to open it.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Tuesday, May 9, 2017 3:56 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K

throws NPE and opens without complaint in AR.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for 
months and I forgot to make an issue. But after removing it, it still works 
with the three files... so the question is, can this parameter ever be null, or 
not?

Tilman

Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 
> new NPE exceptions)?  Has this been fixed, or would that cause unintended 
> problems?
>
>  /**
>   * Returns true if the node is a page tree node (i.e. an intermediate node).
>   */
>  private boolean isPageTreeNode(COSDictionary node)
>  {
>      // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type,
>      // so we have to check for the presence of Kids too
>      return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>             node.containsKey(COSName.KIDS);
>  }
>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>>> I've fixed all remaining regression tickets (in the end it was 
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman


RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K

throws NPE and opens without complaint in AR.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, May 9, 2017 3:49 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

You caught me... I haven't checked these yet.

But I did now, with
MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf
3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx
IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx
but they don't throw an NPE anymore now.

Oops... I see I have that check you mention in my code, it has been there for 
months and I forgot to make an issue. But after removing it, it still works 
with the three files... so the question is, can this parameter ever be null, or 
not?

Tilman

Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 
> new NPE exceptions)?  Has this been fixed, or would that cause unintended 
> problems?
>
>  /**
>   * Returns true if the node is a page tree node (i.e. an intermediate node).
>   */
>  private boolean isPageTreeNode(COSDictionary node)
>  {
>      // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type,
>      // so we have to check for the presence of Kids too
>      return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
>             node.containsKey(COSName.KIDS);
>  }
>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, May 9, 2017 3:20 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>>> I've fixed all remaining regression tickets (in the end it was 
>>> exactly 1)
>> Great!  Thank you!
>>
>> Let me know when I should kick off another eval.
>
> Yes, please do.
>
> Thanks
>
> Tilman




RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new 
NPE exceptions)?  Has this been fixed, or would that cause unintended problems?

    /**
     * Returns true if the node is a page tree node (i.e. an intermediate node).
     */
    private boolean isPageTreeNode(COSDictionary node)
    {
        // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type,
        // so we have to check for the presence of Kids too
        return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
               node.containsKey(COSName.KIDS);
    }
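
For reference, the null guard being asked about could look roughly like the 
sketch below.  This is only an illustration: a plain Map stands in for 
COSDictionary so the snippet runs without PDFBox on the classpath, the string 
keys "Type", "Pages" and "Kids" stand in for the COSName constants, and 
PageTreeNodeCheck is a made-up class name.  It does not answer whether the 
parameter can legitimately be null.

```java
import java.util.Map;

/** Stand-in sketch; the real method lives in PDPageTree and takes a COSDictionary. */
final class PageTreeNodeCheck {

    static boolean isPageTreeNode(Map<String, Object> node) {
        // Guard first: a null node is simply "not a page tree node"
        // instead of triggering a NullPointerException.
        if (node == null) {
            return false;
        }
        // Same test as before: Type is Pages, or Kids is present.
        return "Pages".equals(node.get("Type")) || node.containsKey("Kids");
    }
}
```

Returning false keeps callers on the "treat it as a leaf" path, which may or 
may not be what the callers in PDPageTree expect for a missing node.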

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, May 9, 2017 3:20 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>> I've fixed all remaining regression tickets (in the end it was 
>> exactly 1)
> Great!  Thank you!
>
> Let me know when I should kick off another eval.


Yes, please do.

Thanks

Tilman





RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
>I've fixed all remaining regression tickets (in the end it was exactly 1)

Great!  Thank you!

Let me know when I should kick off another eval.




RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
Added a page count comparison report under "content/":

http://162.242.228.174/reports/reports_pdfbox_2_0_6c.tar.gz

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Tuesday, May 9, 2017 2:39 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

http://162.242.228.174/reports/reports_pdfbox_2_0_6b.tar.gz

Added CONTAINER_LENGTH to reports that have a file path.  This is the length in 
bytes of the container file (as opposed to the embedded file).

Thank you!

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 09.05.2017 um 02:43 schrieb Allison, Timothy B.:
> Tilman's initial recommendation


Can you do me another favor? Have a column with the size in any table that is 
about individual files. I think it was there in the past, but I may be wrong.

Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman





RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
http://162.242.228.174/reports/reports_pdfbox_2_0_6b.tar.gz

Added CONTAINER_LENGTH to reports that have a file path.  This is the length in 
bytes of the container file (as opposed to the embedded file).

Thank you!

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 09.05.2017 um 02:43 schrieb Allison, Timothy B.:
> Tilman's initial recommendation


Can you do me another favor? Have a column with the size in any table that is 
about individual files. I think it was there in the past, but I may be wrong.

Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman





RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
Y.  Will do.  Meetings beckon, so it will take a few hours. :(

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 09.05.2017 um 02:43 schrieb Allison, Timothy B.:
> Tilman's initial recommendation


Can you do me another favor? Have a column with the size in any table that is 
about individual files. I think it was there in the past, but I may be wrong.

Reason: I try to get small files to keep any "examples" for my regression tests.

Tilman





RE: 2.0.6 release ?

2017-05-08 Thread Allison, Timothy B.
For the reports comparing 2.0.3 with 2.0.5, see 
https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14V1_15.zip
 

That was a full run against all file types of Tika 1.14 vs 1.15-SNAPSHOT from 
April 25.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, May 8, 2017 8:43 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?

Content

1)  To get a _general_ sense of overall content extraction, see 
"content/common_token_comparisons_by_mime.xlsx".  This suggests that we've lost 
248k "common words"[1], which out of 2.6 billion isn't much.  However, we also 
lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 
1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an 
improvement.

2)  If you want to compare content whether or not there was a parse exception, 
see "content/content_diffs_with_exceptions.xlsx".

3) If you only want to see content diffs where both extracts did not have an 
exception, see "content/content_diffs_ignore_exceptions.xlsx".

To make quick sense of the content_diffs_files, sort 
"NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files 
lost the most common tokens.

To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which 
compare the number of unique tokens/tokens in common...a low number means 
little similarity, while a number close to 1.0 means that the unigrams are 
nearly identical.


From a quick look, many of the files with fewer common words are in the 
"likely_broken" and/or "truncated" subdirectories...  Some exceptions to this 
rule include the following, but there are more...and overall, there is a fair 
amount of loss from 2.0.3.

govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56

[1] For this version of tika-eval, I expanded Tilman's initial recommendation 
of common words for English a bit.  I took the top 20k most common words (4 
characters or more, except for CJK) for a large number of Wikipedia dumps.  I 
removed common html markup words (body, form, table) so that failure to strip 
html doesn't incorrectly boost scores.

 We apply language id and then use the common words for that language.  For 
example, for 
truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW

* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 
tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there 
were 320 common words from the English list of common words.
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
> Happy to.  Will kick off now?

Yes

Tilman

>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
>> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
>
> Thanks
>
> Tilman



RE: 2.0.6 release ?

2017-05-08 Thread Allison, Timothy B.
Content

1)  To get a _general_ sense of overall content extraction, see 
"content/common_token_comparisons_by_mime.xlsx".  This suggests that we've lost 
248k "common words"[1], which out of 2.6 billion isn't much.  However, we also 
lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 
1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an 
improvement.

2)  If you want to compare content whether or not there was a parse exception, 
see "content/content_diffs_with_exceptions.xlsx".

3) If you only want to see content diffs where both extracts did not have an 
exception, see "content/content_diffs_ignore_exceptions.xlsx".

To make quick sense of the content_diffs_files, sort 
"NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files 
lost the most common tokens.

To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which 
compare the number of unique tokens/tokens in common...a low number means 
little similarity, while a number close to 1.0 means that the unigrams are 
nearly identical.
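
As a rough illustration of those two columns, here is a sketch using the 
textbook definitions: Dice over the sets of unique tokens, and an overlap score 
over all token occurrences.  This is an assumption about how to read 
"unique tokens/tokens in common"; tika-eval's exact formulas may differ, and 
TokenSimilarity is a made-up class name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only; not the tika-eval implementation. */
final class TokenSimilarity {

    /** Dice coefficient over unique tokens: 2 * |common| / (|A| + |B|). */
    static double dice(List<String> a, List<String> b) {
        Set<String> ua = new HashSet<>(a);
        Set<String> ub = new HashSet<>(b);
        if (ua.isEmpty() && ub.isEmpty()) {
            return 1.0; // two empty extracts are treated as identical
        }
        Set<String> common = new HashSet<>(ua);
        common.retainAll(ub);
        return 2.0 * common.size() / (ua.size() + ub.size());
    }

    /** Overlap over all tokens: shared occurrences count, not just types. */
    static double overlap(List<String> a, List<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Map<String, Integer> ca = counts(a);
        Map<String, Integer> cb = counts(b);
        long shared = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            shared += Math.min(e.getValue(), cb.getOrDefault(e.getKey(), 0));
        }
        return 2.0 * shared / (a.size() + b.size());
    }

    // Counts how often each token occurs.
    private static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : tokens) {
            m.merge(t, 1, Integer::sum);
        }
        return m;
    }
}
```

Both scores run from 0.0 (nothing shared) to 1.0 (same unigrams), which matches 
the reading above that a low number means little similarity.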


From a quick look, many of the files with fewer common words are in the 
"likely_broken" and/or "truncated" subdirectories...  Some exceptions to this 
rule include the following, but there are more...and overall, there is a fair 
amount of loss from 2.0.3.

govdocs1/202/202097.pdf
govdocs1/358/358043.pdf
commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6
commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56

[1] For this version of tika-eval, I expanded Tilman's initial recommendation 
of common words for English a bit.  I took the top 20k most common words (4 
characters or more, except for CJK) for a large number of Wikipedia dumps.  I 
removed common html markup words (body, form, table) so that failure to strip 
html doesn't incorrectly boost scores.

 We apply language id and then use the common words for that language.  For 
example, for 
truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW

* PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 
tokens from the French list of common words.
* PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there 
were 320 common words from the English list of common words.
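
The "common words" count described in [1] can be sketched as follows.  This is 
a simplified illustration, not the tika-eval code: the language code is assumed 
to come from a separate language-id step, the word lists are passed in as a map, 
and CommonTokenCounter is a made-up name.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative only; language id and tokenization happen elsewhere. */
final class CommonTokenCounter {

    /**
     * Counts the tokens that appear in the common-word list of the
     * language the text was identified as.
     */
    static int countCommonTokens(String lang, List<String> tokens,
                                 Map<String, Set<String>> commonWordsByLang) {
        Set<String> words = commonWordsByLang.get(lang);
        if (words == null) {
            return 0; // no list for this language, nothing to count
        }
        int n = 0;
        for (String t : tokens) {
            if (words.contains(t.toLowerCase(Locale.ROOT))) {
                n++;
            }
        }
        return n;
    }
}
```

Because the list depends on the detected language, a change in language id 
between A and B (as in the French/English example above) changes which list the 
tokens are compared against.
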
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
> Happy to.  Will kick off now?

Yes

Tilman

>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
>> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
>
> Thanks
>
> Tilman



RE: 2.0.6 release ?

2017-05-08 Thread Allison, Timothy B.
Results here: http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz

A = 2.0.5
B = 2.0.6-SNAPSHOT from 12 hours ago.

I've only had a chance to look at the exceptions, attachments and metadata so 
far. 

For the new exceptions (roughly grouped by stacktrace), see 
"exceptions/new_exceptions_in_B_by_mime_by_stack_trace.xlsx"

For the full stack traces and triggering file paths (prepend 
http://162.242.228.174/docs to retrieve the source files), see 
"exceptions/new_excetions_in_B_details.xlsx".

For the fixed exceptions, see "exceptions/fixed_exceptions_in_B_by_mime.xlsx" 
and *_details.xlsx.

To confirm that the content from the "fixed exceptions" looks language-y, 
scan through "exceptions/contents_of_fixed_exceptions_in_B.xlsx".

There are a few handfuls of diffs in attachments and metadata, and I'll look 
into these.

Off to look at the contents...


-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
> Happy to.  Will kick off now?

Yes

Tilman

>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Saturday, May 6, 2017 10:02 AM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.6 release ?
>
> Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
>> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>>> Hi,
>>>
>>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>>> any objections?
>> I'm targeting the 15th or 16th
> Tim, could you please run your tests when time allows?
>
> Thanks
>
> Tilman




RE: low priority: proxy settings and unit tests?

2017-05-08 Thread Allison, Timothy B.
If there aren't objections, I'll open a ticket and make that change after the 
2.0.6 release.  Thank you!

-Original Message-
From: Maruan Sahyoun [mailto:sahy...@fileaffairs.de] 
Sent: Monday, May 8, 2017 9:39 AM
To: dev@pdfbox.apache.org
Subject: Re: low priority: proxy settings and unit tests?

How about skipping the test on ".. Connection refused .."?

BR
Maruan


> Am 08.05.2017 um 15:36 schrieb Allison, Timothy B. :
> 
> All,
>  Apologies for this one...  Is there an easy way to set proxy information for 
> the unit tests that get an InputStream via URL without changing any source 
> code or project poms?  In IntelliJ, I can modify the program arguments for 
> each one, but then, of course, Maven doesn't pick up that information when I 
> do a build.
> 
>  I've been adding @Ignore to the unit tests in my local copy, but there has 
> to be a better way.
> 
> Failed tests:
>  PDButtonTest.testRadioButtonWithOptions:131 Unexpected IOException 
> Connection refused: connect
>  PDButtonTest.testOptionsAndNamesNotNumbers:187 Unexpected IOException 
> Connection refused: connect
> 
> Tests in error:
>  MergeAcroFormsTest.testAPEntry:92 > Connect Connection refused: 
> connect
>  MergeAcroFormsTest.testAnnotsEntry:59 > Connect Connection refused: 
> connect
>  MergeAnnotationsTest.testLinkAnnotations:61 > Connect Connection refused: 
> conn...





low priority: proxy settings and unit tests?

2017-05-08 Thread Allison, Timothy B.
All,
  Apologies for this one...  Is there an easy way to set proxy information for 
the unit tests that get an InputStream via URL without changing any source code 
or project poms?  In IntelliJ, I can modify the program arguments for each one, 
but then, of course, Maven doesn't pick up that information when I do a build.

  I've been adding @Ignore to the unit tests in my local copy, but there has to 
be a better way.

Failed tests:
  PDButtonTest.testRadioButtonWithOptions:131 Unexpected IOException Connection 
refused: connect
  PDButtonTest.testOptionsAndNamesNotNumbers:187 Unexpected IOException 
Connection refused: connect

Tests in error:
  MergeAcroFormsTest.testAPEntry:92 > Connect Connection refused: connect
  MergeAcroFormsTest.testAnnotsEntry:59 > Connect Connection refused: connect
  MergeAnnotationsTest.testLinkAnnotations:61 > Connect Connection refused: 
conn...


RE: 2.0.6 release ?

2017-05-08 Thread Allison, Timothy B.
Happy to.  Will kick off now?

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Saturday, May 6, 2017 10:02 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler:
>> Hi,
>>
>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, 
>> any objections?
> I'm targeting the 15th or 16th

Tim, could you please run your tests when time allows?

Thanks

Tilman





jai-imageio-core -- BSD-3 with nuclear clause

2017-04-27 Thread Allison, Timothy B.
PDFBox colleagues,

  On TIKA-2338, we're considering incorporating jai-imageio-core into Tika 
(removing the "provided" scope) because the authors on GitHub claim that 
they've removed the non-ASL 2.0 parts from jai-imageio-core.

  We noticed, though, that this is BSD-3 with the nuclear clause.  We can't 
find anything about nukes in the usual place[1].  We've opened LEGAL-304.

  Have you considered this at all?  Would you have any insight into whether the 
nuclear clause is "field of use" (which would mean we could not do this) or 
"acceptance of no liability" (which would mean we could do this)?

  Thank you.

 Cheers,

   Tim

[1] https://www.apache.org/legal/resolved.html


RE: [VOTE] Release Apache PDFBox 2.0.5

2017-03-16 Thread Allison, Timothy B.
Tilman and Andreas, thank you for taking a look!

I agree no need to stop the release.  The improvements far outweigh the small 
regression.

> I had a look at content_diffs_with_exceptions.xlsx, then looking only 
> at govdocs there, all are similar or better.

Y, agreed.  Do we care about these likely broken PDFs from which 2.0.4 appears 
to be able to extract more "common words" than 2.0.5?  

commoncrawl2_likely_broken/OV/OVWMJPQGCK2AQZYVWJWYUPTERPXOGIAD
commoncrawl2_likely_broken/R4/R4P75EJNMNXZC2DQYUFB6BSXQ2CWGVG7.pdf
commoncrawl2_likely_broken/BI/BIVJLJ4QULQQ4VHKKNMBUTKWXAMMN53N.pdf
commoncrawl2_likely_broken/LB/LB6LEZ75Y6OL7SGW7SV6JNO4G6FS7HAS
commoncrawl2_likely_broken/LQ/LQQFDYEI7XTOBMFPSL3IDVKRMUB6YIGU
commoncrawl2_likely_broken/OB/OBQTIKQW3MIEYJPGE4NR5WGPDUZC3ULY
commoncrawl2_likely_broken/BC/BCZSFNQAB62TUBURWG6B3ZOZCG5IH46P
commoncrawl2_likely_broken/TV/TVMANAJVH2VQVABYX6LCVO5KTERLFS2I.pdf

Out of 543,805 PDFs in our test set, and given that they're broken, I'm not 
overly concerned.

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Wednesday, March 15, 2017 5:30 PM
To: dev@pdfbox.apache.org
Subject: Re: [VOTE] Release Apache PDFBox 2.0.5

Am 15.03.2017 um 19:07 schrieb Tilman Hausherr:
> Thanks Tim!
>
> I looked at newExceptionsInBDetails.xlsx (247 entries). IMHO no need 
> to stop the release, the number of entries in 
> fixedExceptionsInBDetails.xlsx (506) is larger, and the files with exceptions 
> are cut off.
I agree. However, I've checked one of the files, 015664.pdf, and it looks like a 
regression. I can open it using 2.0.4 but get the described exception with 
2.0.5 :-(

BR
Andreas

> I'll create an issue about these.
>
> I had a look at content_diffs_with_exceptions.xlsx, then looking only 
> at govdocs there, all are similar or better.
>
> Tilman
>
> Am 15.03.2017 um 00:03 schrieb Allison, Timothy B.:
>> +1
>>
>> I ran a comparison with 2.0.5-rc1 and (I think) 2.0.4 against ~500k 
>> files from our regression corpus.
>>
>> I haven't had a chance to do much digging, but I wanted to share what 
>> I had as soon as I had it.
>>
>> Reports are here:
>> https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2.0.5-rc1.zip
>>
>>
>> Lots more "common words".  Many fewer exceptions.  There may be a 
>> regression that is causing 244 new exceptions, but on balance, the 
>> improvements are impressive.
>>
>>
>> java.io.IOException: Missing root object specification in trailer.
>>     at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2169)
>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:222)
>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:271)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:922)
>>     at ...
>>
>> -Original Message-
>> From: Timo Boehme [mailto:timo.boe...@ontochem.com]
>> Sent: Tuesday, March 14, 2017 9:11 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: [VOTE] Release Apache PDFBox 2.0.5
>>
>> Hi,
>>
>> +1
>>
>> Maybe we should add the
>> -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true
>> setting (introduced with 2.0.4) to the Migration/Getting Started 
>> Web-Pages. I had to look through my emails in order to find it and it 
>> really makes a difference (at least on some systems) if there are a 
>> lot of images on a page - so far we only have the
>> -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
>> setting documented (which did not help in my case). At least the user 
>> may try it out if rendering gets slow on some pages; it may not be a 
>> good general setting as it also may slow rendering down a bit on pages with 
>> few large images.
>>
>>
>> Best,
>> Timo
>>
>>
>> Am 13.03.2017 um 19:18 schrieb Andreas Lehmkuehler:
>>> Hi,
>>>
>>> a candidate for the PDFBox 2.0.5 release is available at:
>>>
>>>  https://dist.apache.org/repos/dist/dev/pdfbox/2.0.5/
>>>
>>> The release candidate is a zip archive of the sources in:
>>>
>>>  http://svn.apache.org/repos/asf/pdfbox/tags/2.0.5/
>>>
>>> The SHA1 checksum of the archive is
>>> 9521349be859498dfdd0e0f2a5d02b082f097ab1.
>>>
>>> Please vote on releasing this package as Apache PDFBox 2.0.5.
>>> The vote is open for the next 72 hours and passes if a majority of 

RE: [VOTE] Release Apache PDFBox 2.0.5

2017-03-14 Thread Allison, Timothy B.
+1

I ran a comparison with 2.0.5-rc1 and (I think) 2.0.4 against ~500k files from 
our regression corpus.

I haven't had a chance to do much digging, but I wanted to share what I had as 
soon as I had it.

Reports are here: 
https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2.0.5-rc1.zip
 

Lots more "common words".  Many fewer exceptions.  There may be a regression 
that is causing 244 new exceptions, but on balance, the improvements are 
impressive.


java.io.IOException: Missing root object specification in trailer.
    at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2169)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:222)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:271)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:922)
    at ...

-Original Message-
From: Timo Boehme [mailto:timo.boe...@ontochem.com] 
Sent: Tuesday, March 14, 2017 9:11 AM
To: dev@pdfbox.apache.org
Subject: Re: [VOTE] Release Apache PDFBox 2.0.5

Hi,

+1

Maybe we should add the
   -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true
setting (introduced with 2.0.4) to the Migration/Getting Started Web-Pages. I 
had to look through my emails in order to find it and it really makes a 
difference (at least on some systems) if there are a lot of images on a page - 
so far we only have the
   -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
setting documented (which did not help in my case). At least the user may try 
it out if rendering gets slow on some pages; it may not be a good general 
setting as it also may slow rendering down a bit on pages with few large images.
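
For anyone who would rather try these settings from code than on the command line, a minimal sketch follows. It only uses System.setProperty with the two property names quoted above; the class and method names are made up for illustration, and it is my assumption (not verified against the PDFBox source) that the properties have to be set before the first rendering call for them to take effect.

```java
// Hypothetical helper class; only the two property names and the KCMS value come from this thread.
class RenderingSettings {
    static final String PURE_JAVA_CMYK =
            "org.apache.pdfbox.rendering.UsePureJavaCMYKConversion";
    static final String JAVA2D_CMM = "sun.java2d.cmm";

    /** Same effect as -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true */
    static void usePureJavaCmyk() {
        System.setProperty(PURE_JAVA_CMYK, "true");
    }

    /** Same effect as -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider */
    static void useKcms() {
        System.setProperty(JAVA2D_CMM, "sun.java2d.cmm.kcms.KcmsServiceProvider");
    }

    public static void main(String[] args) {
        // Assumed to be needed before any page is rendered.
        usePureJavaCmyk();
        System.out.println(PURE_JAVA_CMYK + "=" + System.getProperty(PURE_JAVA_CMYK));
    }
}
```

The two settings are kept in separate methods because, as described above, one may help on a given system where the other does not.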


Best,
Timo


Am 13.03.2017 um 19:18 schrieb Andreas Lehmkuehler:
> Hi,
>
> a candidate for the PDFBox 2.0.5 release is available at:
>
> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.5/
>
> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/pdfbox/tags/2.0.5/
>
> The SHA1 checksum of the archive is
> 9521349be859498dfdd0e0f2a5d02b082f097ab1.
>
> Please vote on releasing this package as Apache PDFBox 2.0.5.
> The vote is open for the next 72 hours and passes if a majority of at 
> least three +1 PDFBox PMC votes are cast.
>
> [ ] +1 Release this package as Apache PDFBox 2.0.5
> [ ] -1 Do not release this package because...
>
>
> Here is my +1
>
> BR
> Andreas Lehmkühler
>
>


--
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4| fax: +49 345 478 047 1
email: timo.boe...@ontochem.com | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal   | USt-IdNr.: DE815563824
managing director : Lutz Weber





tika-eval

2017-02-17 Thread Allison, Timothy B.
All,

  I finally got around to adding tika-eval[1] to Apache Tika.  If you have any 
interest in comparing the output of different tools/versions/parameters on text 
extraction, give it a try.  You don't need to use Tika or format the output in 
a specific format; plain UTF-8 text will work.

  Tilman, I generalized your common word count methodology.  The code now runs 
language id on the text and then counts the common words for that language.
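
To make the "common words" idea concrete, here is a rough sketch. It is not the tika-eval implementation: it skips language identification entirely and counts tokens that appear in a tiny, made-up English word list. The class name CommonWordCounter and the word list are assumptions for illustration only. The two sample strings are the Delaware court extraction results discussed in the PDFBox 2.0.3 comparison thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a real run would choose the word list after language identification.
class CommonWordCounter {
    // Hypothetical, tiny stand-in for a per-language common-word list.
    static final Set<String> ENGLISH =
            new HashSet<>(Arrays.asList("in", "the", "of", "court", "state"));

    /** Counts whitespace-separated tokens that appear in the given word list. */
    static int count(String text, Set<String> commonWords) {
        int hits = 0;
        for (String token : text.trim().split("\\s+")) {
            if (commonWords.contains(token.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String garbled = "IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE";
        String clean = "IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE";
        System.out.println("garbled: " + count(garbled, ENGLISH)); // prints "garbled: 4"
        System.out.println("clean:   " + count(clean, ENGLISH));   // prints "clean:   8"
    }
}
```

Even with this toy list, the extraction with spurious spaces scores half as many common words as the clean one, which is the kind of signal the comparison reports rely on.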

  Lots more work remains.  Thank you, all, for contributing to the 
methodologies!

 Cheers,

  Tim


[1] https://wiki.apache.org/tika/TikaEval


RE: [VOTE] Release Apache PDFBox 2.0.4

2016-12-12 Thread Allison, Timothy B.
+1

Comparisons available here:

http://162.242.228.174/reports/reports_pdfbox_2_0_3_vs_2_0_4-rc1.tar.bz2

No new exceptions, a few fixed exceptions, better content extraction.  Thank 
you, all!

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Monday, December 12, 2016 12:53 PM
To: dev@pdfbox.apache.org
Subject: [VOTE] Release Apache PDFBox 2.0.4

Hi,

a candidate for the PDFBox 2.0.4 release is available at:

 https://dist.apache.org/repos/dist/dev/pdfbox/2.0.4/

The release candidate is a zip archive of the sources in:

 http://svn.apache.org/repos/asf/pdfbox/tags/2.0.4/

The SHA1 checksum of the archive is 4b1844a268d65b05ac371a848c0b8c27f390052b.

Please vote on releasing this package as Apache PDFBox 2.0.4.
The vote is open for the next 72 hours and passes if a majority of at least 
three +1 PDFBox PMC votes are cast.

 [ ] +1 Release this package as Apache PDFBox 2.0.4
 [ ] -1 Do not release this package because...


Here is my +1

BR
Andreas Lehmkühler




RE: New releases

2016-12-12 Thread Allison, Timothy B.
Or, turns out the 12th...ugh.  I just kicked off the regression tests.  Should 
have results within 8 hours.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Tuesday, November 29, 2016 3:36 PM
To: dev@pdfbox.apache.org
Subject: RE: New releases

+1
 
I should have time to run the regression tests against 2.0.x the week of the 
5th.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, November 29, 2016 2:21 AM
To: dev@pdfbox.apache.org
Subject: Re: New releases

Am 28.11.2016 um 21:38 schrieb Andreas Lehmkuehler:
> Am 24.11.2016 um 14:43 schrieb Andreas Lehmkuehler:
>> Hi,
>>
>> I'm planning to cut new releases for 1.8.x and 2.0.x in about 2-3 
>> weeks from now.
>
> I'm going to cut the releases as follows:
>
> - 1.8.13 on Monday 5th of December
> - 2.0.4 on Monday 12th of December

+1






FW: ApacheCon Miami is coming in May.

2016-11-30 Thread Allison, Timothy B.
> ApacheCon and Apache Big Data will be held at the Intercontinental in Miami, 
> Florida, May 16-18, 2017

I plan to attend.  

Who's in?  Any interest in collaborating on a talk or submitting your own?

Cheers,

 Tim

-Original Message-
From: Rich Bowen [mailto:rbo...@apache.org] 
Sent: Wednesday, November 30, 2016 1:34 PM
Subject: ApacheCon Miami is coming in May.

Dear Apache Committer,


ApacheCon and Apache Big Data will be held at the Intercontinental in Miami, 
Florida, May 16-18, 2017.

...


RE: New releases

2016-11-29 Thread Allison, Timothy B.
+1
 
I should have time to run the regression tests against 2.0.x the week of the 
5th.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, November 29, 2016 2:21 AM
To: dev@pdfbox.apache.org
Subject: Re: New releases

Am 28.11.2016 um 21:38 schrieb Andreas Lehmkuehler:
> Am 24.11.2016 um 14:43 schrieb Andreas Lehmkuehler:
>> Hi,
>>
>> I'm planning to cut new releases for 1.8.x and 2.0.x in about 2-3 
>> weeks from now.
>
> I'm going to cut the releases as follows:
>
> - 1.8.13 on Monday 5th of December
> - 2.0.4 on Monday 12th of December

+1






Apache Tika's public regression corpus

2016-10-05 Thread Allison, Timothy B.
All,

I recently blogged about some of the work we're doing with a large scale 
regression corpus to make Tika, POI and PDFBox more robust and to identify 
regressions before release.  If you'd like to chip in with recommendations, 
requests or Hadoop/Spark clusters (why not shoot for the stars), please do!

  
http://openpreservation.org/blog/2016/10/04/apache-tikas-regression-corpus-tika-1302/

Many thanks, again, to Rackspace for our vm and to Common Crawl and govdocs1 
for most of our files!

Cheers,

 Tim




RE: New PDFBox Committer

2016-09-19 Thread Allison, Timothy B.
Thank you, all! I am honored to join your ranks!

Cheers,

   Tim
 
-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Monday, September 19, 2016 7:55 AM
To: dev@pdfbox.apache.org
Subject: New PDFBox Committer

Hi,

I'm happy to announce that the PDFBox PMC has decided to offer committership in 
Apache PDFBox to Tim Allison. He has accepted the offer and should have his 
committer-bits ready by now.

Like all other committers, Tim has joined the PMC as well.

BR
Andreas Lehmkühler

P.S.: Some of you might already know Tim as committer and PMC Member of Apache 
Tika and Apache POI.




RE: PDFBox 2.0.3 TIKA comparison

2016-09-15 Thread Allison, Timothy B.
Great.  Thank you!

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Thursday, September 15, 2016 12:03 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison

Am 14.09.2016 um 20:50 schrieb Tilman Hausherr:
>
>> Am 14.09.2016 um 18:38 schrieb Allison, Timothy B.:
>>>
>>>
>>> There are some regressions in content extraction, but overall, 
>>> content extraction looks to have improved quite a bit.  Looks like
>>> ~2 million more "common English words" via Tilman's methodology. 
>
> After some wandering around I finally looked at content extraction 
> only, at column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
> It turned out that all files were from Delaware courts, so I've 
> decided to look only at one single file, 
> Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
> The extracted text with 2.0.2 and 2.0.3 is
>
> IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE
>
> in 2.0.1 and 1.8 it is
>
> IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
>
> For 1.8 the explanation is that text extraction takes words, while in
> 2.* each character is taken alone.
>
> The bad result in 2.0.3 is because of an incorrect /W array. The space 
> has a width of 3, while other characters have widths between 200 and 
> 722. So PDFBox believes that there are spaces where there are none.

The story is different: the space width (which is 250, not 3; the table is a 
ranges array) is NOT taken from the space glyph, but from an average of all 
glyphs. It's a good thing I looked back through the history. The breaking 
change was in rev 1744613 (PDFBOX-3354) and is related to the calculation of 
the average glyph width. Before rev 1744613 the averageWidth was always 0 
(due to a bug, likely introduced accidentally in some refactoring), and text 
extraction replaced that 0 with a default value (1000).

Starting with rev 1744613 an average width was calculated, but because the /W 
ranges array assigns a width of 0 to most CIDs (the ranges run up to 65534), 
the result was unreliable:

/W [1 1 0 2 3 250 4 10 0 11
12 333 13 14 0 15 15 250 16 16
333 17 17 250 18 18 277 19 19 0
20 23 500 24 35 0 36 36 722 37
37 666 38 39 722 40 40 666 41 41
610 42 43 777 44 44 389 45 45 0
46 46 777 47 47 666 48 48 943 49
49 722 50 50 777 51 51 610 52 52
0 53 53 722 54 54 556 55 55 666
56 57 722 59 59 0 60 60 722 61
67 0 68 68 500 69 69 556 70 70
443 71 71 556 72 72 443 73 73 333
74 74 500 75 75 556 76 76 277 77
77 0 78 78 556 79 79 277 80 80
833 81 81 556 82 82 500 83 84 556
85 85 443 86 86 389 87 87 333 88
88 556 89 89 0 90 90 722 91 92
500 93 178 0 179 180 500 181 181 0
182 182 333 183 751 0 752 752 198 753
794 0 795 795 612 796 1126 0 1127 1127
125 1129 1129 2000 1130 65534 0]

Solution: ignore widths that are <= 0. Zero widths are already ignored in 
PDFont, but not in PDCIDFont.

Average width before the fix: 0.52861196. After the fix: 549.8571.
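
The effect of the fix can be checked with a few lines of arithmetic. The sketch below is not the PDFBox code: it re-implements the two averaging rules over the /W array above, reading it as (first CID, last CID, width) triples and weighting each range by the number of CIDs it covers. That weighting, and the class and method names, are my assumptions.

```java
// Sketch only: mirrors the arithmetic described above, not the PDFBox source.
class AverageGlyphWidth {
    // The /W array from this message as (firstCid, lastCid, width) triples.
    static final int[] W = {
        1, 1, 0,  2, 3, 250,  4, 10, 0,  11, 12, 333,  13, 14, 0,  15, 15, 250,
        16, 16, 333,  17, 17, 250,  18, 18, 277,  19, 19, 0,  20, 23, 500,
        24, 35, 0,  36, 36, 722,  37, 37, 666,  38, 39, 722,  40, 40, 666,
        41, 41, 610,  42, 43, 777,  44, 44, 389,  45, 45, 0,  46, 46, 777,
        47, 47, 666,  48, 48, 943,  49, 49, 722,  50, 50, 777,  51, 51, 610,
        52, 52, 0,  53, 53, 722,  54, 54, 556,  55, 55, 666,  56, 57, 722,
        59, 59, 0,  60, 60, 722,  61, 67, 0,  68, 68, 500,  69, 69, 556,
        70, 70, 443,  71, 71, 556,  72, 72, 443,  73, 73, 333,  74, 74, 500,
        75, 75, 556,  76, 76, 277,  77, 77, 0,  78, 78, 556,  79, 79, 277,
        80, 80, 833,  81, 81, 556,  82, 82, 500,  83, 84, 556,  85, 85, 443,
        86, 86, 389,  87, 87, 333,  88, 88, 556,  89, 89, 0,  90, 90, 722,
        91, 92, 500,  93, 178, 0,  179, 180, 500,  181, 181, 0,  182, 182, 333,
        183, 751, 0,  752, 752, 198,  753, 794, 0,  795, 795, 612,
        796, 1126, 0,  1127, 1127, 125,  1129, 1129, 2000,  1130, 65534, 0
    };

    /** Average over every CID covered by the ranges, zero widths included. */
    static float averageAll(int[] w) {
        return average(w, false);
    }

    /** Average over the CIDs whose width is greater than zero. */
    static float averagePositive(int[] w) {
        return average(w, true);
    }

    private static float average(int[] w, boolean skipNonPositive) {
        long total = 0;
        long count = 0;
        for (int i = 0; i + 2 < w.length; i += 3) {
            int cids = w[i + 1] - w[i] + 1; // number of CIDs in this range
            int width = w[i + 2];
            if (skipNonPositive && width <= 0) {
                continue;
            }
            total += (long) cids * width;
            count += cids;
        }
        return count == 0 ? 0f : (float) total / count;
    }

    public static void main(String[] args) {
        System.out.println("all widths:      " + averageAll(W));
        System.out.println("positive widths: " + averagePositive(W));
    }
}
```

Worked by hand, the positive widths sum to 34641 over 63 CIDs (about 549.857), while the same sum spread over all 65532 covered CIDs gives about 0.5286, which agrees with the two figures above.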

I'll open an issue and commit a fix after sending this. It won't be in 2.0.3, 
but in 2.0.4.

Tilman




RE: PDFBox 2.0.3 TIKA comparison

2016-09-15 Thread Allison, Timothy B.
Perfect.  Thank you!

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Thursday, September 15, 2016 8:31 AM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison

Am 15.09.2016 um 13:52 schrieb Allison, Timothy B.:
>> The one apparent major new exception for PDF files was apparently fixed 
>> before 2.0.3.  So, please ignore that one!
>
> Wait...if possible, please confirm that you did fix this recently (within the 
> last week or two).  I ran pdfbox-app (2.0.3) on a handful of triggering 
> files and didn't get the exception...however, it is possible that 
> multithreading might trigger this exception.


I've fixed that 2 days ago, it's part of the RC.

BR
Andreas
>
> java.lang.NullPointerException
>   at org.apache.pdfbox.pdmodel.font.encoding.Encoding.overwrite(Encoding.java:118)
>   at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.applyDifferences(DictionaryEncoding.java:151)
>   at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.<init>(DictionaryEncoding.java:128)
>   at org.apache.pdfbox.pdmodel.font.PDSimpleFont.readEncoding(PDSimpleFont.java:129)
>   at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:209)
>   at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
>   at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
>   at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
>   at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
>   at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>   at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>   at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
>   at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>   at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>   at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)
>   at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
>   at org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:407)
>   at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)
>   at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:182)
>   at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)
>   at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>



RE: PDFBox 2.0.3 TIKA comparison

2016-09-15 Thread Allison, Timothy B.
If this doesn't look like something you've recently fixed, I can rerun with the 
actual 2.0.3-rc1 (only on pdfs!) and see if I'm still getting this exception.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, September 15, 2016 7:53 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 2.0.3 TIKA comparison
Importance: High

> The one apparent major new exception for PDF files was apparently fixed 
> before 2.0.3.  So, please ignore that one!

Wait...if possible, please confirm that you did fix this recently (within the 
last week or two).  I ran pdfbox-app (2.0.3) on a handful of triggering files 
and didn't get the exception...however, it is possible that multithreading 
might trigger this exception.

java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.encoding.Encoding.overwrite(Encoding.java:118)
    at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.applyDifferences(DictionaryEncoding.java:151)
    at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.<init>(DictionaryEncoding.java:128)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.readEncoding(PDSimpleFont.java:129)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:209)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
    at org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:407)
    at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)
    at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:182)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)




RE: PDFBox 2.0.3 TIKA comparison

2016-09-15 Thread Allison, Timothy B.
> The one apparent major new exception for PDF files was apparently fixed 
> before 2.0.3.  So, please ignore that one!

Wait...if possible, please confirm that you did fix this recently (within the 
last week or two).  I ran pdfbox-app (2.0.3) on a handful of triggering files 
and didn't get the exception...however, it is possible that multithreading 
might trigger this exception.

java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.font.encoding.Encoding.overwrite(Encoding.java:118)
    at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.applyDifferences(DictionaryEncoding.java:151)
    at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.<init>(DictionaryEncoding.java:128)
    at org.apache.pdfbox.pdmodel.font.PDSimpleFont.readEncoding(PDSimpleFont.java:129)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:209)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143)
    at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
    at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74)
    at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158)
    at org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:407)
    at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)
    at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:182)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)
    at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)




RE: PDFBox 2.0.3 TIKA comparison

2016-09-14 Thread Allison, Timothy B.
> Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, "content 
> extraction looks to have improved quite a bit" :-)

Y, absolutely.  Thank _you_ for reviewing the output and all of your other 
work, of course!

Cheers,

  Tim

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Wednesday, September 14, 2016 2:50 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison


> On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
>>
>>
>> There are some regressions in content extraction, but overall, 
>> content extraction looks to have improved quite a bit.  Looks like ~2 
>> million more "common English words" via Tilman's methodology.

After some wandering around I finally looked at content extraction only, at 
column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
It turned out that all files were from Delaware courts, so I've decided to look 
only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
The extracted text with 2.0.2 and 2.0.3 is

IN THE  COUR T OF  CHAN CER Y O F TH E STA TE OF  D ELA WARE

in 2.0.1 and 1.8 it is

IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE

For 1.8 the explanation is that text extraction takes words, while in
2.* each character is taken alone.

The bad result in 2.0.3 is because of an incorrect /W array. The space has a 
width of 3, while other characters have widths between 200 and 722. So PDFBox 
believes that there are spaces where there are none.
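
A minimal sketch of why an undersized space width can produce spurious spaces. It assumes a simplified heuristic in which a gap between glyphs counts as a word break once it exceeds a fraction of the declared space width. This is not PDFBox's actual algorithm; the class name, the 0.5 threshold, the gap of 20 and the "plausible" width of 250 are all illustrative assumptions. Only the faulty width of 3 comes from the file discussed above.

```java
class SpaceWidthSketch {
    // Hypothetical heuristic: treat a gap between two glyphs as a word break
    // when it exceeds a fraction of the declared width of the space glyph.
    static boolean looksLikeWordBreak(double gap, double declaredSpaceWidth) {
        final double threshold = 0.5; // illustrative value, not taken from PDFBox
        return gap > declaredSpaceWidth * threshold;
    }

    public static void main(String[] args) {
        double letterGap = 20; // made-up gap between two letters of the same word

        // With a plausible space width (assumed 250), the gap is ignored.
        System.out.println(looksLikeWordBreak(letterGap, 250)); // false

        // With the faulty width of 3, the same gap is read as a space.
        System.out.println(looksLikeWordBreak(letterGap, 3)); // true
    }
}
```

Under this assumption, almost any positive gap exceeds the threshold once the declared space width drops to 3, which would match the "COUR T OF  CHAN CER Y" output.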

The only mystery that remains is why it worked in 2.0.1. Maybe that one took an 
average glyph width for spaces, or the width value from the font itself. I'll 
find this out later, but it isn't a high priority. A look at column Q 
("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, "content extraction 
looks to have improved quite a bit" :-)

Thanks for testing!

Tilman






RE: PDFBox 2.0.3 TIKA comparison

2016-09-14 Thread Allison, Timothy B.

That was caused by a cap we placed in Tika in extracting XMP history: TIKA-1999 
[1]

We haven't switched to XMPBox...still on JempBox from 1.8.x.

[1] https://issues.apache.org/jira/browse/TIKA-1999

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Wednesday, September 14, 2016 12:52 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison

On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
> https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip
>
> This run was against the full corpus, not just PDFs.  I used a fairly recent 
> nightly build of PDFBox and POI's 3.15-rc1.
>
> The one apparent major new exception for PDF files was apparently fixed 
> before 2.0.3.  So, please ignore that one!
>
> There are some regressions in content extraction, but overall, content 
> extraction looks to have improved quite a bit.  Looks like ~2 million more 
> "common English words" via Tilman's methodology.
>
> Let me know if you have any questions.

I wonder what happened here:
commoncrawl2/SH/SHMSOEBK4QOJO5CY7BIWWDH6GHSTOXYM

metadata went from 6766 to 4134.

Is this a TIKA thing, or is this because of a change from xmpbox to jempbox?

Tilman






RE: PDFBox 2.0.3?

2016-09-14 Thread Allison, Timothy B.
https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip

This run was against the full corpus, not just PDFs.  I used a fairly recent 
nightly build of PDFBox and POI's 3.15-rc1.

The one apparent major new exception for PDF files was apparently fixed before 
2.0.3.  So, please ignore that one!

There are some regressions in content extraction, but overall, content 
extraction looks to have improved quite a bit.  Looks like ~2 million more 
"common English words" via Tilman's methodology.

Let me know if you have any questions.

Cheers,

 Tim

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Monday, September 12, 2016 12:58 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3?

On 12.09.2016 at 18:47, Allison, Timothy B. wrote:
> Let me know if/when to run a comparison between 2.0.3 and 2.0.1 (shipped w/ 
> Tika 1.13).

Yes please, when you have the time, I expect no more changes.

Tilman






RE: PDFBox 2.0.3?

2016-09-12 Thread Allison, Timothy B.
Let me know if/when to run a comparison between 2.0.3 and 2.0.1 (shipped w/ 
Tika 1.13).

Cheers,

   Tim


PDFBox 2.0.3?

2016-08-11 Thread Allison, Timothy B.
PDFBox Colleagues,
  We may be heading towards a release of Tika 1.14 over the next month, maybe 
early September.  Any plans for a PDFBox 2.0.3 release before then?  I'm happy 
to recommend to my Tika-colleagues a delay if you would naturally be releasing 
somewhere around then.

 Best,

Tim


FW: Apache Tika used to parse the Panama papers!

2016-04-06 Thread Allison, Timothy B.
Looks like quite a few PDFs [0]...

Couldn't have done it without you! 

Cheers,

   Tim

P.S. Tip of the hat to Andreas for retweeting the link!

[0] https://twitter.com/bigdata/status/717346207312392192 

-Original Message-
From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] 
Sent: Tuesday, April 05, 2016 6:47 PM
To: d...@tika.apache.org
Cc: pr...@apache.org
Subject: Apache Tika used to parse the Panama papers!

FYI:
http://www.forbes.com/sites/thomasbrewster/2016/04/05/panama-papers-amazon-encryption-epic-leak/?utm_campaign=ForbesTech&utm_source=TWITTER&utm_medium=social&utm_channel=Technology&linkId=23087770#709893771df5


BTW I know Thomas and am in touch..he wrote an article about MEMEX last year.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion 
Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate 
Professor, Computer Science Department University of Southern California, Los 
Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++







RE: shading/relocating 1.8.x?

2016-03-29 Thread Allison, Timothy B.
Got it.  That's what I had assumed.

I'll hold off on opening truncated file issue(s) on PDFBox's JIRA...  I opened 
TIKA-1912 to track this on our side.

Thank you, again!

Best,

  Tim

-Original Message-
From: Andreas Lehmkühler [mailto:andr...@lehmi.de] 
Sent: Tuesday, March 29, 2016 7:12 AM
To: dev@pdfbox.apache.org
Subject: RE: shading/relocating 1.8.x?

> On 28 March 2016 at 21:02, "Allison, Timothy B." wrote:
> 
> 
> Oh, wow, so it really might be possible without too much work?  I'm 
> more than happy to supply examples. :)
Oops, it isn't as simple as it sounds. If we simply swallow the exception, 
PDFBox most likely runs into an NPE. IMHO we have to implement some sort of 
on-demand parser that can handle null values for specific parts of a PDF 
without throwing any exception.

> Should I open an issue?
Thanks, but I'm going to do that soon, as some other things should be done as 
well.

BR
Andreas



RE: shading/relocating 1.8.x?

2016-03-28 Thread Allison, Timothy B.
Oh, wow, so it really might be possible without too much work?  I'm more than 
happy to supply examples. :) 

Should I open an issue?


-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Monday, March 28, 2016 10:58 AM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?

On 25.03.2016 at 17:39, John Hewson wrote:
>
>> On 23 Mar 2016, at 06:20, Allison, Timothy B.  wrote:
>>
>> All,
>>   We've upgraded to 2.0.0 on Tika.  Many thanks again!
>>   One of our users is interested in continuing to use the 
>> classic/SequentialParser, or at least having it available as a back-off 
>> parser for corrupt pdfs [0].
>
> Using the old parser really isn’t a good idea, it’s known to be pretty 
> broken. I think that we would be much better off making sure the new parser 
> can handle truncated files. We already do a lot of repair in the new parser, 
> so this doesn’t seem like too much work? Maybe Andreas can comment further?
The biggest issue here is a truncated stream or dictionary. The current 
version simply throws an exception when it runs into such cases. We 
have to implement some algorithm to ignore such incomplete parts of a PDF if 
possible.

BR
Andreas

>
> Do we have some JIRA issues which identify some of these cases?
>
> — John
>
>>   Would you be willing to distribute a shaded/relocated 1.8.x app so that we 
>> could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or, is 
>> there a better solution?
>
> I wouldn’t recommend doing that, because you’re going to be stuck with using 
> 1.8 for everything, not just parsing, at least as far as corrupt/truncated 
> files are concerned.
>
> — John
>
>>   Thank you!
>>
>>   Cheers,
>>
>>  Tim
>>
>> [0] 
>> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
>>



RE: shading/relocating 1.8.x?

2016-03-28 Thread Allison, Timothy B.
See:

https://issues.apache.org/jira/browse/TIKA-1285?focusedCommentId=15214111&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15214111
 

-Original Message-
From: John Hewson [mailto:j...@jahewson.com] 
Sent: Friday, March 25, 2016 1:03 PM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?


> On 25 Mar 2016, at 09:44, Tilman Hausherr  wrote:
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
> 
> https://issues.apache.org/jira/browse/PDFBOX-3265
> 

Great! Does anyone else have some others?

— John




RE: shading/relocating 1.8.x?

2016-03-25 Thread Allison, Timothy B.
Hi John,

  Normally, I'd agree.  And, y, I've been extremely grateful for the effort put 
into dealing with noisy PDFs in 2.0.0.  However, I think that the Tika user 
requesting this is interested in getting what he can from truncated and truly 
broken files -- e.g. Common Crawl data which (I think) truncates files at 1MB 
or may have had an interrupt during download.  My basic rule for opening an 
issue is if AR or another pdf parser can't parse it, I'm not going to ask for 
help.
 
   I wouldn't want to direct your all's efforts to dealing with the edge cases 
of truncated files.  If the old PDFParser is able to get something out because 
it parsed sequentially, then it would be neat to be able to have that available 
with very little effort.  In Tika, we envision allowing users to configure 
combinations of parsers for a given file, this would be the perfect case for 
the back-off-on-exception strategy -- if there's an exception with 2.0.0, try 
again with 1.8.x.
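
A minimal sketch of the back-off-on-exception idea described above: try the preferred parser first and fall back to the next one only if it throws. This is not Tika's API. The class and method names are hypothetical, and parsers are modelled as plain functions that throw unchecked exceptions, whereas real parsers throw checked exceptions and write to a handler.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class BackOffSketch {
    // Try each parser in order and return the first result that does not throw.
    static <I, O> O parseWithBackOff(I input, List<Function<I, O>> parsers) {
        RuntimeException last = null;
        for (Function<I, O> parser : parsers) {
            try {
                return parser.apply(input);
            } catch (RuntimeException e) {
                last = e; // remember the failure and fall through to the next parser
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("no parsers configured");
        }
        throw last; // every parser failed: surface the last failure
    }

    public static void main(String[] args) {
        // Stand-in for the 2.0.0 parser failing on a truncated file.
        Function<String, String> primary = s -> {
            throw new IllegalStateException("truncated file");
        };
        // Stand-in for the 1.8.x sequential parser salvaging what it can.
        Function<String, String> fallback = s -> "partial text from " + s;

        System.out.println(parseWithBackOff("broken.pdf", Arrays.asList(primary, fallback)));
        // prints "partial text from broken.pdf"
    }
}
```

The order of the list encodes the preference, so the 1.8.x parser would only ever run on files where 2.0.0 has already failed.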

  I'll try shading/relocating next week, and see whether that works as expected.

  Thank you, all, again!

  Cheers,

Tim


-Original Message-
From: John Hewson [mailto:j...@jahewson.com] 
Sent: Friday, March 25, 2016 1:03 PM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?


> On 25 Mar 2016, at 09:44, Tilman Hausherr  wrote:
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
> 
> https://issues.apache.org/jira/browse/PDFBOX-3265
> 

Great! Does anyone else have some others?

— John




shading/relocating 1.8.x?

2016-03-23 Thread Allison, Timothy B.
All,
  We've upgraded to 2.0.0 on Tika.  Many thanks again!
  One of our users is interested in continuing to use the 
classic/SequentialParser, or at least having it available as a back-off parser 
for corrupt pdfs [0].
  Would you be willing to distribute a shaded/relocated 1.8.x app so that we 
could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or, is 
there a better solution?

  Thank you!

  Cheers,

 Tim

[0] 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360




RE: The Apache® Software Foundation announces Apache PDFBox™ v2.0

2016-03-21 Thread Allison, Timothy B.
Congratulations! And, thank you!

Cheers,

  Tim

-Original Message-
From: Andreas Lehmkühler [mailto:andr...@lehmi.de] 
Sent: Monday, March 21, 2016 10:11 AM
To: us...@pdfbox.apache.org
Subject: Fwd: The Apache® Software Foundation announces Apache PDFBox™ v2.0




 Original Message 
From: Sally Khudairi 
Sent: 21 March 2016 12:44:18 CET
To: Apache Announce List 
Subject: The Apache® Software Foundation announces Apache PDFBox™ v2.0

>> this announcement is available online at https://s.apache.org/Ly9B

Milestone release of Open Source Java tool for working with PDF documents 
features dozens of improvements and enhancements

Forest Hill, MD —21 March 2016— The Apache Software Foundation (ASF), the 
all-volunteer developers, stewards, and incubators of more than 350 Open Source 
projects and initiatives, announced today the availability of Apache® PDFBox™ 
v2.0, the Open Source Java tool for working with Portable Document Format (PDF) 
documents. 

PDF was first released by Adobe Systems in 1993, and became an ISO 
International Standard - ISO 32000-1 in 2008. Apache PDFBox allows for the 
creation of new PDF documents, manipulation, rendering, signing of existing 
documents and the ability to extract content from documents. In addition, 
PDFBox includes several command line utilities. In February 2015, the project 
became the first Open Source Partner Organization of the PDF Association. 

"PDF is a very popular and easy to use format for document exchange. It is used 
by millions of people every day, however the format itself is quite complicated 
and a real challenge to write a piece of software to work with it," said 
Andreas Lehmkühler, Vice President of Apache PDFBox. "This new major release of 
PDFBox includes a lot of improvements, fixes and new features which should make 
the life easier for our users." 

Under The Hood
The Apache PDFBox library enables users to create new PDF documents, manipulate 
existing documents, extract content, digitally sign, print, and validate files 
against the PDF/A-1b standard. Its command line utilities include encrypt, 
decrypt, overlay, debugger, merger, PDFToImage, and TextToPDF. 

PDFBox v2.0 reflects 1,167 solved issues, 418 of which were back-ported to 
v1.8, as well as dozens of improvements and enhancements. Highlights include: 

 - improved rendering and text extraction
 - Unicode support for PDF creation
 - overhauled interactive forms support
 - extended signing and encryption support
 - overhauled parser including a self-healing mechanism for malformed or 
corrupted PDFs
 - reduced memory/resources footprint including fine grained control of memory 
usage
 - enhanced preflight module for PDF/A-1b conformance checking
 - rearranged package structure to allow smaller runtime environments 

A guide to migrating to v2.0 is available at 
http://pdfbox.apache.org/2.0/migration.html , with community support at 
http://pdfbox.apache.org/mailinglists.html 

"We thank all the people from our small but fine community for their support," 
explained Lehmkühler. "Special thanks also goes to our fellow colleagues from 
the Apache Tika project for their cooperation in stress-testing with a corpus 
of 250,000 PDF files." 

"We are grateful for the Google Summer of Code program," said PDFBox committer 
Tilman Hausherr. "The project allowed us to hire students to improve 3D 
rendering and the PDFDebugger stand-alone application, which also sped up our 
own bug finding." 

"Apache PDFBox v2.0 is a significant milestone as it took us several years to 
complete," added Lehmkühler. "This long-awaited release is the collective 
achievement of more than 150 individuals who have contributed code to date. 
Without their frequent contributions it wouldn't be possible to drive a project 
like PDFBox." 

Availability and Oversight
Apache PDFBox software is released under the Apache License v2.0 and is 
overseen by a self-selected team of active contributors to the project. A 
Project Management Committee (PMC) guides the Project's day-to-day operations, 
including community development and product releases. For downloads, 
documentation, and ways to become involved with Apache PDFBox, visit 
http://pdfbox.apache.org/ 

About The Apache Software Foundation (ASF)
Established in 1999, the 
all-volunteer Foundation oversees more than 350 leading Open Source projects, 
including Apache HTTP Server --the world's most popular Web server software. 
Through the ASF's meritocratic process known as "The Apache Way," more than 550 
individual Members and 5,300 Committers successfully collaborate to develop 
freely available enterprise-grade software, benefiting millions of users 
worldwide: thousands of software solutions are distributed under the Apache 
License; and the community actively participates in ASF mailing lists, 
mentoring initiatives, and ApacheCon, the Foundation's official user 
conference, trainings, and expo. The ASF is a US 501(c)(3) 

RE: roadmap for XMPBox?

2016-03-11 Thread Allison, Timothy B.
Thank you, Beat.  Y, as one of our devs pointed out, we're using that already 
in Tika in our XMP module for writing XMP...we haven't looked into using it for 
extraction.

-Original Message-
From: Beat Weisskopf [mailto:weissk...@glue.ch] 
Sent: Friday, March 11, 2016 3:40 AM
To: dev@pdfbox.apache.org
Subject: Re: roadmap for XMPBox?

Hi all

As a third option: What about the BSD-licensed Adobe XMP Toolkit? At least 
verapdf seems to use a fork of it: https://github.com/veraPDF/veraPDF-xmp

Cheers, beat


On 07.03.2016 at 19:31, Allison, Timothy B. wrote:
> All,
>
>
>
>When we migrate to PDFBox 2.x  over on Tika, I'd much prefer to switch 
> from our current reliance on jempbox to XMPBox.  I recently extracted ~70k 
> XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, 
> there were exceptions on roughly 40% of the XMPs.
>
>
>
>I’m including a table below of the counts of exception messages.  Are 
> there any plans to make XMPBox more lenient or is this what we can expect 
> going forward?
>
>
>
>As always, I’m more than happy to help with files and tests.  Let me know 
> what I can do.
>
>
>
>   Cheers,
>
>
>
>Tim
>
>
>
> No XmpParsingException on 42,022 files.
>
>
>
>
>
>
>
> Exceptions:
>
>
> Cannot find a definition for the namespace 
> http://ns.adobe.com/pdfx/1.3/
>
> 13403
>
> Type 'originalDocumentID' not defined in 
> http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>
> 3710
>
> Missing pdfaSchema:property in type definition
>
> 3113
>
> Expecting namespace 'adobe:ns:meta/' and found 
> 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>
> 2867
>
> Invalid array type, expecting Seq and found Bag [prefix=dc; 
> name=creator]
>
> 927
>
> Invalid array type, expecting Alt and found Seq [prefix=dc; 
> name=description]
>
> 723
>
> Cannot find a definition for the namespace 
> http://ns.adobe.com/xmp/InDesign/private
>
> 710
>
> Invalid array type, expecting Bag and found Seq [prefix=dc; 
> name=subject]
>
> 654
>
> Cannot find a definition for the namespace 
> http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>
> 522
>
> Failed to parse
>
> 492
>
> Invalid array definition, expecting Seq and found 
> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
> name=date]
>
> 370
>
> Cannot find a definition for the namespace 
> http://ns.adobe.com/illustrator/1.0/
>
> 262
>
> Cannot find a definition for the namespace 
> http://ns.adobe.com/xfa/promoted-desc/
>
> 188
>
> Failed to instanciate property in xmp:CreateDate
>
> 144
>
> Schema is not set in this document : 
> http://www.w3.org/1999/02/22-rdf-syntax-ns#
>
> 125
>
> Expecting local name 'xmpmeta' and found 'xapmeta'
>
> 94
>
> Cannot find a definition for the namespace 
> http://www.rwjf.org/rwjf/1.0
>
> 84
>
> Failed to instanciate property in xap:CreateDate
>
> 74
>
> Invalid array definition, expecting Bag and found 
> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
> name=language]
>
> 68
>
> Invalid array definition, expecting Alt and found 
> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
> name=title]
>
> 49
>
> Cannot find a definition for the namespace http://www.sap.com
>
> 46
>
> Failed to instanciate property in exif:ColorSpace
>
> 33
>
> Failed to instanciate property in xmpMM:History
>
> 28
>
> xmp should start with a processing instruction
>
> 26
>
> Cannot find a definition for the namespace 
> http://prismstandard.org/namespaces/basic/2.0/
>
> 24
>
> Cannot find a definition for the namespace 
> http://www.npes.org/pdfx/ns/id/
>
> 21
>
> Cannot find a definition for the namespace 
> http://ns.InsiderSoftware.com/fontlist/1.0/
>
> 14
>
> Invalid array definition, expecting Seq and found 
> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
> name=creator]
>
> 14
>
> Failed to instanciate property in xmp:MetadataDate
>
> 12
>
> Cannot find a definition for the namespace 
> http://ns.xinet.com/webnative/private/1.0/
>
> 10
>
> Failed to instanciate property in xap:ModifyDate
>
> 10
>
> Failed to instanciate property in xmp:ModifyDate
>
> 10
>
> Type 'params' not defined in 
> http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
>
> 9
>
> Invalid array type, expecting Seq and found Bag [prefix=xmpMM; 
> name=History]
>
>

RE: roadmap for XMPBox?

2016-03-08 Thread Allison, Timothy B.
> The comment I made is just my personal opinion. ... Maybe improve XMPBox as 
> you suggested (I did have a look but it doesn't seem easy).

Oh, ok, so it isn't necessarily set in stone.

What do other PDFBox devs think?  Is there interest in modifying XmpBox to be 
more lenient?  Not for 2.0.0, obviously... :)
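
One way to picture what "more lenient" could mean for the array-type failures in the table ("expecting Seq and found Bag"): a strict mode that rejects the mismatch and a lenient mode that accepts whichever container the file actually uses. This is only a sketch under that assumption. It is not XMPBox code, and the class, enum and method names are hypothetical.

```java
class LenientArrayTypeSketch {
    enum ArrayType { SEQ, BAG, ALT }

    // Strict mode rejects any mismatch; lenient mode accepts the container that was found.
    static ArrayType resolve(ArrayType expected, ArrayType found, boolean lenient) {
        if (expected == found || lenient) {
            return found;
        }
        throw new IllegalArgumentException(
            "Invalid array type, expecting " + expected + " and found " + found);
    }

    public static void main(String[] args) {
        // dc:creator stored as a Bag where a Seq is expected, as in the reported failures.
        System.out.println(resolve(ArrayType.SEQ, ArrayType.BAG, true)); // BAG
    }
}
```

A PDF/A validator would keep the strict path, while an extraction tool could opt into the lenient one.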

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, March 08, 2016 12:56 PM
To: dev@pdfbox.apache.org
Subject: Re: roadmap for XMPBox?

On 08.03.2016 at 18:44, Allison, Timothy B. wrote:
> Got it.  Thank you.  I wanted to confirm that nothing had changed since last 
> summer (PDFBOX-2855).
>
> Are you taking bug reports for jempbox or is that entirely eol'd?

Yes, I recently fixed a bug there.

> Any recommendations for a somewhat lenient, Apache license-compatible XMP 
> parser?

Sorry, don't know.

> Might it make sense to include in the README or in the package 
> javadocs something about the goals for XmpBox?  It is entirely 
> possible that I missed the warning. ;)

The comment I made is just my personal opinion. It's your comment that made me 
realize that with XMPBox, we can't parse some files that are not PDF/A 
compatible but are correct XMP files. I don't have an idea what to do. Maybe 
improve XMPBox as you suggested (I did have a look but it doesn't seem easy). 
Maybe resurrect Jempbox, or use the 1.8 version.

Tilman


>
> Thank you, again.
>
>  Best,
>
>Tim
>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, March 08, 2016 12:13 PM
> To: dev@pdfbox.apache.org
> Subject: Re: roadmap for XMPBox?
>
> I think the problem is that XmpBox was written for PDF/A checking, so it 
> fails with XMPs that are not PDF/A. For example, file 000142.pdf has the 
> schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>
> And no, there are no plans for anything on XMP at this time...
>
> Tilman
>
>
> On 07.03.2016 at 19:31, Allison, Timothy B. wrote:
>> All,
>>
>>
>>
>> When we migrate to PDFBox 2.x  over on Tika, I'd much prefer to switch 
>> from our current reliance on jempbox to XMPBox.  I recently extracted ~70k 
>> XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, 
>> there were exceptions on roughly 40% of the XMPs.
>>
>>
>>
>> I’m including a table below of the counts of exception messages.  Are 
>> there any plans to make XMPBox more lenient or is this what we can expect 
>> going forward?
>>
>>
>>
>> As always, I’m more than happy to help with files and tests.  Let me 
>> know what I can do.
>>
>>
>>
>>Cheers,
>>
>>
>>
>> Tim
>>
>>
>>
>> No XmpParsingException on 42,022 files.
>>
>>
>>
>>
>>
>>
>>
>> Exceptions:
>>
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/pdfx/1.3/
>>
>> 13403
>>
>> Type 'originalDocumentID' not defined in 
>> http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>
>> 3710
>>
>> Missing pdfaSchema:property in type definition
>>
>> 3113
>>
>> Expecting namespace 'adobe:ns:meta/' and found 
>> 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>>
>> 2867
>>
>> Invalid array type, expecting Seq and found Bag [prefix=dc; 
>> name=creator]
>>
>> 927
>>
>> Invalid array type, expecting Alt and found Seq [prefix=dc; 
>> name=description]
>>
>> 723
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/xmp/InDesign/private
>>
>> 710
>>
>> Invalid array type, expecting Bag and found Seq [prefix=dc; 
>> name=subject]
>>
>> 654
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>>
>> 522
>>
>> Failed to parse
>>
>> 492
>>
>> Invalid array definition, expecting Seq and found 
>> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
>> name=date]
>>
>> 370
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/illustrator/1.0/
>>
>> 262
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/xfa/promoted-desc/
>>
>> 188
>>

RE: roadmap for XMPBox?

2016-03-08 Thread Allison, Timothy B.
Got it.  Thank you.  I wanted to confirm that nothing had changed since last 
summer (PDFBOX-2855).  

Are you taking bug reports for jempbox or is that entirely eol'd?  

Any recommendations for a somewhat lenient, Apache license-compatible XMP 
parser?

Might it make sense to include in the README or in the package javadocs 
something about the goals for XmpBox?  It is entirely possible that I missed 
the warning. ;)

Thank you, again.

Best,

  Tim

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Tuesday, March 08, 2016 12:13 PM
To: dev@pdfbox.apache.org
Subject: Re: roadmap for XMPBox?

I think the problem is that XmpBox was written for PDF/A checking, so it fails 
with XMPs that are not PDF/A. For example, file 000142.pdf has the schema 
http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf

And no, there are no plans for anything on XMP at this time...

Tilman


On 07.03.2016 at 19:31, Allison, Timothy B. wrote:
> All,
>
>
>
>When we migrate to PDFBox 2.x  over on Tika, I'd much prefer to switch 
> from our current reliance on jempbox to XMPBox.  I recently extracted ~70k 
> XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, 
> there were exceptions on roughly 40% of the XMPs.
>
>
>
>I’m including a table below of the counts of exception messages.  Are 
> there any plans to make XMPBox more lenient or is this what we can expect 
> going forward?
>
>
>
>As always, I’m more than happy to help with files and tests.  Let me know 
> what I can do.
>
>
>
>   Cheers,
>
>
>
>Tim
>
>
>
> No XmpParsingException on 42,022 files.
>
>
>
>
>
>
>
> Exceptions:
>
>
> Cannot find a definition for the namespace 
> http://ns.adobe.com/pdfx/1.3/
>
> 13403
>
> Type 'originalDocumentID' not defined in 
> http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>
> 3710
>
> Missing pdfaSchema:property in type definition
>
> 3113
>
> Expecting namespace 'adobe:ns:meta/' and found 
> 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>
> 2867
>
> Invalid array type, expecting Seq and found Bag [prefix=dc; 
> name=creator]
>
> 927
>
> Invalid array type, expecting Alt and found Seq [prefix=dc; 
> name=description]
>
> 723
>
> Cannot find a definition for the namespace 
> http://ns.adobe.com/xmp/InDesign/private
>
> 710
>
> Invalid array type, expecting Bag and found Seq [prefix=dc; 
> name=subject]
>
> 654
>
> Cannot find a definition for the namespace 
> http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>
> 522
>
> Failed to parse
>
> 492
>
> Invalid array definition, expecting Seq and found 
> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
> name=date]
>
> 370
>
> Cannot find a definition for the namespace 
> http://ns.adobe.com/illustrator/1.0/
>
> 262
>
> Cannot find a definition for the namespace 
> http://ns.adobe.com/xfa/promoted-desc/
>
> 188
>
> Failed to instanciate property in xmp:CreateDate
>
> 144
>
> Schema is not set in this document : 
> http://www.w3.org/1999/02/22-rdf-syntax-ns#
>
> 125
>
> Expecting local name 'xmpmeta' and found 'xapmeta'
>
> 94
>
> Cannot find a definition for the namespace 
> http://www.rwjf.org/rwjf/1.0
>
> 84
>
> Failed to instanciate property in xap:CreateDate
>
> 74
>
> Invalid array definition, expecting Bag and found 
> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
> name=language]
>
> 68
>
> Invalid array definition, expecting Alt and found 
> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
> name=title]
>
> 49
>
> Cannot find a definition for the namespace http://www.sap.com
>
> 46
>
> Failed to instanciate property in exif:ColorSpace
>
> 33
>
> Failed to instanciate property in xmpMM:History
>
> 28
>
> xmp should start with a processing instruction
>
> 26
>
> Cannot find a definition for the namespace 
> http://prismstandard.org/namespaces/basic/2.0/
>
> 24
>
> Cannot find a definition for the namespace 
> http://www.npes.org/pdfx/ns/id/
>
> 21
>
> Cannot find a definition for the namespace 
> http://ns.InsiderSoftware.com/fontlist/1.0/
>
> 14
>
> Invalid array definition, expecting Seq and found 
> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
> name=creator]
>
> 14
>
> Failed to

RE: roadmap for XMPBox?

2016-03-07 Thread Allison, Timothy B.
XLSX summary and 89MB of XMPs available here: 

http://162.242.228.174/xmp_work/ 

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, March 07, 2016 1:31 PM
To: dev@pdfbox.apache.org
Subject: roadmap for XMPBox?

All,



  When we migrate to PDFBox 2.x  over on Tika, I'd much prefer to switch from 
our current reliance on jempbox to XMPBox.  I recently extracted ~70k XMPs from 
PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were 
exceptions on roughly 40% of the XMPs.



  I’m including a table below of the counts of exception messages.  Are there 
any plans to make XMPBox more lenient or is this what we can expect going 
forward?



  As always, I’m more than happy to help with files and tests.  Let me know 
what I can do.



 Cheers,



  Tim



No XmpParsingException on 42,022 files.







Exceptions:


Count  Message
13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
 3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
 3113  Missing pdfaSchema:property in type definition
 2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
  927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
  723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
  710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
  654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
  522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
  492  Failed to parse
  370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
  262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
  188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
  144  Failed to instanciate property in xmp:CreateDate
  125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
   94  Expecting local name 'xmpmeta' and found 'xapmeta'
   84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
   74  Failed to instanciate property in xap:CreateDate
   68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
   49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
   46  Cannot find a definition for the namespace http://www.sap.com
   33  Failed to instanciate property in exif:ColorSpace
   28  Failed to instanciate property in xmpMM:History
   26  xmp should start with a processing instruction
   24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
   21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
   14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
   14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
   12  Failed to instanciate property in xmp:MetadataDate
   10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
   10  Failed to instanciate property in xap:ModifyDate
   10  Failed to instanciate property in xmp:ModifyDate
    9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
    8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
    8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
    7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
    7  Cannot find a definition for the namespace ptc
    6  Failed to instanciate property in xapMM:History
    5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
    5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
    4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
    4  Excepted xpacket 'end' attribute (must be present and placed in first)
    3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
    3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
    2  no message (NPE)
    2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
    2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
    2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
    2  Failed to instanciate property in xapRights:Marked

Invalid array type, expecting Alt and found Bag [prefix=dc; name=titl

roadmap for XMPBox?

2016-03-07 Thread Allison, Timothy B.
All,



  When we migrate to PDFBox 2.x  over on Tika, I'd much prefer to switch from 
our current reliance on jempbox to XMPBox.  I recently extracted ~70k XMPs from 
PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were 
exceptions on roughly 40% of the XMPs.
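
For anyone who wants to reproduce this kind of tally, here is a minimal sketch in plain Java. It is not what I actually ran. It assumes a hypothetical XmpCheck callback standing in for the real parser call (for example, something that hands the XMP to XMPBox and lets the parsing exception propagate); ExceptionTally, XmpCheck and tally are illustrative names, not PDFBox or XMPBox API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ExceptionTally {

    /** Hypothetical stand-in for the real parser call; throws on a bad XMP. */
    interface XmpCheck {
        void parse(String xmp) throws Exception;
    }

    /** Counts exception messages over all inputs, most frequent first. */
    static Map<String, Integer> tally(List<String> xmps, XmpCheck check) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String xmp : xmps) {
            try {
                check.parse(xmp);
            } catch (Exception e) {
                // Exceptions without a message are grouped by class name.
                String msg = e.getMessage() != null
                        ? e.getMessage()
                        : "no message (" + e.getClass().getSimpleName() + ")";
                counts.merge(msg, 1, Integer::sum);
            }
        }
        // Stable sort: ties keep the alphabetical order of the TreeMap.
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort((a, b) -> b.getValue().compareTo(a.getValue()));
        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> row : rows) {
            sorted.put(row.getKey(), row.getValue());
        }
        return sorted;
    }
}
```

Grouping on the message alone is crude (messages that embed a namespace URI split into many rows), but it is enough to see which failure classes dominate.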



  I’m including a table below of the counts of exception messages.  Are there 
any plans to make XMPBox more lenient or is this what we can expect going 
forward?



  As always, I’m more than happy to help with files and tests.  Let me know 
what I can do.



 Cheers,



  Tim



No XmpParsingException on 42,022 files.







Exceptions:


Count  Message
13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
 3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
 3113  Missing pdfaSchema:property in type definition
 2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
  927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
  723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
  710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
  654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
  522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
  492  Failed to parse
  370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
  262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
  188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
  144  Failed to instanciate property in xmp:CreateDate
  125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
   94  Expecting local name 'xmpmeta' and found 'xapmeta'
   84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
   74  Failed to instanciate property in xap:CreateDate
   68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
   49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
   46  Cannot find a definition for the namespace http://www.sap.com
   33  Failed to instanciate property in exif:ColorSpace
   28  Failed to instanciate property in xmpMM:History
   26  xmp should start with a processing instruction
   24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
   21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
   14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
   14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
   12  Failed to instanciate property in xmp:MetadataDate
   10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
   10  Failed to instanciate property in xap:ModifyDate
   10  Failed to instanciate property in xmp:ModifyDate
    9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
    8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
    8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
    7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
    7  Cannot find a definition for the namespace ptc
    6  Failed to instanciate property in xapMM:History
    5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
    5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
    4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
    4  Excepted xpacket 'end' attribute (must be present and placed in first)
    3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
    3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
    2  no message (NPE)
    2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
    2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
    2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
    2  Failed to instanciate property in xapRights:Marked
    2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
    2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
    2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
    1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
    1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/

Cannot f

RE: [VOTE] Release Apache PDFBox 1.8.11

2016-01-13 Thread Allison, Timothy B.
Turns out the same exceptions also occur in 1.8.10 with those combinations of 
Java version and OS.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Tuesday, January 12, 2016 1:49 PM
To: dev@pdfbox.apache.org
Subject: RE: [VOTE] Release Apache PDFBox 1.8.11

Ah, ok. Thank you.

With the following on linux, all is well:
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)


The test failures were with:
Linux
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

and

Windows:
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)


-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Tuesday, January 12, 2016 1:33 PM
To: dev@pdfbox.apache.org
Subject: Re: [VOTE] Release Apache PDFBox 1.8.11

Hmmm,


everything works fine for me after a fresh checkout, at least on linux.

Maybe some issue with the jdk? Which one are you using for your tests? I ran 
into some problems (test failures during rendering) whenever using the openjdk 
which comes with fedora by default. Those disappear once I switch to oracle jdk.

BR,
Andreas


Am 12.01.2016 um 18:58 schrieb Allison, Timothy B.:
> All,
>
> Is this user error?  I'm getting 3 test exceptions in both Windows and 
> Linux in the preflight module after I did an svn checkout from:
> http://svn.apache.org/repos/asf/pdfbox/tags/1.8.11/
>
> Revision: 1724292
> Node Kind: directory
> Schedule: normal
> Last Changed Author: lehmi
> Last Changed Rev: 1724120
> Last Changed Date: 2016-01-11 14:26:57 -0500 (Mon, 11 Jan 2016)
>
>
> In RHEL:
> Failed tests:
> testAllInfoSynhcronized(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation):
>  expected:<0> but was:<2>
>
> testBadPrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation):
>  null expected:<7.[4.]2> but was:<7.[]2>
>
> testdoublePrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation)
>
> Tests run: 72, Failures: 3, Errors: 0, Skipped: 0
>
> In Windows:
> "C:\Program Files\Java\jdk1.8\bin\java" true
> System property 'pdfa.invalid' not defined, will not run TestValidaDirectory
> TestIsartorValidationFromClasspath2.initializeParameters(): No input files found
> System property 'pdfa.valid' not defined, will not run TestValidaDirectory
>
> junit.framework.AssertionFailedError:
> Expected :0
> Actual   :2
>
>   at junit.framework.Assert.fail(Assert.java:47)
>   at junit.framework.Assert.failNotEquals(Assert.java:283)
>   at junit.framework.Assert.assertEquals(Assert.java:64)
>   at junit.framework.Assert.assertEquals(Assert.java:195)
>   at junit.framework.Assert.assertEquals(Assert.java:201)
>   at org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation.testAllInfoSynhcronized(TestSynchronizedMetadataValidation.java:422)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(Ru

RE: [VOTE] Release Apache PDFBox 1.8.11

2016-01-12 Thread Allison, Timothy B.
Ah, ok. Thank you.

With the following on linux, all is well:
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)


The test failures were with:
Linux
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

and

Windows:
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
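
Since the failures track the JDK/OS combination, it may help to print that combination next to any test report instead of pasting "java -version" by hand. A small sketch using only standard system properties; EnvReport and describe are made-up names for illustration, not part of the PDFBox build.

```java
final class EnvReport {

    /** Builds a three-line summary of OS, Java version, and VM. */
    static String describe() {
        return String.join(System.lineSeparator(),
                "os:   " + System.getProperty("os.name") + " "
                        + System.getProperty("os.version")
                        + " (" + System.getProperty("os.arch") + ")",
                "java: " + System.getProperty("java.version")
                        + " (" + System.getProperty("java.vendor") + ")",
                "vm:   " + System.getProperty("java.vm.name")
                        + " (build " + System.getProperty("java.vm.version") + ")");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```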


-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Tuesday, January 12, 2016 1:33 PM
To: dev@pdfbox.apache.org
Subject: Re: [VOTE] Release Apache PDFBox 1.8.11

Hmmm,


everything works fine for me after a fresh checkout, at least on linux.

Maybe some issue with the jdk? Which one are you using for your tests? I ran 
into some problems (test failures during rendering) whenever using the openjdk 
which comes with fedora by default. Those disappear once I switch to oracle jdk.

BR,
Andreas


Am 12.01.2016 um 18:58 schrieb Allison, Timothy B.:
> All,
>
> Is this user error?  I'm getting 3 test exceptions in both Windows and 
> Linux in the preflight module after I did an svn checkout from: 
> http://svn.apache.org/repos/asf/pdfbox/tags/1.8.11/
>
> Revision: 1724292
> Node Kind: directory
> Schedule: normal
> Last Changed Author: lehmi
> Last Changed Rev: 1724120
> Last Changed Date: 2016-01-11 14:26:57 -0500 (Mon, 11 Jan 2016)
>
>
> In RHEL:
> Failed tests:
> testAllInfoSynhcronized(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation):
>  expected:<0> but was:<2>
>
> testBadPrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation):
>  null expected:<7.[4.]2> but was:<7.[]2>
>
> testdoublePrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation)
>
> Tests run: 72, Failures: 3, Errors: 0, Skipped: 0
>
> In Windows:
> "C:\Program Files\Java\jdk1.8\bin\java" true
> System property 'pdfa.invalid' not defined, will not run TestValidaDirectory
> TestIsartorValidationFromClasspath2.initializeParameters(): No input files found
> System property 'pdfa.valid' not defined, will not run TestValidaDirectory
>
> junit.framework.AssertionFailedError:
> Expected :0
> Actual   :2
>
>   at junit.framework.Assert.fail(Assert.java:47)
>   at junit.framework.Assert.failNotEquals(Assert.java:283)
>   at junit.framework.Assert.assertEquals(Assert.java:64)
>   at junit.framework.Assert.assertEquals(Assert.java:195)
>   at junit.framework.Assert.assertEquals(Assert.java:201)
>   at org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation.testAllInfoSynhcronized(TestSynchronizedMetadataValidation.java:422)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:24)
>   at org.junit.runners.ParentRunner$3.run(Pare

RE: [VOTE] Release Apache PDFBox 1.8.11

2016-01-12 Thread Allison, Timothy B.
All,

Is this user error?  I'm getting 3 test exceptions in both Windows and Linux in 
the preflight module after I did an svn checkout from: 
http://svn.apache.org/repos/asf/pdfbox/tags/1.8.11/

Revision: 1724292
Node Kind: directory
Schedule: normal
Last Changed Author: lehmi
Last Changed Rev: 1724120
Last Changed Date: 2016-01-11 14:26:57 -0500 (Mon, 11 Jan 2016)


In RHEL:
Failed tests:   
testAllInfoSynhcronized(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation):
 expected:<0> but was:<2>
  
testBadPrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation):
 null expected:<7.[4.]2> but was:<7.[]2>
  
testdoublePrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation)

Tests run: 72, Failures: 3, Errors: 0, Skipped: 0

In Windows:
"C:\Program Files\Java\jdk1.8\bin\java" true
System property 'pdfa.invalid' not defined, will not run TestValidaDirectory
TestIsartorValidationFromClasspath2.initializeParameters(): No input files found
System property 'pdfa.valid' not defined, will not run TestValidaDirectory

junit.framework.AssertionFailedError: 
Expected :0
Actual   :2
  



at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.failNotEquals(Assert.java:283)
at junit.framework.Assert.assertEquals(Assert.java:64)
at junit.framework.Assert.assertEquals(Assert.java:195)
at junit.framework.Assert.assertEquals(Assert.java:201)
at org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation.testAllInfoSynhcronized(TestSynchronizedMetadataValidation.java:422)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:24)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:234)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)


junit.framework.ComparisonFailure: null 
Expected :7.4.2
Actual   :7.2
  



at junit.framework.Assert.assertEquals(Assert.java:81)
at junit.framework.Assert.assertEquals(Assert.java:87)
at org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation.testBadPrefixSchemas(TestSynchronizedMetadataValidation.java:499)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.

comparison of 1.8.10 and 2.0 trunk

2015-10-23 Thread Allison, Timothy B.
All,

  Apologies for the delay.  I finally finished the comparison of text extracted 
from 100k pdfs with 1.8.10 and 2.0 trunk (pdfbox-2.0.0-20151022.051152-1783).
The reports are available here [0].  I botched the commit message...

  I haven't had a chance to review the results.  The eval code is still in 
development and there might be bugs! To view the docs, prepend: h t t p : slash 
slash one six two . two four two . two two eight . one seven four/docs/  ... 
just don't let any of the scrapers read that. ;)  The docs include all those 
within our corpus that had a rtl word (when extracted with 1.8.10 :)) and then 
I took a random selection to fill out ~100k pdfs from common crawl and govdocs1.

  Let me know if you have any questions.

  Cheers,

 Tim


[0] 
https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip



RE: Subclassing BaseParser?

2015-10-05 Thread Allison, Timothy B.
Nope, not missing anything...that did it, of course.  Sorry. Seems like more 
overhead than we need for this use, but that works.  Will go with that.  Thank 
you!

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Monday, October 05, 2015 3:07 PM
To: dev@pdfbox.apache.org
Subject: Re: Subclassing BaseParser?

John did that one and he's very sensitive about making stuff public. What prevents 
you from extending COSParser as in that example code I posted at that time? Or 
am I missing something, i.e. was this for something different?

Tilman

Am 05.10.2015 um 13:25 schrieb Allison, Timothy B.:
>
> [switching to dev because this is entering into dev land]
>
> Y, I did and do have it working for the 1.8.x branch.  I either had it 
> working for the 2.0 branch before the change to SequentialSource was 
> made, or there's a chance that I never got around to integrating it 
> into our dev wrapper for 2.0. :( Happy to be back working on 2.0, though!
>
> Is there any chance of making SequentialSource and its friends public 
> or possibly offering a RandomAccessRead constructor for BaseParser?
> Or, is there another cleaner solution to allow subclassing of 
> BaseParser outside of o.a.p.pdfparser?
>
> Plan D: move the "fixing" of metadata strings that are improperly 
> PDFEncoded into PDFBox.
>
> Thank you!
>
> Best,
>
> Tim
>
> *From:*Tilman Hausherr [mailto:thaush...@t-online.de]
> *Sent:* Sunday, October 04, 2015 8:34 AM
> *To:* us...@pdfbox.apache.org
> *Subject:* Re: Subclassing BaseParser?
>
> Am 03.10.2015 um 21:13 schrieb Allison, Timothy B.:
>
> All,
>
>I'm probably suffering from the same failure that led to 
> (https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14640370&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640370),
>  but is it possible to subclass BaseParser outside of the oap.pdfparser 
> package?
>
>The actual subclassing of BaseParser is no problem, but what can I 
> substitute for SequentialSource, given that it and RandomAccessSource are 
> package-private?
>
>
> But later in that issue, you wrote that "all is well", so I didn't 
> bother. But it is true that currently, BaseParser can only be extended 
> within its package, due to RandomAccessSource and SequentialSource.
> There's even a netbeans warning because of that.
>
>
>


-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



RE: Subclassing BaseParser?

2015-10-05 Thread Allison, Timothy B.
[switching to dev because this is entering into dev land]

Y, I did and do have it working for the 1.8.x branch.  I either had it working 
for the 2.0 branch before the change to SequentialSource was made, or there's a 
chance that I never got around to integrating it into our dev wrapper for 2.0.  
:(  Happy to be back working on 2.0, though!

Is there any chance of making SequentialSource and its friends public or 
possibly offering a RandomAccessRead constructor for BaseParser?  Or, is there 
another cleaner solution to allow subclassing of BaseParser outside of 
o.a.p.pdfparser?

Plan D: move the "fixing" of metadata strings that are improperly PDFEncoded 
into PDFBox.

Thank you!

Best,

   Tim

From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Sunday, October 04, 2015 8:34 AM
To: us...@pdfbox.apache.org
Subject: Re: Subclassing BaseParser?

Am 03.10.2015 um 21:13 schrieb Allison, Timothy B.:

All,



  I'm probably suffering from the same failure that led to 
(https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14640370&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640370),
 but is it possible to subclass BaseParser outside of the oap.pdfparser package?



  The actual subclassing of BaseParser is no problem, but what can I substitute 
for SequentialSource, given that it and RandomAccessSource are package-private?



But later in that issue, you wrote that "all is well", so I didn't bother. But 
it is true that currently, BaseParser can only be extended within its package, 
due to RandomAccessSource and SequentialSource. There's even a netbeans warning 
because of that.




RE: help debugging integration of PDFBox 2.0.0 trunk

2015-07-20 Thread Allison, Timothy B.

>>Xmx doesn't limit native memory, so if there's a leak associated with AWT, 
>>ImageIO C libraries, or some other JNI library, the process can grow without 
>>limit. Such a leak could be due to a bug, or us not calling close() somewhere.

Got it.  Ok.  Is there anything I can do to help figure out what's going on?
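
One low-tech thing I could log from inside the child: the JVM's own heap figures next to the resident set size the kernel reports. If RSS keeps climbing while used heap stays flat, the growth is outside what -Xmx caps. A rough sketch, assuming Linux (it reads /proc/self/status); NativeMemoryProbe and parseVmRssKb are illustrative names, not anything in PDFBox or Tika.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

final class NativeMemoryProbe {

    /** Extracts the VmRSS value (in kB) from /proc/<pid>/status lines, or -1. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Line looks like "VmRSS:<whitespace>123456 kB".
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Runtime rt = Runtime.getRuntime();
        long heapUsedKb = (rt.totalMemory() - rt.freeMemory()) / 1024;
        long heapMaxKb = rt.maxMemory() / 1024;
        Path status = Paths.get("/proc/self/status");
        long rssKb = Files.isReadable(status)
                ? parseVmRssKb(Files.readAllLines(status))
                : -1; // not Linux, or /proc unavailable
        System.out.println("heapUsedKb=" + heapUsedKb
                + " heapMaxKb=" + heapMaxKb + " rssKb=" + rssKb);
    }
}
```

Calling something like main's body on a timer would give a trend per child process; it does not say which native library is responsible, only whether the heap is the culprit.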




RE: help debugging integration of PDFBox 2.0.0 trunk

2015-07-20 Thread Allison, Timothy B.
With ~125k files, there were 10 restarts: 7x with exit code=137, 2x with exit 
code=1, and 1x with exit code=253, which was a timeout for: 26.pdf.

A restart happens roughly every 8-10 minutes.

502907 2015-07-20 17:13:24,420 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Must restart process (exitValue=137 numRestarts=0 receivedRestartMessage=false)
986787 2015-07-20 17:21:28,300 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Must restart process (exitValue=253 numRestarts=1 receivedRestartMessage=false)
1574818 2015-07-20 17:31:16,331 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Must restart process (exitValue=137 numRestarts=2 receivedRestartMessage=false)
2040741 2015-07-20 17:39:02,254 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Must restart process (exitValue=137 numRestarts=3 receivedRestartMessage=false)
2545702 2015-07-20 17:47:27,215 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Must restart process (exitValue=137 numRestarts=4 receivedRestartMessage=false)
3084672 2015-07-20 17:56:26,185 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Must restart process (exitValue=137 numRestarts=5 receivedRestartMessage=false)
3571616 2015-07-20 18:04:33,129 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Must restart process (exitValue=1 numRestarts=6 receivedRestartMessage=false)
4021342 2015-07-20 18:12:02,855 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Must restart process (exitValue=1 numRestarts=7 receivedRestartMessage=false)
4503161 2015-07-20 18:20:04,674 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Must restart process (exitValue=137 numRestarts=8 receivedRestartMessage=false)
4958976 2015-07-20 18:27:40,489 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Must restart process (exitValue=137 numRestarts=9 receivedRestartMessage=false)
5437962 2015-07-20 18:35:39,475 [main] WARN  org.apache.tika.batch.BatchProcessDriverCLI  - Hit the maximum number of process restarts. Driver is shutting down now.
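
The counts above came from eyeballing the log; a throwaway sketch like the following would do the same tally. It also decodes the usual 128 + signal convention, under which 137 = 128 + 9 (SIGKILL), consistent with the OS killing the child. RestartLogTally and its methods are illustrative names, not Tika code, and the signal range of 1 to 64 is an assumption for Linux.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RestartLogTally {

    private static final Pattern EXIT = Pattern.compile("exitValue=(\\d+)");

    /** Counts how often each exitValue=N appears in the driver log text. */
    static Map<Integer, Integer> countExitValues(String log) {
        Map<Integer, Integer> counts = new TreeMap<>();
        Matcher m = EXIT.matcher(log);
        while (m.find()) {
            counts.merge(Integer.parseInt(m.group(1)), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the signal number if the value looks like 128 + signal, else -1. */
    static int signalFor(int exitValue) {
        int signal = exitValue - 128;
        return (signal >= 1 && signal <= 64) ? signal : -1;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the driver log file.
        String log = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.UTF_8);
        for (Map.Entry<Integer, Integer> e : countExitValues(log).entrySet()) {
            System.out.println("exitValue=" + e.getKey() + " count=" + e.getValue()
                    + " signal=" + signalFor(e.getKey()));
        }
    }
}
```

Note that 253 falls outside the signal range, so it is reported as -1 here, which matches it being the timeout exit rather than a kill.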

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Monday, July 20, 2015 3:18 PM
To: dev@pdfbox.apache.org
Subject: RE: help debugging integration of PDFBox 2.0.0 trunk

Y, sorry, Tilman.  I'm not running into problems with 1.8.9 and straight text 
extraction, though.

Following Timo's recommendation...looks like a memory issue.  Let me know if I 
should post the full file or move to a more recent version of Java. :)

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 403177472 bytes for 
committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
...
#  Out of Memory Error (os_linux.cpp:2798), pid=14958, tid=140419564971776
...
vm_info: OpenJDK 64-Bit Server VM (24.75-b04) for linux-amd64 JRE 
(1.7.0_75-b13), built on Jan 16 2015 09:15:47 by "mockbuild" with gcc 4.8.2 
20140120 (Red Hat 4.8.2-16)


-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Monday, July 20, 2015 1:28 PM
To: dev@pdfbox.apache.org
Subject: Re: help debugging integration of PDFBox 2.0.0 trunk

Am 20.07.2015 um 18:12 schrieb Allison, Timothy B.:
> All,
> While integrating 2.0.0 trunk into Tika and running against govdocs1, I'm 
> finding two issues that are difficult to reproduce.
>
> Background:
> Tika-batch has a parent process that kicks off a Tika processor in a child 
> process; if the child dies unexpectedly, the parent kicks it off again.  I'm 
> running with 10 consumer/parser threads and -Xmx5g on an 8 cpu/8GB vm (RHEL 
> 7, Linux cloud-server-02 3.10.0-123.20.1.el7.x86_64 #1 SMP Wed Jan 21 
> 09:45:55 EST 2015 x86_64 x86_64 x86_64 GNU/Linux).
>
> Two problems:
>
> 1)  The child process exits with value 1. I'm catching Throwable around 
> the primary execution call in the child process and logging it; nothing shows 
> up in the log files from that part of the code. From the parser log files (at 
> trace), I can tell which 10 files were being processed at the time, but I'm 
> not seeing any other information about what caused the exit.  When I run 
> against just those 10 files, all is ok.
>
> 2)  The OS is killing the child far more often than it does with 1.8.9 
> (exit code 137).
>
> For the second problem, I'll wait until the optimizations to the caching are 
> completed before I start worrying about that.  However, do you have any 
> recommendations on how to figure out what's going on with 1)?

I'm also having some problem with that system... with my test software, 
I have observed that java uses more and more space, despite it being 
told not to use more than a certain amount with -Xmx. After some time, 
the "process killer" kills the application.

RE: help debugging integration of PDFBox 2.0.0 trunk

2015-07-20 Thread Allison, Timothy B.
Y, sorry, Tilman.  I'm not running into problems with 1.8.9 and straight text 
extraction, though.

Following Timo's recommendation...looks like a memory issue.  Let me know if I 
should post the full file or move to a more recent version of Java. :)

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 403177472 bytes for 
committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
...
#  Out of Memory Error (os_linux.cpp:2798), pid=14958, tid=140419564971776
...
vm_info: OpenJDK 64-Bit Server VM (24.75-b04) for linux-amd64 JRE 
(1.7.0_75-b13), built on Jan 16 2015 09:15:47 by "mockbuild" with gcc 4.8.2 
20140120 (Red Hat 4.8.2-16)
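
The excerpt above has the format of a HotSpot fatal error log, which the JVM writes by default as hs_err_pid<pid>.log in the process working directory. When the JVM aborts this way, no Throwable reaches application code, which would explain an empty application log. A minimal sketch for collecting such files after a child exits follows; the class name is hypothetical, and it assumes the default file name and location were not overridden.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class CrashLogFinder {
    private CrashLogFinder() {}

    /** Lists files in dir that match the default HotSpot crash-log name. */
    public static List<Path> findCrashLogs(Path dir) throws IOException {
        List<Path> found = new ArrayList<>();
        // The glob matches hs_err_pid14958.log and similar names.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir, "hs_err_pid*.log")) {
            for (Path p : stream) {
                found.add(p);
            }
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory; pass the child's working directory instead.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : findCrashLogs(dir)) {
            System.out.println(p);
        }
    }
}
```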


-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de] 
Sent: Monday, July 20, 2015 1:28 PM
To: dev@pdfbox.apache.org
Subject: Re: help debugging integration of PDFBox 2.0.0 trunk

Am 20.07.2015 um 18:12 schrieb Allison, Timothy B.:
> All,
> While integrating 2.0.0 trunk into Tika and running against govdocs1, I'm 
> finding two issues that are difficult to reproduce.
>
> Background:
> Tika-batch has a parent process that kicks off a Tika processor in a child 
> process; if the child dies unexpectedly, the parent kicks it off again.  I'm 
> running with 10 consumer/parser threads and -Xmx5g on an 8 cpu/8GB vm (RHEL 
> 7, Linux cloud-server-02 3.10.0-123.20.1.el7.x86_64 #1 SMP Wed Jan 21 
> 09:45:55 EST 2015 x86_64 x86_64 x86_64 GNU/Linux).
>
> Two problems:
>
> 1)  The child process exits with value 1. I'm catching Throwable around 
> the primary execution call in the child process and logging it; nothing shows 
> up in the log files from that part of the code. From the parser log files (at 
> trace), I can tell which 10 files were being processed at the time, but I'm 
> not seeing any other information about what caused the exit.  When I run 
> against just those 10 files, all is ok.
>
> 2)  The OS is killing the child far more often than it does with 1.8.9 
> (exit code 137).
>
> For the second problem, I'll wait until the optimizations to the caching are 
> completed before I start worrying about that.  However, do you have any 
> recommendations on how to figure out what's going on with 1)?

I'm also having some problem with that system... with my test software, 
I have observed that java uses more and more space, despite it being 
told not to use more than a certain amount with -Xmx. After some time, 
the "process killer" kills the application.
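
One point that may help here: -Xmx only caps the Java heap, so the total memory of the process can be larger than that limit. A minimal sketch for logging heap and non-heap figures side by side, using the standard java.lang.management API, follows. The class name is hypothetical, and the non-heap figure reported by the JVM does not cover every native allocation (thread stacks and malloc'd buffers, for example), so the process size seen by the OS can still be higher.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public final class MemorySnapshot {
    private MemorySnapshot() {}

    /** Converts bytes to whole megabytes. */
    public static long mb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Formats the JVM's current heap and non-heap usage in one line. */
    public static String snapshot() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        MemoryUsage nonHeap = bean.getNonHeapMemoryUsage();
        // getMax() returns -1 when no maximum is defined; that prints as 0MB here.
        return String.format(
            "heap committed=%dMB max=%dMB; non-heap committed=%dMB",
            mb(heap.getCommitted()), mb(heap.getMax()), mb(nonHeap.getCommitted()));
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

Logging this periodically next to the resident size reported by the OS would show whether the growth is inside or outside the heap.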

Seems something changed in java memory management:
http://karunsubramanian.com/websphere/one-important-change-in-memory-management-in-java-8/

I did some investigation on this a few months ago, but gave up out of 
frustration.

Tilman

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org




