RE: PDFBox 2.0.9 release?
> PDFBOX-4153 is solved. How about cutting the release next Monday?

+1 and thank you!

Tim
RE: PDFBox 2.0.9 release?
Reports are available here: http://162.242.228.174/reports/pdfbox-2.0.9-pre-rc1_reports_2.tar.bz2
RE: PDFBox 2.0.9 release?
> ok => Tim, please start again

Will start now.
RE: PDFBox 2.0.9 release?
I'm happy to run the regression tests again when all final changes for 2.0.9-RC1 are made. I'm really excited to be able to include jbig2. We'll start the Tika release process for 1.18 as soon as PDFBox 2.0.9 is available. Thank you, all! Cheers, Tim
RE: PDFBox 2.0.9 release?
http://162.242.228.174/reports/pdfbox-2.0.9-pre-rc1_reports.tar.bz2

Looks good to me. Only 3 new exceptions (all on truncated files), more common words. No page diffs, no attachment diffs.

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, March 8, 2018 3:52 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.9 release?

On 08.03.2018 at 21:35, Allison, Timothy B. wrote:
> I've kicked off an initial run with 2.0.9-SNAPSHOT on the regression corpus. While I had some time, I wanted to see if there were any early indicators of problems.
>
> Tilman, I didn't mean to steal this task from you! We'll probably need another run once there's agreement that 2.0.9-SNAPSHOT is really, truly ready for rc1.

No problem. I've been too busy due to the many excellent patches we got in February and March, and now I'm somewhat exhausted. I'll be back in better shape on Saturday and will analyse the results.

Tilman

> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Wednesday, March 7, 2018 8:03 AM
> To: dev@pdfbox.apache.org
> Subject: RE: PDFBox 2.0.9 release?
>
> Argh. Sorry for my delay. Y. I have time, and I'm happy to help Tilman if he'd prefer to lead the regression testing process again.
>
> Cheers,
>
> Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: PDFBox 2.0.9 release?
I've kicked off an initial run with 2.0.9-SNAPSHOT on the regression corpus. While I had some time, I wanted to see if there were any early indicators of problems.

Tilman, I didn't mean to steal this task from you! We'll probably need another run once there's agreement that 2.0.9-SNAPSHOT is really, truly ready for rc1.

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, March 7, 2018 8:03 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 2.0.9 release?

Argh. Sorry for my delay. Y. I have time, and I'm happy to help Tilman if he'd prefer to lead the regression testing process again.

Cheers,

Tim
RE: PDFBox 2.0.9 release?
Argh. Sorry for my delay. Y. I have time, and I'm happy to help Tilman if he'd prefer to lead the regression testing process again.

Cheers,

Tim

-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Monday, March 5, 2018 1:28 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.9 release?

[resending, as my first attempt swallowed my second command due to wrong formatting]

On 04.03.2018 at 13:33, Tilman Hausherr wrote:
> I have the time and I should do it, because I lost my notes from last time, which had some hints and command lines that go beyond the documentation on the web. These notes were on a USB stick that is attached to my keyboard that is attached to my 2 PCs via a switch, so I could access these notes regardless of which PC is on. That USB stick was recently destroyed (thank you, KINGSTON!) by a static discharge likely related to the dry winter air.

Argh, I know what you mean. I have to fight with the Fedora update process from time to time :-(

> However I need a few days to finish the issues I am working on, and the issues targeted for 2.0.9. So Monday next week would be too early.

We are not in a hurry, take your time ...

Andreas

>
> Tilman
>
> On 04.03.2018 at 12:50, Andreas Lehmkuehler wrote:
>> Hi,
>>
>> now that we got the JBIG2 ImageIO out of the door it's time to release a new 2.0.x version of PDFBox.
>>
>> WDYT?
>>
>> @Tim, @Tilman
>> Do you have some time to run the regression test?
>>
>> Andreas
RE: Running tika-eval on the Rackspace vm
Great! Thank you, Tilman! I updated the wiki based on your feedback. Let me know if I should add anything else while the experience is fresh.

Best,

Tim

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, November 6, 2017 3:00 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm

I think I was successful, the report now makes sense, as if Tim had created it himself :-) The two issues I just created are related to a comparison between 2.0.8 and 2.0.4.

So for the next board report, we can now (in addition to the existing text) say that there is a second committer who can run the tests.

Tilman

On 05.11.2017 at 22:06, Tilman Hausherr wrote:
> I've come closer to finding out what's happening. I found out that tika-app was running with PDFBox 2.0.7 all the time, regardless of what pdfbox version is in the pom.xml.
>
> Apparently, building tika-app uses tika-parsers from the repository (instead of building tika-parsers again), which needs 2.0.7. Explicitly building tika-parsers before building tika-app helps.
>
> This is new to me, in PDFBox if one builds the app all dependencies are built as well.
>
> Tilman
>
> On 04.11.2017 at 14:48, Tilman Hausherr wrote:
>> So it's done:
>> /work/eval/pdfbox_2_0_4_Vs_2_0_8-SNAPSHOT_reports_03112017
>>
>> I wonder why the differences are so few, especially in meta where I KNOW that there are differences, due to the handling of empty strings with BOM. Maybe it is because I skipped the "A" phase and used existing data from a 2.0.4 run that I found, or because I use a current tika trunk and not the existing binary that was on the server.
>>
>> I'm thinking of creating a new "A" with 2.0.4 with current tika trunk and then comparing it with the "B" I did.
>>
>> Tilman
>>
>> On 03.11.2017 at 22:14, Tilman Hausherr wrote:
>>> On 03.11.2017 at 21:38, Allison, Timothy B. wrote:
>>>> I'm not sure what you mean by...sorry
>>>>> - "H" is missing, which is identical to "C"
>>>
>>> I just meant the steps in https://wiki.apache.org/tika/TikaEvalOnVM
>>>
>>> In segment 3, "execute: nohup ./appBatchExecutor.sh &" is missing. Of course it is obvious that it has to be done, but I am a perfectionist. I'd like to have this documentation for the "me" in a few months when I have forgotten what I did the last days. Or for the next person.
>>>
>>> Thanks for the fixes you did. I wonder why writing to /tmp didn't work - it did work from the command line. I've started the command again, I'm not sure when I will report about it. I'm a bit exhausted from non-software activities :-(
>>>
>>> Tilman
RE: Running tika-eval on the Rackspace vm
Tilman,

Thank you for the toe-stubbing. I'm sorry that it wasn't easier...

I created a new user with collab permissions and ran through the process.

You are right about the privileges on the tmp directory... POI needs a tmp directory to write xlsx. I created a tmp directory in /work/eval and added a direction to set tmp dir via -Djava.io.tmpdir=tmp

I'm not sure what you mean by...sorry
>- "H" is missing, which is identical to "C"

I updated the permissions on appBatchExecutor.sh

I also added a recommendation to umask g+rw before starting.

Let me know if I need to fix anything else or if I overlooked something you've already identified. ☹

Thank you, again.

Best,

Tim

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, November 2, 2017 5:47 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm

I'm almost done... then I got this when doing the last step:

[tilman@cloud-server-02 eval]$ java -jar tika-eval-1.17-SNAPSHOT.jar Report -db pdfboxAvsB
0 [main] INFO org.apache.tika.eval.reports.Report - Writing report: All Mimes In A to mimes/all_mimes_A.xlsx
Exception in thread "main" java.io.IOException: Permission denied
    at java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.io.File.createTempFile(File.java:2024)
    at org.apache.poi.util.DefaultTempFileCreationStrategy.createTempFile(DefaultTempFileCreationStrategy.java:110)
    at org.apache.poi.util.TempFile.createTempFile(TempFile.java:66)
    at org.apache.poi.xssf.streaming.SXSSFWorkbook.write(SXSSFWorkbook.java:924)
    at org.apache.tika.eval.reports.Report.dumpXLSX(Report.java:85)
    at org.apache.tika.eval.reports.Report.writeReport(Report.java:64)
    at org.apache.tika.eval.reports.ResultsReporter.execute(ResultsReporter.java:305)
    at org.apache.tika.eval.reports.ResultsReporter.main(ResultsReporter.java:266)
    at org.apache.tika.eval.TikaEvalCLI.handleReport(TikaEvalCLI.java:264)
    at org.apache.tika.eval.TikaEvalCLI.execute(TikaEvalCLI.java:52)
    at org.apache.tika.eval.TikaEvalCLI.main(TikaEvalCLI.java:273)

I changed the source, and now I got the path, it is /work/eval/reports/mimes/all_mimes_A.xlsx . The file exists and it is empty. I tried with a 1.16 version and the same happened.

Then I thought, maybe the file with the permission problem isn't the target at all; could this be some temp file / temp directory where I don't have permission?

smaller improvements for the documentation:
- appBatchExecutor.sh should have 775 permission or the documentation should have "nohup sh ./appBatchExecutor.sh &"
- "H" is missing, which is identical to "C"
- mention that "pdfboxAvsB" db files are to be removed before starting? I had accidentally aborted a run and couldn't restart.

Tilman

memo for me:
java -jar tika-eval-1.17-SNAPSHOT.jar Compare -extractsA /data4/batch_runs/pdfbox_2_0_4 -extractsB /data4/batch_runs/pdfbox_2_0_9-SNAPSHOT1 -db pdfboxAvsB
java -jar tika-eval-1.17-SNAPSHOT.jar Report -db pdfboxAvsB
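[Editor's note] The trace above fails inside java.io.File.createTempFile, which writes to the directory named by the java.io.tmpdir system property, so the suspicion about a temp directory fits. As a rough way to check a candidate directory before pointing -Djava.io.tmpdir at it, a minimal Python sketch follows. The function name can_create_temp_file and the candidate paths are illustrative assumptions; nothing here is part of tika-eval or POI.

```python
import os
import tempfile


def can_create_temp_file(directory):
    """Return True if the current user can create (and delete) a file in `directory`."""
    try:
        # NamedTemporaryFile creates the file exclusively and removes it on close,
        # which is roughly what a temp-file-based writer needs from the directory.
        with tempfile.NamedTemporaryFile(dir=directory, suffix=".probe"):
            return True
    except OSError:
        # Covers PermissionError, FileNotFoundError and similar failures.
        return False


if __name__ == "__main__":
    # Hypothetical candidates: the system default and a local "tmp" folder.
    for candidate in (tempfile.gettempdir(), "tmp"):
        print(candidate, can_create_temp_file(candidate))
```

Run as the same user that will start the Report step; a False result for a directory suggests that temp-file creation there would fail in the same way.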
RE: Running tika-eval on the Rackspace vm
Sorry. Fixed.

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, October 31, 2017 6:08 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm

On 31.10.2017 at 20:53, Allison, Timothy B. wrote:
>> It's not possible to rename / remove the files / directories mentioned in part 1 due to not having the permissions.
> Gah. Sorry. Tilman, I added you to "collab" and chgrp to collab on /work /data2/docs /data3/batch_runs and /data4/batch_runs.

But the directories themselves don't have "w" rights for group so I can't profit from my membership... (unless I missed something, I haven't done much *nix since the 90s)

For example I can't rename /work/batch-apps/tika_working/logs to /work/batch-apps/tika_working/___logs .

Tilman

>
>> The directory is named batch-apps, not batch_apps.
> Fixed. Thank you.
>
>> Re the "A" version - is this the "good" version, so I could simply download tika-app and put it there? Or just build tika with a specific PDFBox version?
> If the current version of tika-app has the right version of PDFBox for your "before" examples, then y, you can just download tika-app.jar. We release less frequently than PDFBox, so it's possible that you'll want to build from scratch with the most recent previous release of PDFBox.
>
> In my mind, A is the "before/baseline" version and B is the SNAPSHOT/RC version. So, hopefully, B is the "good" one. 😊
>
> Let me know what other problems you encounter.
>
> Cheers,
>
> Tim
RE: Running tika-eval on the Rackspace vm
> It's not possible to rename / remove the files / directories mentioned in part 1 due to not having the permissions.

Gah. Sorry. Tilman, I added you to "collab" and chgrp to collab on /work /data2/docs /data3/batch_runs and /data4/batch_runs.

> The directory is named batch-apps, not batch_apps.

Fixed. Thank you.

> Re the "A" version - is this the "good" version, so I could simply download tika-app and put it there? Or just build tika with a specific PDFBox version?

If the current version of tika-app has the right version of PDFBox for your "before" examples, then y, you can just download tika-app.jar. We release less frequently than PDFBox, so it's possible that you'll want to build from scratch with the most recent previous release of PDFBox.

In my mind, A is the "before/baseline" version and B is the SNAPSHOT/RC version. So, hopefully, B is the "good" one. 😊

Let me know what other problems you encounter.

Cheers,

Tim
RE: Running tika-eval on the Rackspace vm
Will fix both. Thank you!

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, October 30, 2017 4:21 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm

It's not possible to rename / remove the files / directories mentioned in part 1 due to not having the permissions.

Tilman

On 30.10.2017 at 14:14, Tilman Hausherr wrote:
> I almost had some time today, so I had a look at https://wiki.apache.org/tika/TikaEvalOnVM
>
> The directory is named batch-apps, not batch_apps.
>
> Re the "A" version - is this the "good" version, so I could simply download tika-app and put it there? Or just build tika with a specific PDFBox version?
>
> Tilman
>
> On 23.10.2017 at 20:54, Allison, Timothy B. wrote:
>> All,
>>
>> If anyone would like to join the fun in running tika-eval on the Rackspace vm, I posted this: https://wiki.apache.org/tika/TikaEvalOnVM . You’ll need access to the vm, of course, but I’m happy to grant that to anyone who wants to chip in and help with regression tests. There are some areas for improvements in the process and documentation. 😊
>>
>> Cheers,
>>
>> Tim
>>
>> P.S. For those who used the vm earlier and found it wonky, it was indeed wonky because I had failed to add a swap file. With that change in place, the vm works quite well.
RE: [VOTE] Release Apache PDFBox 2.0.8
+1

Thank you, Andreas, Tilman, and team!

Cheers,

Tim

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, October 30, 2017 3:57 PM
To: dev@pdfbox.apache.org
Subject: Re: [VOTE] Release Apache PDFBox 2.0.8

On 30.10.2017 at 19:47, Andreas Lehmkuehler wrote:
> Hi,
>
> a candidate for the PDFBox 2.0.8 release is available at:
>
> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.8/
>
> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/pdfbox/tags/2.0.8/
>
> The SHA1 checksum of the archive is 5c0607144dde1b7af3dd428cafbd2c9c29617ab3.
>
> Please vote on releasing this package as Apache PDFBox 2.0.8.
> The vote is open for the next 72 hours and passes if a majority of at least three +1 PDFBox PMC votes are cast.
>
> [ ] +1 Release this package as Apache PDFBox 2.0.8
> [ ] -1 Do not release this package because...
>
> Here is my +1

+1

Tilman
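[Editor's note] Voters usually compare the announced SHA1 against the archive they downloaded before casting a vote. A minimal Python sketch of that check follows, using only the standard hashlib module. The helper names sha1_of_file and matches_announced_checksum are made up for illustration, and the local file name of the archive is not given in the thread, so it would have to be supplied by the reader.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_announced_checksum(path, announced):
    """Compare case-insensitively, ignoring surrounding whitespace."""
    return sha1_of_file(path) == announced.strip().lower()
```

Calling matches_announced_checksum with the downloaded archive path and the 40-character value from the vote mail returns True only when the two digests agree.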
RE: 2.0.8?
Results: http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take5.tar.gz

Haven't had a chance to review, nor have I had a chance to add the extra columns I promised. ☹

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, October 26, 2017 1:26 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 26.10.2017 at 19:12, Andreas Lehmkuehler wrote:
> Thanks Tim, looked promising.
>
> I'm planning to cut my second attempt next Monday, if no one objects.

+1

Tilman

>
> @Tim I don't expect any new regressions, but if you have some cycles, you might kick off another run.
>
> Andreas
>
> On 23.10.2017 at 20:11, Allison, Timothy B. wrote:
>> Reports here:
>> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take4.tar.gz
>>
>> I haven't looked yet.
>>
>> -----Original Message-----
>> From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
>> Sent: Sunday, October 22, 2017 4:15 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: 2.0.8?
>>
>> @Tim I've fixed the last open regression in 2.0.8, Tilman's test run hasn't shown any regression. Please re-run your tests again to see if we can proceed with 2.0.8, I'd really like to push it out.
>>
>> TIA again,
>> Andreas
>>
>> On 08.10.2017 at 16:11, Andreas Lehmkuehler wrote:
>>> On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>>>>
>>>>> And yes, we need another regressions run if possible
>>>>
>>>> Sounds good. Will do once I hear that we're good to go. Thank you!
>>> We are good now.
>>>
>>> @Tim: Could you please re-run your test to see how good we are?
>>>
>>> TIA,
>>> Andreas
RE: 2.0.8?
+1

Will do.

-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Thursday, October 26, 2017 1:12 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

Thanks Tim, looked promising.

I'm planning to cut my second attempt next Monday, if no one objects.

@Tim I don't expect any new regressions, but if you have some cycles, you might kick off another run.

Andreas

On 23.10.2017 at 20:11, Allison, Timothy B. wrote:
> Reports here:
> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take4.tar.gz
>
> I haven't looked yet.
>
> -----Original Message-----
> From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
> Sent: Sunday, October 22, 2017 4:15 PM
> To: dev@pdfbox.apache.org
> Subject: Re: 2.0.8?
>
> @Tim I've fixed the last open regression in 2.0.8, Tilman's test run hasn't shown any regression. Please re-run your tests again to see if we can proceed with 2.0.8, I'd really like to push it out.
>
> TIA again,
> Andreas
>
> On 08.10.2017 at 16:11, Andreas Lehmkuehler wrote:
>> On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>>>
>>>> And yes, we need another regressions run if possible
>>>
>>> Sounds good. Will do once I hear that we're good to go. Thank you!
>> We are good now.
>>
>> @Tim: Could you please re-run your test to see how good we are?
>>
>> TIA,
>> Andreas
Running tika-eval on the Rackspace vm
All,

If anyone would like to join the fun in running tika-eval on the Rackspace vm, I posted this: https://wiki.apache.org/tika/TikaEvalOnVM . You’ll need access to the vm, of course, but I’m happy to grant that to anyone who wants to chip in and help with regression tests. There are some areas for improvements in the process and documentation. 😊

Cheers,

Tim

P.S. For those who used the vm earlier and found it wonky, it was indeed wonky because I had failed to add a swap file. With that change in place, the vm works quite well.
RE: 2.0.8?
Reports here: http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take4.tar.gz

I haven't looked yet.

-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, October 22, 2017 4:15 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

@Tim I've fixed the last open regression in 2.0.8, Tilman's test run hasn't shown any regression. Please re-run your tests again to see if we can proceed with 2.0.8, I'd really like to push it out.

TIA again,
Andreas

On 08.10.2017 at 16:11, Andreas Lehmkuehler wrote:
> On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>>
>>> And yes, we need another regressions run if possible
>>
>> Sounds good. Will do once I hear that we're good to go. Thank you!
> We are good now.
>
> @Tim: Could you please re-run your test to see how good we are?
>
> TIA,
> Andreas
RE: 2.0.8?
Kicked off process.

-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, October 22, 2017 4:15 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

@Tim I've fixed the last open regression in 2.0.8, Tilman's test run hasn't shown any regression. Please re-run your tests again to see if we can proceed with 2.0.8, I'd really like to push it out.

TIA again,
Andreas

On 08.10.2017 at 16:11, Andreas Lehmkuehler wrote:
> On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>>
>>> And yes, we need another regressions run if possible
>>
>> Sounds good. Will do once I hear that we're good to go. Thank you!
> We are good now.
>
> @Tim: Could you please re-run your test to see how good we are?
>
> TIA,
> Andreas
RE: 2.0.8?
>> However, PDFBox 2.0.8-SNAPSHOT has more 0s, 1s, 2s and 3s...
>>
>> The TOP_10_MORE_IN_B column in the contents report shows that there are 15 more 0's, 15 more 1's, 11 more 2's etc.
>>
>> 0: 15 | 1: 15 | 2: 11 | 20: 5 | 3: 2 | 4: 2

> Yeah but where do they come from? Not from the pure text extraction. In the json files, I see that there are many "0:", "1:" in the new file. I wonder if this is about acroform fields? Can be seen e.g. near b12c96nfdate36.

Sorry, right, AcroForm. We're now getting some children we weren't before.

2.0.8-SNAPSHOT:
@@b12c96nfdate362: 0: 1: 2: 20 b12c96nfdate362: 20

2.0.7:
@@b12c96nfdate362: b12c96nfdate362: 20
RE: 2.0.8?
If we're talking about the same file...same number of pages, attachments and common words.

However, PDFBox 2.0.8-SNAPSHOT has more 0s, 1s, 2s and 3s...

The TOP_10_MORE_IN_B column in the contents report shows that there are 15 more 0's, 15 more 1's, 11 more 2's etc.

0: 15 | 1: 15 | 2: 11 | 20: 5 | 3: 2 | 4: 2

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, October 10, 2017 11:47 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 09.10.2017 at 22:26, Allison, Timothy B. wrote:
> Thank you, Andreas, for fixing the slow parse on corrupt file so quickly!
>
> Reports are here:
> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz
>

Tim, can you please find out what we lost with 254348.pdf? It's not in the text extraction, so I assume it's some meta data but I don't see where.

Tilman
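[Editor's note] A "more in B" figure like the one quoted above can be thought of as a per-token count difference between the two extracts. The Python sketch below illustrates the idea only; it is not tika-eval's implementation, whose tokenization and ranking are not described in this thread, and the function name top_more_in_b is hypothetical.

```python
from collections import Counter


def top_more_in_b(tokens_a, tokens_b, n=10):
    """Return up to n (token, extra_count) pairs for tokens more frequent in B than in A."""
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)
    # Counter subtraction keeps only positive differences,
    # so tokens that are equally or less frequent in B drop out.
    extra = counts_b - counts_a
    return extra.most_common(n)
```

For example, passing a token list without the numeric form-field children as A and one with them as B would surface those extra "0", "1", "2" tokens, each with the number of additional occurrences.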
RE: 2.0.8?
Sorry. I just saw this. I ln'd the json extracts so that you can pull them easily:

http://162.242.228.174/extracts/pdfbox_2_0_7/
http://162.242.228.174/extracts/pdfbox_2_0_8-SNAPSHOT/

I'll take a look at 254348.pdf

-----Original Message-----
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, October 10, 2017 11:47 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 09.10.2017 at 22:26, Allison, Timothy B. wrote:
> Thank you, Andreas, for fixing the slow parse on corrupt file so quickly!
>
> Reports are here:
> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz
>

Tim, can you please find out what we lost with 254348.pdf? It's not in the text extraction, so I assume it's some meta data but I don't see where.

Tilman
RE: 2.0.8?
Apologies, but I haven't gotten around to adding the exception columns in the content comparison tables, including the "page count diffs" table. I also haven't had a chance to read/make sense of the reports yet, but I wanted to share asap.

Best,

Tim

-----Original Message-----
From: Allison, Timothy B.
Sent: Monday, October 9, 2017 4:26 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?

Thank you, Andreas, for fixing the slow parse on corrupt file so quickly!

Reports are here: http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, October 9, 2017 8:02 AM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?

Starting process now.

-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, October 8, 2017 10:12 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>
>> And yes, we need another regressions run if possible
>
> Sounds good. Will do once I hear that we're good to go. Thank you!

We are good now.

@Tim: Could you please re-run your test to see how good we are?

TIA,
Andreas
RE: 2.0.8?
Thank you, Andreas, for fixing the slow parse on corrupt file so quickly!

Reports are here: http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, October 9, 2017 8:02 AM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?

Starting process now.

-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, October 8, 2017 10:12 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>
>> And yes, we need another regressions run if possible
>
> Sounds good. Will do once I hear that we're good to go. Thank you!

We are good now.

@Tim: Could you please re-run your test to see how good we are?

TIA,
Andreas
RE: 2.0.8?
Starting process now.

-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, October 8, 2017 10:12 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?

On 03.10.2017 at 15:38, Allison, Timothy B. wrote:
>
>> And yes, we need another regressions run if possible
>
> Sounds good. Will do once I hear that we're good to go. Thank you!

We are good now.

@Tim: Could you please re-run your test to see how good we are?

TIA,
Andreas
RE: 2.0.8?
> And yes, we need another regressions run if possible

Sounds good. Will do once I hear that we're good to go. Thank you!
RE: 2.0.8?
>> Let me know when we're ready for another round.
> I've already started ...

RC2? No need for another regression run?

Thank you again!
Re: 2.0.8?
All,

Again, my apologies for post-useful/late results! Ugh...

Thank you, Andreas and Tilman!

Let me know when we're ready for another round.

Cheers,

Tim

-----Original Message-----
From: Andreas Lehmkühler (JIRA) [mailto:j...@apache.org]
Sent: Tuesday, October 3, 2017 8:23 AM
To: dev@pdfbox.apache.org
Subject: [jira] [Resolved] (PDFBOX-3949) NPE in bfSearchForObjStreams

[ https://issues.apache.org/jira/browse/PDFBOX-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-3949.

Resolution: Fixed
Fix Version/s: 3.0.0, 2.0.8

I've optimized the brute force search for object streams. Thanks [~talli...@mitre.org] and [~tilman] for the finding

> NPE in bfSearchForObjStreams
>
> Key: PDFBOX-3949
> URL: https://issues.apache.org/jira/browse/PDFBOX-3949
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.8
> Reporter: Tilman Hausherr
> Assignee: Andreas Lehmkühler
> Labels: regression
> Fix For: 2.0.8, 3.0.0
> Attachments: MKFYUGZWS3OPXLLVU2Z4LWCTVA5WNOGF.pdf
>
> {code}
> java.lang.NullPointerException: null
>     org.apache.pdfbox.pdfparser.COSParser.bfSearchForObjStreams(COSParser.java:1738)
>     org.apache.pdfbox.pdfparser.COSParser.bfSearchForObjects(COSParser.java:1529)
>     org.apache.pdfbox.pdfparser.COSParser.getBFCOSObjectOffsets(COSParser.java:1445)
>     org.apache.pdfbox.pdfparser.COSParser.rebuildTrailer(COSParser.java:1905)
> {code}
> This worked in 2.0.7. The exception happens in 39 files.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
RE: 2.0.8?
> Re 308576.pdf: the text extraction has a huge loss, but a manual check shows it is identical. However that file has the NPE from PDActionURI.getURI(), could it be that this results in an abort of text extraction? Same for 569017.pdf.

Likely. There are two "per file pair contents" files. The one ending with "_ignore_exceptions.xlsx" means that results are not reported if there was an exception caught for one of the files (308576.pdf and 569017.pdf aren't in that file). The other one "*_with_exceptions" includes both.

Based on your feedback, I should add 2 boolean cols to "*_with_exceptions.xlsx" for exceptionInA and exceptionInB?
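[Editor's note] The relationship between the two report files, together with the proposed exceptionInA and exceptionInB columns, can be sketched as a simple row filter. The Python below is an assumption-laden illustration: the row layout (a dict per file pair) is invented here, and only the two column names come from the message above.

```python
def split_reports(rows):
    """Split file-pair rows into (with_exceptions, ignore_exceptions) lists.

    Each row is assumed to be a dict carrying boolean "exceptionInA" and
    "exceptionInB" entries; that layout is hypothetical.
    """
    # The "with exceptions" view keeps every file pair.
    with_exceptions = list(rows)
    # The "ignore exceptions" view drops a pair if either side threw.
    ignore_exceptions = [
        row for row in rows
        if not row["exceptionInA"] and not row["exceptionInB"]
    ]
    return with_exceptions, ignore_exceptions
```

With the two flags present in the full table, a reader could tell at a glance whether a large content loss for a file pair coincides with an exception on either side.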
RE: 2.0.8?
Sorry all for taking longer than expected! File under "this information would have been useful..." ☹ -Original Message- From: Allison, Timothy B. Sent: Monday, October 2, 2017 3:59 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.8? Reports are here: http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take2.tar.gz Looks like some new NPEs. I'll take a look at the metadata diffs. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, October 2, 2017 9:24 AM To: dev@pdfbox.apache.org Subject: RE: 2.0.8? Sounds good. I kicked off the eval process yesterday, but because of a bug in our config-file reader and/or user error in modifying the config file, I wound up with 500k pdfs parsed by our EmptyParser -- no results. I restarted the eval process just now. I should have results in 6 hours. -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Sunday, October 1, 2017 6:31 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.8? Am 25.09.2017 um 18:39 schrieb Andreas Lehmkuehler: > Am 25.09.2017 um 12:30 schrieb Maruan Sahyoun: >> Hi, >>>> Andreas Lehmkuehler hat am 13. September 2017 um >>>> 20:33 >>>> geschrieben: >>>> >>>> >>>> Due to the responses I'm planning to cut the release on Monday the >>>> 25th >>> >>> I'm still working on a solution for PDFBOX-3934 to avoid the >>> regression with PDFBOX-3318. Should we postpone the release for a >>> couple of days or a week max? Or should I simply revert my changes? >> >> I'd go for postponing in order to fix that regression - what about >> setting the date to next Monday? > OK, let's postpone, I'm targeting next Monday. Thanks for your > patience ;-) Just a friendly reminder, I'm going to cut the release in about 30 hours from now. Andreas > > Andreas >> >> BR >> Maruan >> >>> >>> WDYT?
>>> > > - > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For > additional commands, e-mail: dev-h...@pdfbox.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: 2.0.8?
Reports are here: http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take2.tar.gz Looks like some new NPEs. I'll take a look at the metadata diffs. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, October 2, 2017 9:24 AM To: dev@pdfbox.apache.org Subject: RE: 2.0.8? Sounds good. I kicked off the eval process yesterday, but because of a bug in our config-file reader and/or user error in modifying the config file, I wound up with 500k pdfs parsed by our EmptyParser -- no results. I restarted the eval process just now. I should have results in 6 hours. -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Sunday, October 1, 2017 6:31 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.8? Am 25.09.2017 um 18:39 schrieb Andreas Lehmkuehler: > Am 25.09.2017 um 12:30 schrieb Maruan Sahyoun: >> Hi, >>>> Andreas Lehmkuehler hat am 13. September 2017 um >>>> 20:33 >>>> geschrieben: >>>> >>>> >>>> Due to the responses I'm planning to cut the release on Monday the >>>> 25th >>> >>> I'm still working on a solution for PDFBOX-3934 to avoid the >>> regression with PDFBOX-3318. Should we postpone the release for a >>> couple of days or a week max? Or should I simply revert my changes? >> >> I'd go for postponing in order to fix that regression - what about >> setting the date to next Monday? > OK, let's postpone, I'm targeting next Monday. Thanks for your > patience ;-) Just a friendly reminder, I'm going to cut the release in about 30 hours from now. Andreas > > Andreas >> >> BR >> Maruan >> >>> >>> WDYT? >>> > > - > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For > additional commands, e-mail: dev-h...@pdfbox.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: 2.0.8?
Sounds good. I kicked off the eval process yesterday, but because of a bug in our config-file reader and/or user error in modifying the config file, I wound up with 500k pdfs parsed by our EmptyParser -- no results. I restarted the eval process just now. I should have results in 6 hours. -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Sunday, October 1, 2017 6:31 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.8? Am 25.09.2017 um 18:39 schrieb Andreas Lehmkuehler: > Am 25.09.2017 um 12:30 schrieb Maruan Sahyoun: >> Hi, Andreas Lehmkuehler hat am 13. September 2017 um 20:33 geschrieben: Due to the responses I'm planning to cut the release on Monday the 25th >>> >>> I'm still working on a solution for PDFBOX-3934 to avoid the >>> regression with PDFBOX-3318. Should we postpone the release for a >>> couple of days or a week max? Or should I simply revert my changes? >> >> I'd go for postponing in order to fix that regression - what about >> setting the date to next Monday? > OK, let's postpone, I'm targeting next Monday. Thanks for your > patience ;-) Just a friendly reminder, I'm going to cut the release in about 30 hours from now. Andreas > > Andreas >> >> BR >> Maruan >> >>> >>> WDYT? >>> > > - > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For > additional commands, e-mail: dev-h...@pdfbox.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: 2.0.8?
> I'd go for postponing in order to fix that regression - what about setting > the date to next Monday? +1 I’m happy pushing it out later if the fix happens >= Friday and we want to run the full regression tests again. Thank you, Andreas! - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: 2.0.8?
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8-SNAPSHOT_reports.tar.gz is now available. I haven't yet had a chance to look at either... -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, September 18, 2017 12:51 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.8? Reports for 2.0.4 vs 2.0.8-SNAPSHOT (r1808067) are available: http://162.242.228.174/reports/pdfbox_2_0_4_Vs_2_0_8-SNAPSHOT_reports.tar.gz I'll post 2.0.7 vs 2.0.8-SNAPSHOT in the next few hours. -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Wednesday, September 13, 2017 2:33 PM To: dev@pdfbox.apache.org Subject: Re: 2.0.8? Due to the responses I'm planning to cut the release on Monday the 25th Andreas Am 12.09.2017 um 06:43 schrieb Andreas Lehmkuehler: > Good idea, there are already a lot of solved tickets for 2.0.8 > > @all Is there anything pending which should be included? > > How about cutting the release in a week or two from now? > > @Tim please run a test 2.0.7 vs. 2.0.8 if possible > > Andreas > > Am 11.09.2017 um 23:24 schrieb Allison, Timothy B.: >>> I hope there aren't any new regressions. >> >> Happy to help find them! :) >> >> On a related note, do we have a sense of the schedule for PDFBox >> 2.0.8? I'd like to include it in Tika's last Java 7 release...end of >> Sept, middle of Oct., or whenever 2.0.8 is out. :) >> >> >> -Original Message- >> From: Andreas Lehmkühler (JIRA) [mailto:j...@apache.org] >> Sent: Monday, September 11, 2017 4:52 PM >> To: dev@pdfbox.apache.org >> Subject: [jira] [Comment Edited] (PDFBOX-3928) >> IllegalArgumentException: root cannot be null with truncated file >> >> >> [ >> https://issues.apache.org/jira/browse/PDFBOX-3928?page=com.atlassian. 
>> jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1 >> 6161965#comment-16161965 >> ] >> >> Andreas Lehmkühler edited comment on PDFBOX-3928 at 9/11/17 8:51 PM: >> - >> >> Both case are tricky (PDFBOX-3798 is truncated within an object and >> the attached pdf has a truncated xref table), so that I had to >> improve the brute force search one more time. >> [~tilman] thanks for the finding. I hope there aren't any new regressions. >> >> >> was (Author: lehmi): >> Both case are tricky, so that I had to improve the brute force search >> one more time. >> [~tilman] thanks for the finding. I hope there aren't any new regressions. >> >>> IllegalArgumentException: root cannot be null with truncated file >>> - >>> >>> Key: PDFBOX-3928 >>> URL: >>> https://issues.apache.org/jira/browse/PDFBOX-3928 >>> Project: PDFBox >>> Issue Type: Bug >>> Components: Parsing >>> Affects Versions: 2.0.7 >>> Reporter: Tilman Hausherr >>> Assignee: Andreas Lehmkühler >>> Labels: regression >>> Fix For: 2.0.8, 3.0.0 >>> >>> Attachments: 023505.pdf >>> >>> >>> {code} >>> java.lang.IllegalArgumentException: root cannot be null >>> org.apache.pdfbox.pdmodel.PDPageTree.(PDPageTree.java:75) >>> >>> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatal >>> og.java:129) >>> >>> org.apache.pdfbox.pdmodel.PDDocument.getPages(PDDocument.java:1388) >>> >>> org.apache.pdfbox.debugger.ui.DocumentEntry.getPageCount(DocumentEnt >>> ry.java:42) >>> >>> org.apache.pdfbox.debugger.ui.PDFTreeModel.getChildCount(PDFTreeMode >>> l.java:195) >>> java.desktop/java.beans.PropertyChangeSupport.fire(Unknown >>> Source) >>> >>> java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unk >>> nown >>> Source) >>> >>> java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unk >>> nown >>> Source) >>> >>> org.apache.pdfbox.debugger.PDFDebugger.initTree(PDFDebugger.java:128 >>> 8) >>> >>> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java: >>> 1235) >>> 
>>> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java: >>> 12
RE: 2.0.8?
Reports for 2.0.4 vs 2.0.8-SNAPSHOT (r1808067) are available: http://162.242.228.174/reports/pdfbox_2_0_4_Vs_2_0_8-SNAPSHOT_reports.tar.gz I'll post 2.0.7 vs 2.0.8-SNAPSHOT in the next few hours. -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Wednesday, September 13, 2017 2:33 PM To: dev@pdfbox.apache.org Subject: Re: 2.0.8? Due to the responses I'm planning to cut the release on Monday the 25th Andreas Am 12.09.2017 um 06:43 schrieb Andreas Lehmkuehler: > Good idea, there are already a lot of solved tickets for 2.0.8 > > @all Is there anything pending which should be included? > > How about cutting the release in a week or two from now? > > @Tim please run a test 2.0.7 vs. 2.0.8 if possible > > Andreas > > Am 11.09.2017 um 23:24 schrieb Allison, Timothy B.: >>> I hope there aren't any new regressions. >> >> Happy to help find them! :) >> >> On a related note, do we have a sense of the schedule for PDFBox >> 2.0.8? I'd like to include it in Tika's last Java 7 release...end of >> Sept, middle of Oct., or whenever 2.0.8 is out. :) >> >> >> -Original Message- >> From: Andreas Lehmkühler (JIRA) [mailto:j...@apache.org] >> Sent: Monday, September 11, 2017 4:52 PM >> To: dev@pdfbox.apache.org >> Subject: [jira] [Comment Edited] (PDFBOX-3928) >> IllegalArgumentException: root cannot be null with truncated file >> >> >> [ >> https://issues.apache.org/jira/browse/PDFBOX-3928?page=com.atlassian. >> jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1 >> 6161965#comment-16161965 >> ] >> >> Andreas Lehmkühler edited comment on PDFBOX-3928 at 9/11/17 8:51 PM: >> - >> >> Both case are tricky (PDFBOX-3798 is truncated within an object and >> the attached pdf has a truncated xref table), so that I had to >> improve the brute force search one more time. >> [~tilman] thanks for the finding. I hope there aren't any new regressions. 
>> >> >> was (Author: lehmi): >> Both case are tricky, so that I had to improve the brute force search >> one more time. >> [~tilman] thanks for the finding. I hope there aren't any new regressions. >> >>> IllegalArgumentException: root cannot be null with truncated file >>> - >>> >>> Key: PDFBOX-3928 >>> URL: >>> https://issues.apache.org/jira/browse/PDFBOX-3928 >>> Project: PDFBox >>> Issue Type: Bug >>> Components: Parsing >>> Affects Versions: 2.0.7 >>> Reporter: Tilman Hausherr >>> Assignee: Andreas Lehmkühler >>> Labels: regression >>> Fix For: 2.0.8, 3.0.0 >>> >>> Attachments: 023505.pdf >>> >>> >>> {code} >>> java.lang.IllegalArgumentException: root cannot be null >>> org.apache.pdfbox.pdmodel.PDPageTree.(PDPageTree.java:75) >>> >>> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatal >>> og.java:129) >>> >>> org.apache.pdfbox.pdmodel.PDDocument.getPages(PDDocument.java:1388) >>> >>> org.apache.pdfbox.debugger.ui.DocumentEntry.getPageCount(DocumentEnt >>> ry.java:42) >>> >>> org.apache.pdfbox.debugger.ui.PDFTreeModel.getChildCount(PDFTreeMode >>> l.java:195) >>> java.desktop/java.beans.PropertyChangeSupport.fire(Unknown >>> Source) >>> >>> java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unk >>> nown >>> Source) >>> >>> java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unk >>> nown >>> Source) >>> >>> org.apache.pdfbox.debugger.PDFDebugger.initTree(PDFDebugger.java:128 >>> 8) >>> >>> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java: >>> 1235) >>> >>> org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java: >>> 1218) >>> >>> org.apache.pdfbox.debugger.PDFDebugger.main(PDFDebugger.java:1209) >>> org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:85) >>> {code} >>> This worked in 2.0.6, but no longer in 2.0.7. It happens since [ >>> https://svn.apache.org/r1795705 ] of PDFBOX-3798. >> >
RE: 2.0.8?
> because I'm ill but I expect to be my old self later this week. I'm sorry to hear it! I hope that you are feeling better soon! > I'd also like to have a test from version 2.0.4 compared to trunk because > 2.0.5 was the version where the tests weren't done, the problems were fixed in > 2.0.6 but at that time we tested only 2.0.5 against 2.0.6. I was just thinking the same thing, but without the specific versions in mind. :) Great idea. Will do over the next week... - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
2.0.8?
>I hope there aren't any new regressions. Happy to help find them! :) On a related note, do we have a sense of the schedule for PDFBox 2.0.8? I'd like to include it in Tika's last Java 7 release...end of Sept, middle of Oct., or whenever 2.0.8 is out. :) -Original Message- From: Andreas Lehmkühler (JIRA) [mailto:j...@apache.org] Sent: Monday, September 11, 2017 4:52 PM To: dev@pdfbox.apache.org Subject: [jira] [Comment Edited] (PDFBOX-3928) IllegalArgumentException: root cannot be null with truncated file [ https://issues.apache.org/jira/browse/PDFBOX-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161965#comment-16161965 ] Andreas Lehmkühler edited comment on PDFBOX-3928 at 9/11/17 8:51 PM: - Both case are tricky (PDFBOX-3798 is truncated within an object and the attached pdf has a truncated xref table), so that I had to improve the brute force search one more time. [~tilman] thanks for the finding. I hope there aren't any new regressions. was (Author: lehmi): Both case are tricky, so that I had to improve the brute force search one more time. [~tilman] thanks for the finding. I hope there aren't any new regressions. 
> IllegalArgumentException: root cannot be null with truncated file > - > > Key: PDFBOX-3928 > URL: https://issues.apache.org/jira/browse/PDFBOX-3928 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.7 >Reporter: Tilman Hausherr >Assignee: Andreas Lehmkühler > Labels: regression > Fix For: 2.0.8, 3.0.0 > > Attachments: 023505.pdf > > > {code} > java.lang.IllegalArgumentException: root cannot be null > org.apache.pdfbox.pdmodel.PDPageTree.(PDPageTree.java:75) > > org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:129) > org.apache.pdfbox.pdmodel.PDDocument.getPages(PDDocument.java:1388) > > org.apache.pdfbox.debugger.ui.DocumentEntry.getPageCount(DocumentEntry.java:42) > > org.apache.pdfbox.debugger.ui.PDFTreeModel.getChildCount(PDFTreeModel.java:195) > java.desktop/java.beans.PropertyChangeSupport.fire(Unknown Source) > java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unknown > Source) > java.desktop/java.beans.PropertyChangeSupport.firePropertyChange(Unknown > Source) > org.apache.pdfbox.debugger.PDFDebugger.initTree(PDFDebugger.java:1288) > org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1235) > org.apache.pdfbox.debugger.PDFDebugger.readPDFFile(PDFDebugger.java:1218) > org.apache.pdfbox.debugger.PDFDebugger.main(PDFDebugger.java:1209) > org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:85) > {code} > This worked in 2.0.6, but no longer in 2.0.7. It happens since [ > https://svn.apache.org/r1795705 ] of PDFBOX-3798. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively
Thank you, Maruan! I opened PDFBOX-3898 after breaking out the spec...I may be misreading it, tho! -Original Message- From: Maruan Sahyoun [mailto:sahy...@fileaffairs.de] Sent: Tuesday, August 15, 2017 11:58 AM To: dev@pdfbox.apache.org Subject: Re: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively Hi Tim, > Am 15.08.2017 um 17:31 schrieb Allison, Timothy B. : > > All, > I can't tell if the triggering file is corrupt or how we want to handle it > on the PDFBox side. The problem is that the parent node is a PDTextField -- > a PDTerminalField -- so we don't/can't look for children, even though it > actually does have pointers in Kids. I had a quick look with the debugger and the file looks fine. There is nothing wrong with a non-terminal field having a field type /FT while its kids (the terminal fields) do not have one. In that case the kids take the field type from the parent. Which version of PDFBox is Tika 1.14 on? BR Maruan > > The output from PrintFields is: > > 1 top-level fields were found on the form > |--parent.parent = , > |type=org.apache.pdfbox.pdmodel.interactive.form.PDTextField > > -Original Message- > From: Tim Allison (JIRA) [mailto:j...@apache.org] > Sent: Monday, August 14, 2017 10:36 AM > To: d...@tika.apache.org > Subject: [jira] [Commented] (TIKA-2442) Non-terminal interactive form > fields not handled recursively > > >[ > https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jir > a.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125 > 756#comment-16125756 ] > >> Non-terminal interactive form fields not handled recursively >> >> >>Key: TIKA-2442 >>URL: https://issues.apache.org/jira/browse/TIKA-2442 >>Project: Tika >> Issue Type: Bug >> Components: parser >> Affects Versions: 1.14 >> Reporter: Christopher Creutzig >>Attachments: simple-form.pdf >> >> >> (I am not sure if this is a Tika or a PDFBox problem; I tried finding >> a form extractor in PDFBox, but the app api does not have one. 
PDFDebugger >> does show me the expected tree structure.) The attached PDF has a >> non-terminal field named “parent” and two children, “child1” and “child2.” >> According to the PDF spec in section 8.6, the fully qualified field names >> should be parent.child1 and parent.child2. That is the output given by pdftk: >>> pdftk simple-form.pdf dump_data_fields >> --- >> FieldType: Text >> FieldName: parent.child1 >> FieldFlags: 0 >> FieldValue: child1 value >> FieldJustification: Left >> --- >> FieldType: Text >> FieldName: parent.child2 >> FieldFlags: 0 >> FieldValue: child2 value >> FieldJustification: Left >> Tika with the ToXMLContentHandler seems to silently ignore the children, >> however, returning only a parent with no value. >> Calling code: >> import java.io.FileInputStream; >> import org.apache.tika.detect.DefaultDetector; >> import org.apache.tika.detect.Detector; import >> org.apache.tika.metadata.Metadata; >> import org.apache.tika.parser.AutoDetectParser; >> import org.apache.tika.parser.ParseContext; >> import org.apache.tika.parser.Parser; import >> org.apache.tika.parser.PasswordProvider; >> import org.apache.tika.sax.ToXMLContentHandler; >> class readAsXHTML { >> public static String readAsXHTML(String filename) throws Exception { >>ToXMLContentHandler handler = new ToXMLContentHandler(); >>Detector detector = new DefaultDetector(); >>Parser parser = new AutoDetectParser(detector); >>ParseContext context = new ParseContext(); >>Metadata metadata = new Metadata(); >>FileInputStream fh = null; >>final String pass = password; >>try { >> fh = new FileInputStream(filename); >> parser.parse(fh, handler, metadata, context); >> >> return(handler.toString()); >>} >>finally { >> if (fh != null) { >>fh.close(); >> } >>} >> } >> } >> Abbreviated output: >> >> >>parent: >> >> >> >> Expected: >> >> >> >> parent.child1: child1 value >> parent.child2: child2 value > > > > -- > This message was sent by Atlassian JIRA > (v6.4.14#64029) > > - > To unsubscribe, e-mail: 
dev-unsubscr...@pdfbox.apache.org For > additional commands, e-mail: dev-h...@pdfbox.apache.org > - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
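To make the point about inherited field types concrete, here is a minimal sketch of the behaviour discussed above: walk the field tree recursively, build the fully qualified name from the partial names joined by periods, and let a kid without its own /FT pick up the parent's. FieldNode is a hypothetical stand-in, not a PDFBox class, and the sketch ignores the case where Kids holds only widget annotations (such a field is still terminal).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; FieldNode stands in for a field dictionary and is not PDFBox API.
class FormFieldSketch {

    static class FieldNode {
        final FieldNode parent;
        final String partialName; // /T
        final String fieldType;   // /FT, null when it is meant to be inherited
        final String value;       // /V, may be null
        final List<FieldNode> kids = new ArrayList<>();

        FieldNode(FieldNode parent, String partialName, String fieldType, String value) {
            this.parent = parent;
            this.partialName = partialName;
            this.fieldType = fieldType;
            this.value = value;
            if (parent != null) {
                parent.kids.add(this);
            }
        }

        // Partial names from the root down to this node, joined by periods.
        String fullyQualifiedName() {
            return parent == null ? partialName : parent.fullyQualifiedName() + "." + partialName;
        }

        // /FT is inheritable: fall back to the nearest ancestor that defines it.
        String inheritedFieldType() {
            if (fieldType != null) {
                return fieldType;
            }
            return parent == null ? null : parent.inheritedFieldType();
        }
    }

    // Collect "name (type): value" for every node without kids, recursing through the others.
    static void collectTerminals(FieldNode node, List<String> out) {
        if (node.kids.isEmpty()) {
            out.add(node.fullyQualifiedName() + " (" + node.inheritedFieldType() + "): " + node.value);
            return;
        }
        for (FieldNode kid : node.kids) {
            collectTerminals(kid, out);
        }
    }

    public static void main(String[] args) {
        // Mirrors the structure described in TIKA-2442: /FT on the parent only.
        FieldNode parent = new FieldNode(null, "parent", "Tx", null);
        new FieldNode(parent, "child1", null, "child1 value");
        new FieldNode(parent, "child2", null, "child2 value");

        List<String> out = new ArrayList<>();
        collectTerminals(parent, out);
        for (String line : out) {
            System.out.println(line);
        }
    }
}
```

The key design point is that the decision "terminal or not" is made from the kids, not from the presence of /FT on the node itself.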
FW: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively
All, I can't tell if the triggering file is corrupt or how we want to handle it on the PDFBox side. The problem is that the parent node is a PDTextField -- a PDTerminalField -- so we don't/can't look for children, even though it actually does have pointers in Kids. The output from PrintFields is: 1 top-level fields were found on the form |--parent.parent = , type=org.apache.pdfbox.pdmodel.interactive.form.PDTextField -Original Message- From: Tim Allison (JIRA) [mailto:j...@apache.org] Sent: Monday, August 14, 2017 10:36 AM To: d...@tika.apache.org Subject: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively [ https://issues.apache.org/jira/browse/TIKA-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125756#comment-16125756 ] > Non-terminal interactive form fields not handled recursively > > > Key: TIKA-2442 > URL: https://issues.apache.org/jira/browse/TIKA-2442 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Christopher Creutzig > Attachments: simple-form.pdf > > > (I am not sure if this is a Tika or a PDFBox problem; I tried finding > a form extractor in PDFBox, but the app api does not have one. PDFDebugger > does show me the expected tree structure.) The attached PDF has a > non-terminal field named “parent” and two children, “child1” and “child2.” > According to the PDF spec in section 8.6, the fully qualified field names > should be parent.child1 and parent.child2. That is the output given by pdftk: > > pdftk simple-form.pdf dump_data_fields > --- > FieldType: Text > FieldName: parent.child1 > FieldFlags: 0 > FieldValue: child1 value > FieldJustification: Left > --- > FieldType: Text > FieldName: parent.child2 > FieldFlags: 0 > FieldValue: child2 value > FieldJustification: Left > Tika with the ToXMLContentHandler seems to silently ignore the children, > however, returning only a parent with no value. 
> Calling code: > import java.io.FileInputStream; > import org.apache.tika.detect.DefaultDetector; > import org.apache.tika.detect.Detector; import > org.apache.tika.metadata.Metadata; > import org.apache.tika.parser.AutoDetectParser; > import org.apache.tika.parser.ParseContext; > import org.apache.tika.parser.Parser; > import org.apache.tika.parser.PasswordProvider; > import org.apache.tika.sax.ToXMLContentHandler; > class readAsXHTML { > public static String readAsXHTML(String filename) throws Exception { > ToXMLContentHandler handler = new ToXMLContentHandler(); > Detector detector = new DefaultDetector(); > Parser parser = new AutoDetectParser(detector); > ParseContext context = new ParseContext(); > Metadata metadata = new Metadata(); > FileInputStream fh = null; > final String pass = password; > try { > fh = new FileInputStream(filename); > parser.parse(fh, handler, metadata, context); > > return(handler.toString()); > } > finally { > if (fh != null) { > fh.close(); > } > } > } > } > Abbreviated output: > > > parent: > > > > Expected: > > > > parent.child1: child1 value > parent.child2: child2 value -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
FW: Tika content detection and crawled "remote" content
All,

> If anyone is interested in using the detected MIME types or anything else
> from Common Crawl - I'm happy to help! The URL index [4] contains now a new
> field "mime-detected" which makes it easy to search or grep for confusion
> pairs.

This is an amazing step forward for sampling PDF files from Common Crawl. I used to rely on the http-headers and/or file suffix, but now we also have Tika's judgment on every file in Common Crawl. We still have to deal with the 1MB truncation (I think), but this is an amazing development. Thank you, Sebastian!

Cheers, Tim

-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
Sent: Tuesday, July 4, 2017 6:18 AM
To: u...@tika.apache.org
Subject: Tika content detection and crawled "remote" content

Hi,

recently I've plugged in Tika's content detection into Common Crawl's crawler (modified Nutch) with the target to get clean and correct MIME type - the HTTP Content-Type may contain garbage and isn't always correct [1].

For the June 2017 crawl I've prepared a comparison of content types sent by the server in the HTTP header and as detected by Tika 1.15 [2]. It shows that content types by Tika are definitely clean (1,400 different content types vs. more than 6,000 content type "strings" from HTTP headers).

A look on the "confusions" where Content-Type and Tika differ, shows a mixed picture: some pairs are plausible, e.g., if Tika changes the type to a more precise subtype or detects the MIME at all:

     Count  Tika-1.15              HTTP-Content-Type
1001968023  application/xhtml+xml  text/html
   2298146  application/rss+xml    text/xml
    617435  application/rss+xml    application/xml
    613525  text/html              unk
    361525  application/xhtml+xml  unk
    297707  application/rdf+xml    application/xml

However, there are a few dubious decisions, esp. the group of web server-side scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

  Count  Tika-1.15          HTTP-Content-Type
2047739  text/x-php         text/html
 681629  text/asp           text/html
 193095  text/x-coldfusion  text/html
 172318  text/aspdotnet     text/html
 139033  text/x-jsp         text/html
  38415  text/x-cgi         text/html
  32092  text/x-php         text/xml
  18021  text/x-perl        text/html

Of course, due to misconfigurations some servers may deliver the script files unmodified but in general I wouldn't expect that this happens for millions of pages. I've checked some of the affected URLs:

- HTML fragment (no declaration of or opening tag)
  https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
  http://www.privi.com/product-details.asp?cno=C10910011
  http://mental-ray.de/Root_alt/Default.asp
  http://ekyrs.org/support/index.php?action=profile
  http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
- (overlong) comment block at start of HTML which "masks" the HTML declaration
  http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
  http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
  https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
  https://de.e-stories.org/categories.php?&lan=nl&art=p
- HTML with some scripting fragments ("") present:
  http://www.eco-ani-yao.org/shien/
- others are clearly HTML (looks more like a bug, at least, there is no simple explanation)
  http://www.proedinc.com/customer/content.aspx?redid=9
  http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
  http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
  http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79

Obviously certain file suffixes (.php, .aspx) should get less weight compared to Content-Type sent from the responding server.
Now my question: where's the best place to fix this: in the crawler [3] or in Tika? If anyone is interested in using the detected MIME types or anything else from Common Crawl - I'm happy to help! The URL index [4] contains now a new field "mime-detected" which makes it easy to search or grep for confusion pairs. Thanks and best, Sebastian [1] https://github.com/commoncrawl/nutch/issues/3 [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152 [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/ - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: d
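One possible shape for the "give the suffix less weight" idea, whether it ends up in the crawler or in Tika: if detection lands on a server-side script type while the server itself says HTML or XML, trust the server. This is only a sketch under that assumption. RemoteMimeSketch and choose() are invented names, not Tika or Nutch API, and the script-type list is simply the one from the confusion pairs in this thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical post-processing step; not part of Tika or Nutch.
class RemoteMimeSketch {

    // Server-side script types taken from the confusion pairs reported in this thread.
    private static final Set<String> SERVER_SIDE_SCRIPT_TYPES = new HashSet<>(Arrays.asList(
            "text/x-php", "text/asp", "text/x-coldfusion", "text/aspdotnet",
            "text/x-jsp", "text/x-cgi", "text/x-perl"));

    // Pick between the detected type and the HTTP Content-Type header value.
    static String choose(String detected, String httpContentType) {
        if (httpContentType == null || httpContentType.trim().isEmpty()) {
            return detected;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String http = httpContentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        boolean serverSaysMarkup = http.equals("text/html") || http.equals("text/xml");
        if (SERVER_SIDE_SCRIPT_TYPES.contains(detected) && serverSaysMarkup) {
            // For remote content the script was most likely executed, so what we fetched is its output.
            return http;
        }
        return detected;
    }

    public static void main(String[] args) {
        System.out.println(choose("text/x-php", "text/html; charset=UTF-8"));
        System.out.println(choose("application/rss+xml", "text/xml"));
    }
}
```

Note that this deliberately leaves the plausible refinements alone (e.g. application/rss+xml over text/xml) and only overrides the script types.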
RE: tika-eval
Ha. I hadn't realized the video was available until this post. Thank you! > And here is the talk about it Tim gave at ApacheCon > > https://youtu.be/vRPTPMwI53k?list=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp > > I've enjoyed it (the video). So did I! Tilman - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: [VOTE] Release Apache PDFBox 2.0.6
+1 Thank you! -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Friday, May 12, 2017 12:13 PM To: dev@pdfbox.apache.org Subject: [VOTE] Release Apache PDFBox 2.0.6 Hi, a candidate for the PDFBox 2.0.6 release is available at: https://dist.apache.org/repos/dist/dev/pdfbox/2.0.6/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/pdfbox/tags/2.0.6/ The SHA1 checksum of the archive is cb04fa19058efca6913a45490ac66cf44ecf273a. Please vote on releasing this package as Apache PDFBox 2.0.6. The vote is open for the next 72 hours and passes if a majority of at least three +1 PDFBox PMC votes are cast. [ ] +1 Release this package as Apache PDFBox 2.0.6 [ ] -1 Do not release this package because... Here is my +1 BR Andreas Lehmkühler - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: 2.0.6 release ?
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz Looks good to me on a very cursory look.
RE: 2.0.6 release ?
> It isn't that secret as Tim posted it somewhere in this thread :) I've added throttling to httpd (I think) so we should be ok, and y, the address is out in the open now. Let me know if I should kick off another run. Thank you, all! - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: 2.0.6 release ?
Haven't had a chance to look. Reports are here: http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz
RE: 2.0.6 release ?
I won't have results immediately. :) -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 4:13 PM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 09.05.2017 um 22:03 schrieb Allison, Timothy B.: > UGH. I'm so wrong. I accidentally had a 2.0.4.jar in my app/target... > > > > Off we go? Yes! However it's 10pm here, so I won't be able to react to the results immediately. Tilman > > > -Original Message- > From: Tilman Hausherr [mailto:thaush...@t-online.de] > Sent: Tuesday, May 9, 2017 3:49 PM > To: dev@pdfbox.apache.org > Subject: Re: 2.0.6 release ? > > You caught me... I haven't checked these yet. > > But I did now, with > MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf > 3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx > IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx > but they don't throw an NPE anymore now. > > Oops... I see I have that check you mention in my code, it has been there for > months and I forgot to make an issue. But after removing it, it still works > with the three files... so the question is, can this parameter ever be null, > or not? > > Tilman > > Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.: >> Should we return false if the node is null in PDPageTree#isPageTreeNode (66 >> new NPE exceptions)? Has this been fixed, or would that cause unintended >> problems? >> >> /** >>* Returns true if the node is a page tree node (i.e. and >> intermediate). >>*/ >> private boolean isPageTreeNode(COSDictionary node ) >> { >> // some files such as PDFBOX-2250-229205.pdf don't have Pages set >> as the Type, so we have >> // to check for the presence of Kids too >> return node.getCOSName(COSName.TYPE) == COSName.PAGES || >> node.containsKey(COSName.KIDS); >> } >> >> -Original Message- >> From: Tilman Hausherr [mailto:thaush...@t-online.de] >> Sent: Tuesday, May 9, 2017 3:20 PM >> To: dev@pdfbox.apache.org >> Subject: Re: 2.0.6 release ? 
>> >> Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.: >>>> I've fixed all remaining regression tickets (in the end it was >>>> exactly 1) >>> Great! Thank you! >>> >>> Let me know when I should kick off another eval. >> Yes, please do. >> >> Thanks >> >> Tilman - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: 2.0.6 release ?
UGH. I'm so wrong. I accidentally had a 2.0.4.jar in my app/target... Off we go? -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 3:49 PM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? You caught me... I haven't checked these yet. But I did now, with MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf 3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx but they don't throw an NPE anymore now. Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not? Tilman Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.: > Should we return false if the node is null in PDPageTree#isPageTreeNode (66 > new NPE exceptions)? Has this been fixed, or would that cause unintended > problems? > > /** > * Returns true if the node is a page tree node (i.e. and intermediate). > */ > private boolean isPageTreeNode(COSDictionary node ) > { > // some files such as PDFBOX-2250-229205.pdf don't have Pages set as > the Type, so we have > // to check for the presence of Kids too > return node.getCOSName(COSName.TYPE) == COSName.PAGES || > node.containsKey(COSName.KIDS); > } > > -Original Message- > From: Tilman Hausherr [mailto:thaush...@t-online.de] > Sent: Tuesday, May 9, 2017 3:20 PM > To: dev@pdfbox.apache.org > Subject: Re: 2.0.6 release ? > > Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.: >>> I've fixed all remaining regression tickets (in the end it was >>> exactly 1) >> Great! Thank you! >> >> Let me know when I should kick off another eval. > > Yes, please do. 
> > Thanks > > Tilman
RE: 2.0.6 release ?
With lots of empty pages... -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, May 9, 2017 3:57 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.6 release ? Doh. AR can't open it. Sorry. Chrome appears to be able to open it. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, May 9, 2017 3:56 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.6 release ? commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K throws NPE and opens without complaint in AR. -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 3:49 PM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? You caught me... I haven't checked these yet. But I did now, with MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf 3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx but they don't throw an NPE anymore now. Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not? Tilman Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.: > Should we return false if the node is null in PDPageTree#isPageTreeNode (66 > new NPE exceptions)? Has this been fixed, or would that cause unintended > problems? > > /** > * Returns true if the node is a page tree node (i.e. and intermediate). > */ > private boolean isPageTreeNode(COSDictionary node ) > { > // some files such as PDFBOX-2250-229205.pdf don't have Pages set as > the Type, so we have > // to check for the presence of Kids too > return node.getCOSName(COSName.TYPE) == COSName.PAGES || > node.containsKey(COSName.KIDS); > } > > -Original Message- > From: Tilman Hausherr [mailto:thaush...@t-online.de] > Sent: Tuesday, May 9, 2017 3:20 PM > To: dev@pdfbox.apache.org > Subject: Re: 2.0.6 release ? 
> > Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.: >>> I've fixed all remaining regression tickets (in the end it was >>> exactly 1) >> Great! Thank you! >> >> Let me know when I should kick off another eval. > > Yes, please do. > > Thanks > > Tilman
RE: 2.0.6 release ?
Doh. AR can't open it. Sorry. Chrome appears to be able to open it. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, May 9, 2017 3:56 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.6 release ? commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K throws NPE and opens without complaint in AR. -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 3:49 PM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? You caught me... I haven't checked these yet. But I did now, with MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf 3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx but they don't throw an NPE anymore now. Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not? Tilman Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.: > Should we return false if the node is null in PDPageTree#isPageTreeNode (66 > new NPE exceptions)? Has this been fixed, or would that cause unintended > problems? > > /** > * Returns true if the node is a page tree node (i.e. and intermediate). > */ > private boolean isPageTreeNode(COSDictionary node ) > { > // some files such as PDFBOX-2250-229205.pdf don't have Pages set as > the Type, so we have > // to check for the presence of Kids too > return node.getCOSName(COSName.TYPE) == COSName.PAGES || > node.containsKey(COSName.KIDS); > } > > -Original Message- > From: Tilman Hausherr [mailto:thaush...@t-online.de] > Sent: Tuesday, May 9, 2017 3:20 PM > To: dev@pdfbox.apache.org > Subject: Re: 2.0.6 release ? > > Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.: >>> I've fixed all remaining regression tickets (in the end it was >>> exactly 1) >> Great! Thank you! >> >> Let me know when I should kick off another eval. > > Yes, please do. 
> > Thanks > > Tilman
RE: 2.0.6 release ?
commoncrawl2_likely_broken/WL/WL4ZBGPG6543HIT24KCT7XZUIL5NBQ6K throws NPE and opens without complaint in AR. -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 3:49 PM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? You caught me... I haven't checked these yet. But I did now, with MD6X34Z6CODIJODXTOH5E6WJ7VUUPITO.pdf 3TE3TRHZVL2ZJGUUASZEWNOY6DXRTTEK.ashx IDEJP3MH4FCZNNTWQDMZ6TD2MOIPAYZ7.ashx but they don't throw an NPE anymore now. Oops... I see I have that check you mention in my code, it has been there for months and I forgot to make an issue. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not? Tilman Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.: > Should we return false if the node is null in PDPageTree#isPageTreeNode (66 > new NPE exceptions)? Has this been fixed, or would that cause unintended > problems? > > /** > * Returns true if the node is a page tree node (i.e. and intermediate). > */ > private boolean isPageTreeNode(COSDictionary node ) > { > // some files such as PDFBOX-2250-229205.pdf don't have Pages set as > the Type, so we have > // to check for the presence of Kids too > return node.getCOSName(COSName.TYPE) == COSName.PAGES || > node.containsKey(COSName.KIDS); > } > > -Original Message- > From: Tilman Hausherr [mailto:thaush...@t-online.de] > Sent: Tuesday, May 9, 2017 3:20 PM > To: dev@pdfbox.apache.org > Subject: Re: 2.0.6 release ? > > Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.: >>> I've fixed all remaining regression tickets (in the end it was >>> exactly 1) >> Great! Thank you! >> >> Let me know when I should kick off another eval. > > Yes, please do. 
> > Thanks > > Tilman
RE: 2.0.6 release ?
Should we return false if the node is null in PDPageTree#isPageTreeNode (66 new NPE exceptions)? Has this been fixed, or would that cause unintended problems?

    /**
     * Returns true if the node is a page tree node (i.e. and intermediate).
     */
    private boolean isPageTreeNode(COSDictionary node )
    {
        // some files such as PDFBOX-2250-229205.pdf don't have Pages set as the Type, so we have
        // to check for the presence of Kids too
        return node.getCOSName(COSName.TYPE) == COSName.PAGES ||
               node.containsKey(COSName.KIDS);
    }

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:20 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?

Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>> I've fixed all remaining regression tickets (in the end it was
>> exactly 1)
> Great! Thank you!
>
> Let me know when I should kick off another eval.

Yes, please do.

Thanks

Tilman
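The null guard asked about in this message can be sketched without PDFBox on the classpath. This is only an illustration of the proposal, not PDFBox's actual code: `Dict` is a hypothetical stand-in for `COSDictionary`, and the plain strings "Type", "Pages" and "Kids" stand in for `COSName.TYPE`, `COSName.PAGES` and `COSName.KIDS`. Whether a null node can legitimately reach the method is the open question in the thread, and the sketch does not answer it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for COSDictionary; NOT the PDFBox class.
final class Dict {
    private final Map<String, Object> entries = new HashMap<>();

    Dict put(String key, Object value) {
        entries.put(key, value);
        return this;
    }

    Object get(String key) {
        return entries.get(key);
    }

    boolean containsKey(String key) {
        return entries.containsKey(key);
    }
}

final class PageTreeNodeSketch {
    private PageTreeNodeSketch() {
    }

    // Same test as the quoted method, plus the proposed null guard.
    static boolean isPageTreeNode(Dict node) {
        if (node == null) {
            // Proposed behaviour: a missing node is simply "not a page tree node".
            return false;
        }
        // Some files lack Type = Pages, so the presence of Kids is accepted too.
        return "Pages".equals(node.get("Type")) || node.containsKey("Kids");
    }

    public static void main(String[] args) {
        System.out.println(isPageTreeNode(null));
        System.out.println(isPageTreeNode(new Dict().put("Type", "Pages")));
        System.out.println(isPageTreeNode(new Dict().put("Kids", new Object[0])));
        System.out.println(isPageTreeNode(new Dict().put("Type", "Page")));
    }
}
```

With the guard in place the caller sees `false` instead of an NPE. The trade-off is that a null node is then silently treated as a leaf, which may hide a parsing problem further up.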
RE: 2.0.6 release ?
>I've fixed all remaining regression tickets (in the end it was exactly 1) Great! Thank you! Let me know when I should kick off another eval.
RE: 2.0.6 release ?
Added a page count comparison report under "content/": http://162.242.228.174/reports/reports_pdfbox_2_0_6c.tar.gz -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, May 9, 2017 2:39 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.6 release ? http://162.242.228.174/reports/reports_pdfbox_2_0_6b.tar.gz Added CONTAINER_LENGTH to reports that have a file path. This is the length in bytes of the container file (as opposed to the embedded file). Thank you! -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 10:07 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 09.05.2017 um 02:43 schrieb Allison, Timothy B.: > Tilman's initial recommendation Can you do me another favor? Have a column with the size in any table that is about individual files. I think it was there in the past, but I may be wrong. Reason: I try to get small files to keep any "examples" for my regression tests. Tilman
RE: 2.0.6 release ?
http://162.242.228.174/reports/reports_pdfbox_2_0_6b.tar.gz Added CONTAINER_LENGTH to reports that have a file path. This is the length in bytes of the container file (as opposed to the embedded file). Thank you! -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 10:07 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 09.05.2017 um 02:43 schrieb Allison, Timothy B.: > Tilman's initial recommendation Can you do me another favor? Have a column with the size in any table that is about individual files. I think it was there in the past, but I may be wrong. Reason: I try to get small files to keep any "examples" for my regression tests. Tilman
RE: 2.0.6 release ?
Y. Will do. Meetings beckon, so it will take a few hours. :( -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 10:07 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 09.05.2017 um 02:43 schrieb Allison, Timothy B.: > Tilman's initial recommendation Can you do me another favor? Have a column with the size in any table that is about individual files. I think it was there in the past, but I may be wrong. Reason: I try to get small files to keep any "examples" for my regression tests. Tilman
RE: 2.0.6 release ?
For the reports comparing 2.0.3 with 2.0.5, see https://github.com/tballison/share/blob/master/tika_comparisons/reports_1_14V1_15.zip That was a full run against all file types of Tika 1.14 vs 1.15-SNAPSHOT from April 25. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, May 8, 2017 8:43 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.6 release ? Content 1) To get a _general_ sense of overall content extract, see "content/common_token_comparisons_by_mime.xlsx" This suggests that we've lost 248k "common words"[1], which out of 2.6 billion isn't much. However, we also lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an improvement. 2) If you want to compare content whether or not there was a parse exception, see "content/content_diffs_with_exceptions.xlsx" 3) If you only want to see content diffs where both extracts did not have an exception, see "content/content_diffs_ignore_exceptions.xlsx". To make quick sense of the content_diffs_files, sort "NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files lost the most common tokens. To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which compare the number of unique tokens/tokens in common...a low number means little similarity, while a number close to 1.0 means that the unigrams are nearly identical. From a quick look, many of the files with fewer common words are in the "likely_broken" and/or "truncated" subdirectories... Some exceptions to this rule include the following, but there are more...and overall, there is a fair amount of loss from 2.0.3. govdocs1/202/202097.pdf govdocs1/358/358043.pdf commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6 commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56 [1] For this version of tika-eval, I expanded Tilman's initial recommendation of common words for English a bit.
I took the top 20k most common words (4 characters or more, except for CJK) for a large number of Wikipedia dumps. I removed common html markup words (body, form, table) so that failure to strip html doesn't incorrectly boost scores. We apply language id and then use the common words for that language. For example, for truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW * PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 tokens from the French list of common words. * PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there were 320 common words from the English list of common words. -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Monday, May 8, 2017 10:01 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.: > Happy to. Will kick off now? Yes Tilman > > -Original Message- > From: Tilman Hausherr [mailto:thaush...@t-online.de] > Sent: Saturday, May 6, 2017 10:02 AM > To: dev@pdfbox.apache.org > Subject: Re: 2.0.6 release ? > > Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler: >> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler: >>> Hi, >>> >>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, >>> any objections? >> I'm targeting the 15th or 16th > Tim, could you please run your tests when time allows? > > Thanks > > Tilman
RE: 2.0.6 release ?
Content 1) To get a _general_ sense of overall content extract, see "content/common_token_comparisons_by_mime.xlsx" This suggests that we've lost 248k "common words"[1], which out of 2.6 billion isn't much. However, we also lost 18 million common words going from 2.0.3 (Tika 1.14) to 2.0.5 (Tika 1.15-SNAPSHOT)...so I'd hope the fix to PDFBOX-3717 would have led to an improvement. 2) If you want to compare content whether or not there was a parse exception, see "content/content_diffs_with_exceptions.xlsx" 3) If you only want to see content diffs where both extracts did not have an exception, see "content/content_diffs_ignore_exceptions.xlsx". To make quick sense of the content_diffs_files, sort "NUM_COMMON_TOKENS_DIFF_IN_B" in ascending order, and you'll see which files lost the most common tokens. To see which files changed the most, sort on DICE_COEFFICIENT or OVERLAP, which compare the number of unique tokens/tokens in common...a low number means little similarity, while a number close to 1.0 means that the unigrams are nearly identical. From a quick look, many of the files with fewer common words are in the "likely_broken" and/or "truncated" subdirectories... Some exceptions to this rule include the following, but there are more...and overall, there is a fair amount of loss from 2.0.3. govdocs1/202/202097.pdf govdocs1/358/358043.pdf commoncrawl2/C5/C5FUETRXI26MXZDK4YP5YYQA2N6GHEC6 commoncrawl2/QR/QRGKM44N7J62Y6BZHTP2BC7BCHF3SJ56 [1] For this version of tika-eval, I expanded Tilman's initial recommendation of common words for English a bit. I took the top 20k most common words (4 characters or more, except for CJK) for a large number of Wikipedia dumps. I removed common html markup words (body, form, table) so that failure to strip html doesn't incorrectly boost scores. We apply language id and then use the common words for that language.
For example, for truncated_pdfs/commoncrawl2_likely_broken/IA/IA64I4PY77P4IVKTLZ3WHRCNSODW3PZW * PDFBox 2.0.5 extracted text that was id'd as "French", and there were 1580 tokens from the French list of common words. * PDFBox 2.0.6-SNAPSHOT extracted text that was id'd as "English", and there were 320 common words from the English list of common words. -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Monday, May 8, 2017 10:01 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.: > Happy to. Will kick off now? Yes Tilman > > -Original Message- > From: Tilman Hausherr [mailto:thaush...@t-online.de] > Sent: Saturday, May 6, 2017 10:02 AM > To: dev@pdfbox.apache.org > Subject: Re: 2.0.6 release ? > > Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler: >> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler: >>> Hi, >>> >>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, >>> any objections? >> I'm targeting the 15th or 16th > Tim, could you please run your tests when time allows? > > Thanks > > Tilman
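To make the DICE_COEFFICIENT and OVERLAP columns described in this message more concrete, here is a small sketch of how two token lists could be compared. The thread does not give the formulas tika-eval uses, so the textbook definitions below are an assumption: Dice is computed over unique tokens, overlap over token counts. The class and method names are made up for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only; names and formulas are assumptions, not tika-eval code.
final class TokenSimilaritySketch {
    private TokenSimilaritySketch() {
    }

    // Dice over unique tokens: 2 * |A intersect B| / (|A| + |B|).
    static double dice(List<String> a, List<String> b) {
        Set<String> uniqueA = new HashSet<>(a);
        Set<String> uniqueB = new HashSet<>(b);
        if (uniqueA.isEmpty() && uniqueB.isEmpty()) {
            return 1.0; // assumption: two empty extracts count as identical
        }
        Set<String> common = new HashSet<>(uniqueA);
        common.retainAll(uniqueB);
        return 2.0 * common.size() / (uniqueA.size() + uniqueB.size());
    }

    // Overlap over token counts: 2 * sum(min(countA, countB)) / (|a| + |b|).
    static double overlap(List<String> a, List<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // same assumption as in dice()
        }
        Map<String, Integer> countsA = counts(a);
        Map<String, Integer> countsB = counts(b);
        long shared = 0;
        for (Map.Entry<String, Integer> e : countsA.entrySet()) {
            shared += Math.min(e.getValue(), countsB.getOrDefault(e.getKey(), 0));
        }
        return 2.0 * shared / (a.size() + b.size());
    }

    private static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> result = new HashMap<>();
        for (String token : tokens) {
            result.merge(token, 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> extractA = Arrays.asList("the", "cat", "sat", "the");
        List<String> extractB = Arrays.asList("the", "cat", "ran");
        System.out.println("dice    = " + dice(extractA, extractB));
        System.out.println("overlap = " + overlap(extractA, extractB));
    }
}
```

Both values fall between 0.0 (nothing in common) and 1.0 (identical unigrams), which matches the reading given in the message: sort ascending to find the files that changed the most.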
RE: 2.0.6 release ?
Results here: http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz A = 2.0.5 B = 2.0.6-SNAPSHOT from 12 hours ago. I've only had a chance to look at the exceptions, attachments and metadata so far. For the new exceptions (roughly grouped by stacktrace), see "exceptions/new_exceptions_in_B_by_mime_by_stack_trace.xlsx" For the full stack traces and triggering file paths (prepend http://162.242.228.174/docs to retrieve the source files), see "exceptions/new_excetions_in_B_details.xlsx". For the fixed exceptions, see "exceptions/fixed_exceptions_in_B_by_mime.xlsx" and *_details.xlsx. To confirm that the content from the "fixed exceptions" looks language-y, scan through "exceptions/contents_of_fixed_exceptions_in_B.xlsx". There are a few handfuls of diffs in attachments and metadata, and I'll look into these. Off to look at the contents... -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Monday, May 8, 2017 10:01 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.: > Happy to. Will kick off now? Yes Tilman > > -Original Message- > From: Tilman Hausherr [mailto:thaush...@t-online.de] > Sent: Saturday, May 6, 2017 10:02 AM > To: dev@pdfbox.apache.org > Subject: Re: 2.0.6 release ? > > Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler: >> Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler: >>> Hi, >>> >>> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, >>> any objections? >> I'm targeting the 15th or 16th > Tim, could you please run your tests when time allows? > > Thanks > > Tilman
RE: low priority: proxy settings and unit tests?
If there aren't objections, I'll open a ticket and make that change after the 2.0.6 release. Thank you! -Original Message- From: Maruan Sahyoun [mailto:sahy...@fileaffairs.de] Sent: Monday, May 8, 2017 9:39 AM To: dev@pdfbox.apache.org Subject: Re: low priority: proxy settings and unit tests? How about skipping the test on ".. Connection refused .."? BR Maruan > Am 08.05.2017 um 15:36 schrieb Allison, Timothy B. : > > All, > Apologies for this one... Is there an easy way to set proxy information for > the unit tests that get an InputStream via URL without changing any source > code or project poms? In Intellij, I can modify the program arguments for > each one, but then, of course, maven doesn't pick up that information when I > do a build. > > I've been adding @Ignore to the unit tests in my local copy, but there has > to be a better way. > > Failed tests: > PDButtonTest.testRadioButtonWithOptions:131 Unexpected IOException > Connection refused: connect > PDButtonTest.testOptionsAndNamesNotNumbers:187 Unexpected IOException > Connection refused: connect > > Tests in error: > MergeAcroFormsTest.testAPEntry:92 > Connect Connection refused: > connect > MergeAcroFormsTest.testAnnotsEntry:59 > Connect Connection refused: > connect > MergeAnnotationsTest.testLinkAnnotations:61 > Connect Connection refused: > conn...
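The "skip on Connection refused" idea suggested in this exchange could look roughly like the helper below. It is a sketch under assumptions, not the change that was later made in PDFBox: the class and method names are invented, and matching on the message text "Connection refused" is a guess based on the failure output quoted in the thread.

```java
import java.io.IOException;
import java.net.ConnectException;

// Hypothetical helper; not part of PDFBox or its test suite.
final class NetworkSkipSketch {
    private NetworkSkipSketch() {
    }

    // Walks the cause chain and reports whether any link is a
    // ConnectException whose message mentions "Connection refused".
    static boolean isConnectionRefused(Throwable failure) {
        for (Throwable cur = failure; cur != null; cur = cur.getCause()) {
            if (cur instanceof ConnectException
                    && cur.getMessage() != null
                    && cur.getMessage().contains("Connection refused")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated failures, so the demo needs no network access.
        Throwable refused = new IOException("Unexpected IOException",
                new ConnectException("Connection refused: connect"));
        Throwable other = new IOException("Stream closed");
        System.out.println(isConnectionRefused(refused));
        System.out.println(isConnectionRefused(other));
    }
}
```

In a JUnit 4 test, a catch block around the URL download could call this helper and, when it returns true, pass the exception to `org.junit.Assume.assumeNoException` so the test is reported as skipped rather than failed. That wiring is left out here because it needs JUnit on the classpath.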
low priority: proxy settings and unit tests?
All, Apologies for this one... Is there an easy way to set proxy information for the unit tests that get an InputStream via URL without changing any source code or project poms? In Intellij, I can modify the program arguments for each one, but then, of course, maven doesn't pick up that information when I do a build. I've been adding @Ignore to the unit tests in my local copy, but there has to be a better way. Failed tests: PDButtonTest.testRadioButtonWithOptions:131 Unexpected IOException Connection refused: connect PDButtonTest.testOptionsAndNamesNotNumbers:187 Unexpected IOException Connection refused: connect Tests in error: MergeAcroFormsTest.testAPEntry:92 > Connect Connection refused: connect MergeAcroFormsTest.testAnnotsEntry:59 > Connect Connection refused: connect MergeAnnotationsTest.testLinkAnnotations:61 > Connect Connection refused: conn...
RE: 2.0.6 release ?
Happy to. Will kick off now? -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Saturday, May 6, 2017 10:02 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler: > Am 02.05.2017 um 12:42 schrieb Andreas Lehmkühler: >> Hi, >> >> I'm planning to cut a 2.0.6 release in about 1 or 2 weeks from now, >> any objections? > I'm targeting the 15th or 16th Tim, could you please run your tests when time allows? Thanks Tilman
jai-imageio-core -- BSD-3 with nuclear clause
PDFBox colleagues, On TIKA-2338, we're considering incorporating jai-imageio-core into Tika (removing the "provided" scope) because the authors on github claim that they've removed the non-ASL 2.0 parts out of jai-imageio-core. We noticed, though, that this is BSD-3 with the nuclear clause. We can't find anything about nukes in the usual place[1]. We've opened LEGAL-304. Have you considered this at all? Would you have any insight into whether the nuclear clause is "field of use" (which would mean we could not do this) or "acceptance of no liability" (which would mean we could do this). Thank you. Cheers, Tim [1] https://www.apache.org/legal/resolved.html
RE: [VOTE] Release Apache PDFBox 2.0.5
Tilman and Andreas, thank you for taking a look! I agree no need to stop the release. The improvements far outweigh the small regression. > I had a look at content_diffs_with_exceptions.xlsx, then looking only > at govdocs there, all are similar or better. Y, agreed. Do we care about these likely broken PDFs from which 2.0.4 appears to be able to extract more "common words" than 2.0.5? commoncrawl2_likely_broken/OV/OVWMJPQGCK2AQZYVWJWYUPTERPXOGIAD commoncrawl2_likely_broken/R4/R4P75EJNMNXZC2DQYUFB6BSXQ2CWGVG7.pdf commoncrawl2_likely_broken/BI/BIVJLJ4QULQQ4VHKKNMBUTKWXAMMN53N.pdf commoncrawl2_likely_broken/LB/LB6LEZ75Y6OL7SGW7SV6JNO4G6FS7HAS commoncrawl2_likely_broken/LQ/LQQFDYEI7XTOBMFPSL3IDVKRMUB6YIGU commoncrawl2_likely_broken/OB/OBQTIKQW3MIEYJPGE4NR5WGPDUZC3ULY commoncrawl2_likely_broken/BC/BCZSFNQAB62TUBURWG6B3ZOZCG5IH46P commoncrawl2_likely_broken/TV/TVMANAJVH2VQVABYX6LCVO5KTERLFS2I.pdf Out of 543,805 PDFs in our test set, and given that they're broken, I'm not overly concerned. -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Wednesday, March 15, 2017 5:30 PM To: dev@pdfbox.apache.org Subject: Re: [VOTE] Release Apache PDFBox 2.0.5 Am 15.03.2017 um 19:07 schrieb Tilman Hausherr: > Thanks Tim! > > I looked at newExceptionsInBDetails.xlsx (247 entries). IMHO no need > to stop the release, the number of entries in > fixedExceptionsInBDetails.xlsx (506) is larger, and the files with exceptions > are cut off. I agree. However, I've checked one of the files 015664.pdf and it looks like a regression. I can open it using 2.0.4 but get the described exception with 2.0.5 :-( BR Andreas > I'll create an issue about these. > > I had a look at content_diffs_with_exceptions.xlsx, then looking only > at govdocs there, all are similar or better. > > Tilman > > Am 15.03.2017 um 00:03 schrieb Allison, Timothy B.: >> +1 >> >> I ran a comparison with 2.0.5-rc1 and (I think) 2.0.4 against ~500k >> files from our regression corpus. 
>> >> I haven't had a chance to do much digging, but I wanted to share what >> I had as soon as I had it. >> >> Reports are here: >> https://github.com/tballison/share/blob/master/pdfbox_comparisons/rep >> orts_pdfbox_2.0.5-rc1.zip >> >> >> Lots more "common words". Many fewer exceptions. There may be a >> regression that is causing 244 new exceptions, but on balance, the >> improvements are impressive. >> >> >> java.io.IOException: Missing root object specification in trailer. >> at >> org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(C >> OSParser.java:2169) >> >> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:222) >> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:271) >> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984) >> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:922) >> at >> ... >> >> -Original Message- >> From: Timo Boehme [mailto:timo.boe...@ontochem.com] >> Sent: Tuesday, March 14, 2017 9:11 AM >> To: dev@pdfbox.apache.org >> Subject: Re: [VOTE] Release Apache PDFBox 2.0.5 >> >> Hi, >> >> +1 >> >> Maybe we should add the >> -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true >> setting (introduced with 2.0.4) to the Migration/Getting Started >> Web-Pages. I had to look through my emails in order to find it and it >> really makes a difference (at least on some systems) if there are a >> lot of images on a page - so far we only have the >> -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider >> setting documented (which did not help in my case). At least the user >> may try it out if rendering gets slow on some pages; it may not be a >> good general setting as it also may slow rendering down a bit on pages with >> few large images. 
>> >> >> Best, >> Timo >> >> >> Am 13.03.2017 um 19:18 schrieb Andreas Lehmkuehler: >>> Hi, >>> >>> a candidate for the PDFBox 2.0.5 release is available at: >>> >>> https://dist.apache.org/repos/dist/dev/pdfbox/2.0.5/ >>> >>> The release candidate is a zip archive of the sources in: >>> >>> http://svn.apache.org/repos/asf/pdfbox/tags/2.0.5/ >>> >>> The SHA1 checksum of the archive is >>> 9521349be859498dfdd0e0f2a5d02b082f097ab1. >>> >>> Please vote on releasing this package as Apache PDFBox 2.0.5. >>> The vote is open for the next 72 hours and passes if a majority of
RE: [VOTE] Release Apache PDFBox 2.0.5
+1 I ran a comparison with 2.0.5-rc1 and (I think) 2.0.4 against ~500k files from our regression corpus. I haven't had a chance to do much digging, but I wanted to share what I had as soon as I had it. Reports are here: https://github.com/tballison/share/blob/master/pdfbox_comparisons/reports_pdfbox_2.0.5-rc1.zip Lots more "common words". Many fewer exceptions. There may be a regression that is causing 244 new exceptions, but on balance, the improvements are impressive. java.io.IOException: Missing root object specification in trailer. at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2169) at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:222) at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:271) at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:984) at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:922) at ... -Original Message- From: Timo Boehme [mailto:timo.boe...@ontochem.com] Sent: Tuesday, March 14, 2017 9:11 AM To: dev@pdfbox.apache.org Subject: Re: [VOTE] Release Apache PDFBox 2.0.5 Hi, +1 Maybe we should add the -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true setting (introduced with 2.0.4) to the Migration/Getting Started Web-Pages. I had to look through my emails in order to find it and it really makes a difference (at least on some systems) if there are a lot of images on a page - so far we only have the -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider setting documented (which did not help in my case). At least the user may try it out if rendering gets slow on some pages; it may not be a good general setting as it also may slow rendering down a bit on pages with few large images. 
Best, Timo Am 13.03.2017 um 19:18 schrieb Andreas Lehmkuehler: > Hi, > > a candidate for the PDFBox 2.0.5 release is available at: > > https://dist.apache.org/repos/dist/dev/pdfbox/2.0.5/ > > The release candidate is a zip archive of the sources in: > > http://svn.apache.org/repos/asf/pdfbox/tags/2.0.5/ > > The SHA1 checksum of the archive is > 9521349be859498dfdd0e0f2a5d02b082f097ab1. > > Please vote on releasing this package as Apache PDFBox 2.0.5. > The vote is open for the next 72 hours and passes if a majority of at > least three +1 PDFBox PMC votes are cast. > > [ ] +1 Release this package as Apache PDFBox 2.0.5 > [ ] -1 Do not release this package because... > > > Here is my +1 > > BR > Andreas Lehmkühler > > - > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For > additional commands, e-mail: dev-h...@pdfbox.apache.org > -- Timo Boehme OntoChem IT Solutions GmbH Blücherstraße 24 06120 Halle (Saale) Germany phone: +49 345 478 047 4| fax: +49 345 478 047 1 email: timo.boe...@ontochem.com | web: www.ontochem.com HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824 managing director : Lutz Weber - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
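For anyone who wants to try the two settings Timo mentions without touching the JVM command line, here is a minimal Java sketch. The property names and values are copied from his mail; the class name `RenderingFlags` and its methods are made up for illustration. It assumes the properties need to be set before the first rendering call (flags like these are usually read once), which the thread does not state explicitly.

```java
final class RenderingFlags {
    // Property names copied from the thread above.
    static final String PURE_JAVA_CMYK =
            "org.apache.pdfbox.rendering.UsePureJavaCMYKConversion";
    static final String JAVA2D_CMM = "sun.java2d.cmm";
    static final String KCMS_PROVIDER = "sun.java2d.cmm.kcms.KcmsServiceProvider";

    // Equivalent of -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true
    static void enablePureJavaCmyk() {
        System.setProperty(PURE_JAVA_CMYK, "true");
    }

    // Equivalent of -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
    static void enableKcms() {
        System.setProperty(JAVA2D_CMM, KCMS_PROVIDER);
    }

    public static void main(String[] args) {
        // Call this early, before any PDF page is rendered (assumption, see above).
        enablePureJavaCmyk();
        System.out.println(PURE_JAVA_CMYK + "=" + System.getProperty(PURE_JAVA_CMYK));
    }
}
```

As Timo notes, this is something to try when rendering is slow on image-heavy pages, not a setting to enable everywhere.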
tika-eval
All, I finally got around to adding tika-eval[1] to Apache Tika. If you have any interest in comparing the output of different tools/versions/parameters on text extraction, give it a try. You don't need to use Tika or format the output in a specific format; plain UTF-8 text will work. Tilman, I generalized your common word count methodology. The code now runs language id on the text and then counts the common words for that language. Lots more work remains. Thank you, all, for contributing to the methodologies! Cheers, Tim [1] https://wiki.apache.org/tika/TikaEval
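To make the "common words" metric concrete, here is a toy Java sketch of the idea as described in the mail: pick a word list for the language of the text, then count how many tokens of the extracted text appear in it. This is not tika-eval's code. The class name, the tiny hard-coded word lists, and passing the language code in by hand (instead of running language id) are all simplifying assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class CommonWordCounter {
    // Hypothetical, tiny word lists; a real run would use large per-language lists.
    private static final Map<String, Set<String>> COMMON = new HashMap<>();
    static {
        COMMON.put("en", new HashSet<>(Arrays.asList("the", "of", "in", "court", "state", "and")));
        COMMON.put("de", new HashSet<>(Arrays.asList("der", "die", "das", "und", "von")));
    }

    // Counts tokens of 'text' that are in the common-word list for 'lang'.
    static int count(String text, String lang) {
        Set<String> words = COMMON.get(lang);
        if (words == null) {
            return 0; // unknown language: nothing to compare against
        }
        int hits = 0;
        // Split on anything that is not a letter.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (words.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A correctly extracted line versus the same line with spurious spaces.
        System.out.println(count("IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE", "en")); // 8
        System.out.println(count("IN THE COUR T OF CHAN CER Y O F TH E STA TE OF D ELA WARE", "en")); // 4
    }
}
```

The two sample strings show why the count is a useful signal: extraction that breaks words apart scores lower on the same source text.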
RE: [VOTE] Release Apache PDFBox 2.0.4
+1 Comparisons available here: http://162.242.228.174/reports/reports_pdfbox_2_0_3_vs_2_0_4-rc1.tar.bz2 No new exceptions, a few fixed exceptions, better content extraction. Thank you, all! -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Monday, December 12, 2016 12:53 PM To: dev@pdfbox.apache.org Subject: [VOTE] Release Apache PDFBox 2.0.4 Hi, a candidate for the PDFBox 2.0.4 release is available at: https://dist.apache.org/repos/dist/dev/pdfbox/2.0.4/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/pdfbox/tags/2.0.4/ The SHA1 checksum of the archive is 4b1844a268d65b05ac371a848c0b8c27f390052b. Please vote on releasing this package as Apache PDFBox 2.0.4. The vote is open for the next 72 hours and passes if a majority of at least three +1 PDFBox PMC votes are cast. [ ] +1 Release this package as Apache PDFBox 2.0.4 [ ] -1 Do not release this package because... Here is my +1 BR Andreas Lehmkühler - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
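Checking the published checksum is a common step before voting. A minimal Java sketch of that check, using only `java.security.MessageDigest` from the JDK; the class and method names are hypothetical, and `sha1sum` on the command line does the same job.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha1Check {
    // Returns the SHA-1 digest of 'data' as a lowercase hex string.
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive
        // args[1]: checksum quoted in the vote mail
        String actual = sha1Hex(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
    }
}
```

Reading the whole archive into memory keeps the sketch short; for very large files a streaming `update` loop would be the better choice.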
RE: New releases
Or, turns out the 12th...ugh. I just kicked off the regression tests. Should have results within 8 hours.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, November 29, 2016 3:36 PM
To: dev@pdfbox.apache.org
Subject: RE: New releases

+1

I should have time to run the regression tests against 2.0.x the week of the 5th.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, November 29, 2016 2:21 AM
To: dev@pdfbox.apache.org
Subject: Re: New releases

On 28.11.2016 at 21:38, Andreas Lehmkuehler wrote:
> On 24.11.2016 at 14:43, Andreas Lehmkuehler wrote:
>> Hi,
>>
>> I'm planning to cut new releases for 1.8.x and 2.0.x in about 2-3
>> weeks from now.
>
> I'm going to cut the releases as follows:
>
> - 1.8.13 on Monday 5th of December
> - 2.0.4 on Monday 12th of December

+1
FW: ApacheCon Miami is coming in May.
> ApacheCon and Apache Big Data will be held at the Intercontinental in Miami, > Florida, May 16-18, 2017 I plan to attend. Who's in? Any interest in collaborating on a talk or submitting your own? Cheers, Tim -Original Message- From: Rich Bowen [mailto:rbo...@apache.org] Sent: Wednesday, November 30, 2016 1:34 PM Subject: ApacheCon Miami is coming in May. Dear Apache Committer, ApacheCon and Apache Big Data will be held at the Intercontinental in Miami, Florida, May 16-18, 2017. ...
RE: New releases
+1

I should have time to run the regression tests against 2.0.x the week of the 5th.

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, November 29, 2016 2:21 AM
To: dev@pdfbox.apache.org
Subject: Re: New releases

On 28.11.2016 at 21:38, Andreas Lehmkuehler wrote:
> On 24.11.2016 at 14:43, Andreas Lehmkuehler wrote:
>> Hi,
>>
>> I'm planning to cut new releases for 1.8.x and 2.0.x in about 2-3
>> weeks from now.
>
> I'm going to cut the releases as follows:
>
> - 1.8.13 on Monday 5th of December
> - 2.0.4 on Monday 12th of December

+1
Apache Tika's public regression corpus
All, I recently blogged about some of the work we're doing with a large scale regression corpus to make Tika, POI and PDFBox more robust and to identify regressions before release. If you'd like to chip in with recommendations, requests or Hadoop/Spark clusters (why not shoot for the stars), please do! http://openpreservation.org/blog/2016/10/04/apache-tikas-regression-corpus-tika-1302/ Many thanks, again, to Rackspace for our vm and to Common Crawl and govdocs1 for most of our files! Cheers, Tim
RE: New PDFBox Committer
Thank you, all! I am honored to join your ranks! Cheers, Tim -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Monday, September 19, 2016 7:55 AM To: dev@pdfbox.apache.org Subject: New PDFBox Committer Hi, I'm happy to announce that the PDFBox PMC has decided to offer committership in Apache PDFBox to Tim Allison. He has accepted the offer and should have his committer-bits ready by now. As all other committers Tim has joined the PMC as well. BR Andreas Lehmkühler P.S.: Some of you might already know Tim as committer and PMC Member of Apache Tika and Apache POI.
RE: PDFBox 2.0.3 TIKA comparison
Great. Thank you!

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Thursday, September 15, 2016 12:03 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison

On 14.09.2016 at 20:50, Tilman Hausherr wrote:
>
>> On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
>>>
>>> There are some regressions in content extraction, but overall,
>>> content extraction looks to have improved quite a bit. Looks like
>>> ~2 million more "common English words" via Tilman's methodology.
>
> After some wandering around I finally looked at content extraction
> only, at column P ("TOP_10_MORE_IN_A") for cells with meaningful words.
> It turned out that all files were from Delaware courts, so I've
> decided to look only at one single file,
> Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW.
> The extracted text with 2.0.2 and 2.0.3 is
>
> IN THE COUR T OF CHAN CER Y O F TH E STA TE OF D ELA WARE
>
> in 2.0.1 and 1.8 it is
>
> IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE
>
> For 1.8 the explanation is that text extraction takes words, while in
> 2.* each character is taken alone.
>
> The bad result in 2.0.3 is because of an incorrect /W array. The space
> has a width of 3, while other characters have widths between 200 and
> 722. So PDFBox believes that there are spaces where there are none.

The story is different: the space width (which is 250, not 3; the table is a ranges array) is NOT taken from the space glyph, but from an average of all glyphs. It's a good thing I looked back in the history.

The breaking change was in rev 1744613 (PDFBOX-3354) and is related to the calculation of the average glyph width. Before rev 1744613 the averageWidth was always 0 (due to a bug likely accidentally introduced in some refactoring), which was corrected to a default value (1000) in text extraction. Starting with rev 1744613 an average width was calculated, but due to many 0 values (over 65534) in the /W ranges array, the result was unreliable:

/W [1 1 0 2 3 250 4 10 0 11 12 333 13 14 0 15 15 250 16 16 333 17 17 250 18 18 277 19 19 0 20 23 500 24 35 0 36 36 722 37 37 666 38 39 722 40 40 666 41 41 610 42 43 777 44 44 389 45 45 0 46 46 777 47 47 666 48 48 943 49 49 722 50 50 777 51 51 610 52 52 0 53 53 722 54 54 556 55 55 666 56 57 722 59 59 0 60 60 722 61 67 0 68 68 500 69 69 556 70 70 443 71 71 556 72 72 443 73 73 333 74 74 500 75 75 556 76 76 277 77 77 0 78 78 556 79 79 277 80 80 833 81 81 556 82 82 500 83 84 556 85 85 443 86 86 389 87 87 333 88 88 556 89 89 0 90 90 722 91 92 500 93 178 0 179 180 500 181 181 0 182 182 333 183 751 0 752 752 198 753 794 0 795 795 612 796 1126 0 1127 1127 125 1129 1129 2000 1130 65534 0]

Solution: ignore widths that are <= 0. 0 values are already ignored in PDFont, but not in PDCIDFont.

Average width before the fix: 0.52861196. After the fix: 549.8571.

I'll open an issue and commit a fix after sending this. It won't be in 2.0.3, but in 2.0.4.

Tilman
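The arithmetic behind the two averages can be reproduced with a short Java sketch. It reads the /W array quoted above as (first CID, last CID, width) triples and averages the widths once over all CIDs and once over only the CIDs with a width above 0. This is not the PDFBox code from PDCIDFont; the class and method names are made up, and the sketch assumes the array uses only the range form (the PDF format also allows a `c [w1 w2 ...]` form, which is not handled here).

```java
final class AverageWidth {
    // The /W array from the mail, as (firstCid, lastCid, width) triples.
    static final int[] W = {
        1, 1, 0,      2, 3, 250,    4, 10, 0,     11, 12, 333,  13, 14, 0,
        15, 15, 250,  16, 16, 333,  17, 17, 250,  18, 18, 277,  19, 19, 0,
        20, 23, 500,  24, 35, 0,    36, 36, 722,  37, 37, 666,  38, 39, 722,
        40, 40, 666,  41, 41, 610,  42, 43, 777,  44, 44, 389,  45, 45, 0,
        46, 46, 777,  47, 47, 666,  48, 48, 943,  49, 49, 722,  50, 50, 777,
        51, 51, 610,  52, 52, 0,    53, 53, 722,  54, 54, 556,  55, 55, 666,
        56, 57, 722,  59, 59, 0,    60, 60, 722,  61, 67, 0,    68, 68, 500,
        69, 69, 556,  70, 70, 443,  71, 71, 556,  72, 72, 443,  73, 73, 333,
        74, 74, 500,  75, 75, 556,  76, 76, 277,  77, 77, 0,    78, 78, 556,
        79, 79, 277,  80, 80, 833,  81, 81, 556,  82, 82, 500,  83, 84, 556,
        85, 85, 443,  86, 86, 389,  87, 87, 333,  88, 88, 556,  89, 89, 0,
        90, 90, 722,  91, 92, 500,  93, 178, 0,   179, 180, 500, 181, 181, 0,
        182, 182, 333, 183, 751, 0, 752, 752, 198, 753, 794, 0, 795, 795, 612,
        796, 1126, 0, 1127, 1127, 125, 1129, 1129, 2000, 1130, 65534, 0
    };

    // Returns { average over all CIDs, average over CIDs with width > 0 }.
    static float[] averages(int[] ranges) {
        long sum = 0;       // sum of all widths (zeros add nothing)
        long all = 0;       // number of CIDs covered by the array
        long positive = 0;  // number of CIDs with a width > 0
        for (int i = 0; i + 2 < ranges.length; i += 3) {
            int cids = ranges[i + 1] - ranges[i] + 1;
            int width = ranges[i + 2];
            all += cids;
            if (width > 0) {
                sum += (long) cids * width;
                positive += cids;
            }
        }
        return new float[] { (float) sum / all, (float) sum / positive };
    }

    public static void main(String[] args) {
        float[] result = averages(W);
        System.out.println("including zeros: " + result[0]);
        System.out.println("ignoring zeros:  " + result[1]);
    }
}
```

With this array, 63 CIDs carry a positive width (sum 34641) out of 65532 CIDs covered, which gives roughly 0.5286 when zeros are included and roughly 549.857 when they are ignored, matching the figures in the mail.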
RE: PDFBox 2.0.3 TIKA comparison
Perfect. Thank you! -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Thursday, September 15, 2016 8:31 AM To: dev@pdfbox.apache.org Subject: Re: PDFBox 2.0.3 TIKA comparison Am 15.09.2016 um 13:52 schrieb Allison, Timothy B.: >> The one apparent major new exception for PDF files was apparently fixed >> before 2.0.3. So, please ignore that one! > > Wait...if possible, please confirm that you did fix this recently (within the > last week or two). I ran pdfbox app's (2.0.3) on a handful of triggering > files and didn't get the exception...however, it is possible that > multithreading might trigger this exception. I've fixed that 2 days ago, it's part of the RC. BR Andreas > > java.lang.NullPointerException > at > org.apache.pdfbox.pdmodel.font.encoding.Encoding.overwrite(Encoding.java:118) > at > org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.applyDifferences(DictionaryEncoding.java:151) > at > org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.(DictionaryEncoding.java:128) > at > org.apache.pdfbox.pdmodel.font.PDSimpleFont.readEncoding(PDSimpleFont.java:129) > at > org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.(PDTrueTypeFont.java:209) > at > org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75) > at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143) > at > org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60) > at > org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815) > at > org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472) > at > org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446) > at > org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149) > at > org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139) > at > 
org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391) > at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143) > at > org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) > at > org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) > at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111) > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) > at > org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) > at > org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188) > at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74) > at > org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) > at > org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:407) > at > org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104) > at > org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:182) > at > org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115) > at > org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > - > To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For > additional commands, e-mail: dev-h...@pdfbox.apache.org > - To unsubscribe, 
e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: PDFBox 2.0.3 TIKA comparison
If this doesn't look like something you've recently fixed, I can rerun with the actual 2.0.3-rc1 (only on PDFs!) and see if I'm still getting this exception.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, September 15, 2016 7:53 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 2.0.3 TIKA comparison
Importance: High

> The one apparent major new exception for PDF files was apparently fixed
> before 2.0.3. So, please ignore that one!

Wait...if possible, please confirm that you did fix this recently (within the last week or two). I ran pdfbox app's (2.0.3) on a handful of triggering files and didn't get the exception...however, it is possible that multithreading might trigger this exception.

java.lang.NullPointerException
at org.apache.pdfbox.pdmodel.font.encoding.Encoding.overwrite(Encoding.java:118)
at ...
RE: PDFBox 2.0.3 TIKA comparison
> The one apparent major new exception for PDF files was apparently fixed > before 2.0.3. So, please ignore that one! Wait...if possible, please confirm that you did fix this recently (within the last week or two). I ran pdfbox app's (2.0.3) on a handful of triggering files and didn't get the exception...however, it is possible that multithreading might trigger this exception. java.lang.NullPointerException at org.apache.pdfbox.pdmodel.font.encoding.Encoding.overwrite(Encoding.java:118) at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.applyDifferences(DictionaryEncoding.java:151) at org.apache.pdfbox.pdmodel.font.encoding.DictionaryEncoding.(DictionaryEncoding.java:128) at org.apache.pdfbox.pdmodel.font.PDSimpleFont.readEncoding(PDSimpleFont.java:129) at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.(PDTrueTypeFont.java:209) at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75) at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143) at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60) at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815) at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472) at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446) at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149) at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139) at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391) at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:143) at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:111) at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:146) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188) at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:74) at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:158) at org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:407) at org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104) at org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:182) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115) at org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: PDFBox 2.0.3 TIKA comparison
> Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, "content > extraction looks to have improved quite a bit" :-) Y, absolutely. Thank _you_ for reviewing the output and all of your other work, of course! Cheers, Tim -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Wednesday, September 14, 2016 2:50 PM To: dev@pdfbox.apache.org Subject: Re: PDFBox 2.0.3 TIKA comparison > Am 14.09.2016 um 18:38 schrieb Allison, Timothy B.: >> >> >> There are some regressions in content extraction, but overall, >> content extraction looks to have improved quite a bit. Looks like ~2 >> million more "common English words" via Tilman's methodology. After some wandering around I finally looked at content extraction only, at column P ("TOP_10_MORE_IN_A") for cells with meaningful words. It turned out that all files were from Delaware courts, so I've decided to look only at one single file, Y5TLCWTIAE3FYDVJTV2TXRZGXLEDUNSW. The extracted text with 2.0.2 and 2.0.3 is IN THE COUR T OF CHAN CER Y O F TH E STA TE OF D ELA WARE in 2.0.1 and 1.8 it is IN THE COURT OF CHANCERY OF THE STATE OF DELAWARE For 1.8 the explanation is that text extraction takes words, while in 2.* each character is taken alone. The bad result in 2.0.3 is because of an incorrect /W array. The space has a width of 3, while other characters have widths between 200 and 722. So PDFBox believes that there are spaces where there are none. The only mystery that remains is why it worked in 2.0.1. Maybe that one took an average glyph width for spaces, or the width value from the font itself. I'll find this out later, but it isn't a high priority. A look at column Q ("TOP_10_MORE_IN_B") shows a lot of good entries, so yes, "content extraction looks to have improved quite a bit" :-) Thanks for testing! 
Tilman
RE: PDFBox 2.0.3 TIKA comparison
That was caused by a cap we placed in Tika on extracting XMP history: TIKA-1999 [1]. We haven't switched to XMPBox...still on JempBox from 1.8.x.

[1] https://issues.apache.org/jira/browse/TIKA-1999

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Wednesday, September 14, 2016 12:52 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison

On 14.09.2016 at 18:38, Allison, Timothy B. wrote:
> https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip
>
> This run was against the full corpus, not just PDFs. I used a fairly recent
> nightly build of PDFBox and POI's 3.15-rc1.
>
> The one apparent major new exception for PDF files was apparently fixed
> before 2.0.3. So, please ignore that one!
>
> There are some regressions in content extraction, but overall, content
> extraction looks to have improved quite a bit. Looks like ~2 million more
> "common English words" via Tilman's methodology.
>
> Let me know if you have any questions.

I wonder what happened here: commoncrawl2/SH/SHMSOEBK4QOJO5CY7BIWWDH6GHSTOXYM metadata went from 6766 to 4134. Is this a TIKA thing, or is this because of a change from xmpbox to jempbox?

Tilman
RE: PDFBox 2.0.3?
https://github.com/tballison/share/blob/master/tika_comparisons/reports_tika_20160904_dev.zip

This run was against the full corpus, not just PDFs. I used a fairly recent nightly build of PDFBox and POI's 3.15-rc1.

The one apparent major new exception for PDF files was apparently fixed before 2.0.3. So, please ignore that one!

There are some regressions in content extraction, but overall, content extraction looks to have improved quite a bit. Looks like ~2 million more "common English words" via Tilman's methodology.

Let me know if you have any questions.

Cheers,
Tim

-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, September 12, 2016 12:58 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3?

On 12.09.2016 at 18:47, Allison, Timothy B. wrote:
> Let me know if/when to run a comparison between 2.0.3 and 2.0.1 (shipped w/
> Tika 1.13).

Yes please, when you have the time, I expect no more changes.

Tilman
RE: PDFBox 2.0.3?
Let me know if/when to run a comparison between 2.0.3 and 2.0.1 (shipped w/ Tika 1.13). Cheers, Tim
PDFBox 2.0.3?
PDFBox Colleagues, We may be heading towards a release of Tika 1.14 over the next month, maybe early September. Any plans for a PDFBox 2.0.3 release before then? I'm happy to recommend to my Tika-colleagues a delay if you would naturally be releasing somewhere around then. Best, Tim
FW: Apache Tika used to parse the Panama papers!
Looks like quite a few PDFs [0]... Couldn't have done it without you! Cheers, Tim P.S. Tip of the hat to Andreas for rt the link! [0] https://twitter.com/bigdata/status/717346207312392192 -Original Message- From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Tuesday, April 05, 2016 6:47 PM To: d...@tika.apache.org Cc: pr...@apache.org Subject: Apache Tika used to parse the Panama papers! FYI: http://www.forbes.com/sites/thomasbrewster/2016/04/05/panama-papers-amazon-encryption-epic-leak/?utm_campaign=ForbesTech&utm_source=TWITTER&utm_medium=social&utm_channel=Technology&linkId=23087770#709893771df5 BTW I know Thomas and am in touch..he wrote an article about MEMEX last year. ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Director, Information Retrieval and Data Science Group (IRDS) Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA WWW: http://irds.usc.edu/ ++
RE: shading/relocating 1.8.x?
Got it. That's what I had assumed. I'll hold off on opening truncated file issue(s) on PDFBox's JIRA... I opened TIKA-1912 to track this on our side.

Thank you, again!

Best,
Tim

-Original Message-
From: Andreas Lehmkühler [mailto:andr...@lehmi.de]
Sent: Tuesday, March 29, 2016 7:12 AM
To: dev@pdfbox.apache.org
Subject: RE: shading/relocating 1.8.x?

> On 28 March 2016 at 21:02, "Allison, Timothy B." wrote:
>
> Oh, wow, so it really might be possible without too much work? I'm
> more than happy to supply examples. :)

Oops, it isn't as simple as it sounds. If we simply swallow the exception, PDFBox most likely runs into an NPE. IMHO we have to implement some sort of on-demand parser which is able to handle null values for specific parts of a PDF without throwing any exception.

> Should I open an issue?

Thanks, but I'm going to do that soon, as some other things should be done as well.

BR
Andreas
RE: shading/relocating 1.8.x?
Oh, wow, so it really might be possible without too much work? I'm more than happy to supply examples. :)

Should I open an issue?

-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Monday, March 28, 2016 10:58 AM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?

On 25.03.2016 at 17:39, John Hewson wrote:
>
>> On 23 Mar 2016, at 06:20, Allison, Timothy B. wrote:
>>
>> All,
>> We've upgraded to 2.0.0 on Tika. Many thanks again!
>> One of our users is interested in continuing to use the
>> classic/SequentialParser, or at least having it available as a back-off
>> parser for corrupt pdfs [0].
>
> Using the old parser really isn't a good idea, it's known to be pretty
> broken. I think that we would be much better off making sure the new parser
> can handle truncated files. We already do a lot of repair in the new parser,
> so this doesn't seem like too much work? Maybe Andreas can comment further?

The biggest issue here is the truncated stream or dictionary. The current version simply throws an exception when running into such constellations. We have to implement some algorithm to ignore such incomplete parts of a PDF if possible.

BR
Andreas

>
> Do we have some JIRA issues which identify some of these cases?
>
> -- John
>
>> Would you be willing to distribute a shaded/relocated 1.8.x app so that we
>> could load both 1.8.x and 2.0.0 in the same jvm without collisions? Or, is
>> there a better solution?
>
> I wouldn't recommend doing that, because you're going to be stuck with using
> 1.8 for everything, not just parsing, at least as far as corrupt/truncated
> files are concerned.
>
> -- John
>
>> Thank you!
>>
>> Cheers,
>>
>> Tim
>>
>> [0]
>> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
RE: shading/relocating 1.8.x?
See: https://issues.apache.org/jira/browse/TIKA-1285?focusedCommentId=15214111&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15214111

-Original Message- From: John Hewson [mailto:j...@jahewson.com] Sent: Friday, March 25, 2016 1:03 PM To: dev@pdfbox.apache.org Subject: Re: shading/relocating 1.8.x?

> On 25 Mar 2016, at 09:44, Tilman Hausherr wrote:
>
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
>
> https://issues.apache.org/jira/browse/PDFBOX-3265

Great! Does anyone else have some others?

-- John
RE: shading/relocating 1.8.x?
Hi John,

Normally, I'd agree. And, yes, I've been extremely grateful for the effort put into dealing with noisy PDFs in 2.0.0. However, I think that the Tika user requesting this is interested in getting what he can from truncated and truly broken files -- e.g. Common Crawl data which (I think) truncates files at 1MB or may have had an interrupt during download.

My basic rule for opening an issue is: if AR or another PDF parser can't parse it, I'm not going to ask for help. I wouldn't want to direct your all's efforts to dealing with the edge cases of truncated files. If the old PDFParser is able to get something out because it parsed sequentially, then it would be neat to have that available with very little effort.

In Tika, we envision allowing users to configure combinations of parsers for a given file; this would be the perfect case for the back-off-on-exception strategy: if there's an exception with 2.0.0, try again with 1.8.x.

I'll try shading/relocating next week, and see whether that works as expected.

Thank you, all, again!

Cheers, Tim

-Original Message- From: John Hewson [mailto:j...@jahewson.com] Sent: Friday, March 25, 2016 1:03 PM To: dev@pdfbox.apache.org Subject: Re: shading/relocating 1.8.x?

> On 25 Mar 2016, at 09:44, Tilman Hausherr wrote:
>
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
>
> https://issues.apache.org/jira/browse/PDFBOX-3265

Great! Does anyone else have some others?

-- John
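The back-off-on-exception strategy described in this message (try the 2.0.0 parser first, and only on an exception try again with 1.8.x) can be sketched roughly as follows. This is a minimal illustration using only the JDK: `TextParser`, `AllParsersFailedException`, and `parseWithBackoff` are hypothetical names made up for the sketch, not Tika or PDFBox APIs, and the real parsers would have to be wrapped behind such an interface (for instance after shading/relocating 1.8.x).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: none of these types exist in Tika or PDFBox. */
final class BackoffParsing {

    /** A parser that either returns the extracted text or throws. */
    interface TextParser {
        String parse(byte[] pdfBytes) throws Exception;
    }

    /** Thrown only when every parser in the chain has failed. */
    static final class AllParsersFailedException extends Exception {
        AllParsersFailedException(String message, Throwable lastCause) {
            super(message, lastCause);
        }
    }

    /**
     * Tries each parser in order and returns the first successful result.
     * A later parser runs only if the one before it threw an exception.
     */
    static String parseWithBackoff(byte[] pdfBytes, List<TextParser> parsers)
            throws AllParsersFailedException {
        List<Exception> failures = new ArrayList<>();
        for (TextParser parser : parsers) {
            try {
                return parser.parse(pdfBytes);
            } catch (Exception e) {
                // Remember the failure and fall through to the next parser.
                failures.add(e);
            }
        }
        Exception last = failures.isEmpty() ? null : failures.get(failures.size() - 1);
        AllParsersFailedException all =
                new AllParsersFailedException(failures.size() + " parser(s) failed", last);
        // Keep the earlier failures visible for debugging.
        for (int i = 0; i < failures.size() - 1; i++) {
            all.addSuppressed(failures.get(i));
        }
        throw all;
    }
}
```

With a list such as `Arrays.asList(newParser, oldParser)`, the old parser is consulted only for files on which the new one throws, so well-formed files never touch the fallback.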
shading/relocating 1.8.x?
All,

We've upgraded to 2.0.0 on Tika. Many thanks again!

One of our users is interested in continuing to use the classic/SequentialParser, or at least having it available as a back-off parser for corrupt pdfs [0].

Would you be willing to distribute a shaded/relocated 1.8.x app so that we could load both 1.8.x and 2.0.0 in the same jvm without collisions? Or, is there a better solution?

Thank you!

Cheers, Tim

[0] https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
RE: The Apache® Software Foundation announces Apache PDFBox™ v2.0
Congratulations! And, thank you!

Cheers, Tim

-Original Message- From: Andreas Lehmkühler [mailto:andr...@lehmi.de] Sent: Monday, March 21, 2016 10:11 AM To: us...@pdfbox.apache.org Subject: Fwd: The Apache® Software Foundation announces Apache PDFBox™ v2.0

Original Message
From: Sally Khudairi
Sent: 21 March 2016 12:44:18 CET
To: Apache Announce List
Subject: The Apache® Software Foundation announces Apache PDFBox™ v2.0

>> this announcement is available online at https://s.apache.org/Ly9B

Milestone release of Open Source Java tool for working with PDF documents features dozens of improvements and enhancements

Forest Hill, MD -- 21 March 2016 -- The Apache Software Foundation (ASF), the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today the availability of Apache® PDFBox™ v2.0, the Open Source Java tool for working with Portable Document Format (PDF) documents.

PDF was first released by Adobe Systems in 1993, and became an ISO International Standard - ISO 32000-1 in 2008. Apache PDFBox allows for the creation of new PDF documents, manipulation, rendering, signing of existing documents and the ability to extract content from documents. In addition, PDFBox includes several command line utilities. In February 2015, the project became the first Open Source Partner Organization of the PDF Association.

"PDF is a very popular and easy to use format for document exchange. It is used by millions of people every day, however the format itself is quite complicated and a real challenge to write a piece of software to work with it," said Andreas Lehmkühler, Vice President of Apache PDFBox. "This new major release of PDFBox includes a lot of improvements, fixes and new features which should make the life easier for our users."

Under The Hood

The Apache PDFBox library enables users to create new PDF documents, manipulate existing documents, extract content, digitally sign, print, and validate files against the PDF/A-1b standard. Its command line utilities include encrypt, decrypt, overlay, debugger, merger, PDFToImage, and TextToPDF.

PDFBox v2.0 reflects 1,167 solved issues, 418 of which were back-ported to v1.8, as well as dozens of improvements and enhancements. Highlights include:
- improved rendering and text extraction
- Unicode support for PDF creation
- overhauled interactive forms support
- extended signing and encryption support
- overhauled parser including a self-healing mechanism for malformed or corrupted PDFs
- reduced memory/resources footprint including fine grained control of memory usage
- enhanced preflight module for PDF/A-1b conformance checking
- rearranged package structure to allow smaller runtime environments

A guide to migrating to v2.0 is available at http://pdfbox.apache.org/2.0/migration.html , with community support at http://pdfbox.apache.org/mailinglists.html

"We thank all the people from our small but fine community for their support," explained Lehmkühler. "Special thanks also goes to our fellow colleagues from the Apache Tika project for their cooperation in stress-testing with a corpus of 250,000 PDF files."

"We are grateful for the Google Summer of Code program," said PDFBox committer Tilman Hausherr. "The project allowed us to hire students to improve 3D rendering and the PDFDebugger stand-alone application, which also sped up our own bug finding."

"Apache PDFBox v2.0 is a significant milestone as it took us several years to complete," added Lehmkühler. "This long-awaited release is the collective achievement of more than 150 individuals who have contributed code to date. Without their frequent contributions it wouldn't be possible to drive a project like PDFBox."

Availability and Oversight

Apache PDFBox software is released under the Apache License v2.0 and is overseen by a self-selected team of active contributors to the project. A Project Management Committee (PMC) guides the Project's day-to-day operations, including community development and product releases. For downloads, documentation, and ways to become involved with Apache PDFBox, visit http://pdfbox.apache.org/

About The Apache Software Foundation (ASF)

Established in 1999, the all-volunteer Foundation oversees more than 350 leading Open Source projects, including Apache HTTP Server --the world's most popular Web server software. Through the ASF's meritocratic process known as "The Apache Way," more than 550 individual Members and 5,300 Committers successfully collaborate to develop freely available enterprise-grade software, benefiting millions of users worldwide: thousands of software solutions are distributed under the Apache License; and the community actively participates in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's official user conference, trainings, and expo. The ASF is a US 501(c)(3)
RE: roadmap for XMPBox?
Thank you, Beat. Yes, as one of our devs pointed out, we're already using that in Tika, in our XMP module for writing XMP... we haven't looked into using it for extraction.

-Original Message- From: Beat Weisskopf [mailto:weissk...@glue.ch] Sent: Friday, March 11, 2016 3:40 AM To: dev@pdfbox.apache.org Subject: Re: roadmap for XMPBox?

Hi all

As a third option: What about the BSD-licensed Adobe XMP Toolkit? At least veraPDF seems to use a fork of it: https://github.com/veraPDF/veraPDF-xmp

Cheers, beat

On 07.03.2016 at 19:31, Allison, Timothy B. wrote:
> All,
>
> When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from our current reliance on jempbox to XMPBox. I recently extracted ~70k XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were exceptions on roughly 40% of the XMPs.
>
> I’m including a table below of the counts of exception messages. Are there any plans to make XMPBox more lenient or is this what we can expect going forward?
>
> As always, I’m more than happy to help with files and tests. Let me know what I can do.
>
> Cheers,
>
> Tim
>
> [...]
RE: roadmap for XMPBox?
> The comment I made is just my personal opinion. ... Maybe improve XMPBox as you suggested (I did have a look but it doesn't seem easy).

Oh, ok, so it isn't necessarily set in stone. What do other PDFBox devs think? Is there interest in modifying XmpBox to be more lenient? Not for 2.0.0, obviously... :)

-Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, March 08, 2016 12:56 PM To: dev@pdfbox.apache.org Subject: Re: roadmap for XMPBox?

On 08.03.2016 at 18:44, Allison, Timothy B. wrote:
> Got it. Thank you. I wanted to confirm that nothing had changed since last summer (PDFBOX-2855).
>
> Are you taking bug reports for jempbox or is that entirely eol'd?

Yes, I recently fixed a bug there.

> Any recommendations for a somewhat lenient, Apache license-compatible XMP parser?

Sorry, don't know.

> Might it make sense to include in the README or in the package javadocs something about the goals for XmpBox? It is entirely possible that I missed the warning. ;)

The comment I made is just my personal opinion. It's your comment that made me realize that with XMPBox, we can't parse some files that are not PDF/A compatible but are correct XMP files. I don't have an idea what to do. Maybe improve XMPBox as you suggested (I did have a look but it doesn't seem easy). Maybe resurrect Jempbox, or use the 1.8 version.

Tilman

> Thank you, again.
>
> Best,
>
> Tim
>
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de]
> Sent: Tuesday, March 08, 2016 12:13 PM
> To: dev@pdfbox.apache.org
> Subject: Re: roadmap for XMPBox?
>
> I think the problem is that XmpBox was written for PDF/A checking, so it fails with XMPs that are not PDF/A. For example, file 000142.pdf has the schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>
> And no, there are no plans for anything on XMP at this time...
>
> Tilman
>
> On 07.03.2016 at 19:31, Allison, Timothy B. wrote:
>> [...]
RE: roadmap for XMPBox?
Got it. Thank you. I wanted to confirm that nothing had changed since last summer (PDFBOX-2855).

Are you taking bug reports for jempbox or is that entirely eol'd?

Any recommendations for a somewhat lenient, Apache license-compatible XMP parser?

Might it make sense to include in the README or in the package javadocs something about the goals for XmpBox? It is entirely possible that I missed the warning. ;)

Thank you, again.

Best, Tim

-Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, March 08, 2016 12:13 PM To: dev@pdfbox.apache.org Subject: Re: roadmap for XMPBox?

I think the problem is that XmpBox was written for PDF/A checking, so it fails with XMPs that are not PDF/A. For example, file 000142.pdf has the schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf

And no, there are no plans for anything on XMP at this time...

Tilman

On 07.03.2016 at 19:31, Allison, Timothy B. wrote:
> All,
>
> When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from our current reliance on jempbox to XMPBox. I recently extracted ~70k XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were exceptions on roughly 40% of the XMPs.
>
> I’m including a table below of the counts of exception messages. Are there any plans to make XMPBox more lenient or is this what we can expect going forward?
>
> As always, I’m more than happy to help with files and tests. Let me know what I can do.
>
> Cheers,
>
> Tim
>
> [...]
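The difference discussed in this thread, between a parser built for PDF/A checking that rejects any namespace it has no definition for and a more lenient one that keeps going, can be illustrated with a small sketch. Everything here is an assumption for illustration: `NamespacePolicySketch`, its `filter` method, and the `strict` flag are hypothetical and do not describe how XMPBox is actually implemented; only the two namespace URIs and the wording of the error message are taken from the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of strict vs. lenient namespace handling; not XMPBox code. */
final class NamespacePolicySketch {

    /** Raised in strict mode when a property uses an unregistered namespace. */
    static final class UnknownNamespaceException extends Exception {
        UnknownNamespaceException(String namespace) {
            super("Cannot find a definition for the namespace " + namespace);
        }
    }

    private final Set<String> knownNamespaces;
    private final boolean strict;

    NamespacePolicySketch(Set<String> knownNamespaces, boolean strict) {
        this.knownNamespaces = knownNamespaces;
        this.strict = strict;
    }

    /**
     * Takes a map of property name to namespace URI and returns the properties
     * that are accepted. Strict mode aborts on the first unknown namespace;
     * lenient mode keeps the property and moves on.
     */
    Map<String, String> filter(Map<String, String> propertyToNamespace)
            throws UnknownNamespaceException {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : propertyToNamespace.entrySet()) {
            String namespace = entry.getValue();
            if (strict && !knownNamespaces.contains(namespace)) {
                throw new UnknownNamespaceException(namespace);
            }
            kept.put(entry.getKey(), namespace);
        }
        return kept;
    }
}
```

In this sketch, strict mode would suit PDF/A validation, where an undeclared schema is itself the finding, while lenient mode would suit metadata extraction, where a partial result is more useful than an exception.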
RE: roadmap for XMPBox?
XLSX summary and 89MB of XMPs available here: http://162.242.228.174/xmp_work/

-Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, March 07, 2016 1:31 PM To: dev@pdfbox.apache.org Subject: roadmap for XMPBox?

[...]
roadmap for XMPBox?
All,

When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from our current reliance on jempbox to XMPBox. I recently extracted ~70k XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were exceptions on roughly 40% of the XMPs.

I’m including a table below of the counts of exception messages. Are there any plans to make XMPBox more lenient or is this what we can expect going forward?

As always, I’m more than happy to help with files and tests. Let me know what I can do.

Cheers,

Tim

No XmpParsingException on 42,022 files.

Exceptions:

Count  Exception message
13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
 3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
 3113  Missing pdfaSchema:property in type definition
 2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
  927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
  723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
  710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
  654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
  522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
  492  Failed to parse
  370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
  262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
  188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
  144  Failed to instanciate property in xmp:CreateDate
  125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
   94  Expecting local name 'xmpmeta' and found 'xapmeta'
   84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
   74  Failed to instanciate property in xap:CreateDate
   68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
   49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
   46  Cannot find a definition for the namespace http://www.sap.com
   33  Failed to instanciate property in exif:ColorSpace
   28  Failed to instanciate property in xmpMM:History
   26  xmp should start with a processing instruction
   24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
   21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
   14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
   14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
   12  Failed to instanciate property in xmp:MetadataDate
   10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
   10  Failed to instanciate property in xap:ModifyDate
   10  Failed to instanciate property in xmp:ModifyDate
    9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
    8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
    8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
    7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
    7  Cannot find a definition for the namespace ptc
    6  Failed to instanciate property in xapMM:History
    5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
    5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
    4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
    4  Excepted xpacket 'end' attribute (must be present and placed in first)
    3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
    3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
    2  no message (NPE)
    2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
    2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
    2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
    2  Failed to instanciate property in xapRights:Marked
    2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
    2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
    2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
    1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
    1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
Cannot f
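A table of exception-message counts like the one in this message can be produced by running a check over every input, catching whatever it throws, and tallying by message. The sketch below shows one way to do that with only the JDK; it is not the code that generated the numbers above. `ExceptionTally`, `Check`, and `tally` are hypothetical names, and in practice the check would wrap whatever parser call is under test (here, presumably the XMPBox parser).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: tally exception messages over a batch of inputs. */
final class ExceptionTally {

    /** The operation under test; it may throw for bad input. */
    interface Check<T> {
        void run(T input) throws Exception;
    }

    /**
     * Runs the check on every input and returns (message, count) rows,
     * most frequent first, with ties broken alphabetically by message.
     */
    static <T> List<Map.Entry<String, Integer>> tally(Iterable<T> inputs, Check<T> check) {
        Map<String, Integer> counts = new HashMap<>();
        for (T input : inputs) {
            try {
                check.run(input);
            } catch (Exception e) {
                // Exceptions without a message are grouped by their class name.
                String key = e.getMessage() != null
                        ? e.getMessage()
                        : "no message (" + e.getClass().getSimpleName() + ")";
                counts.merge(key, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return rows;
    }
}
```

Grouping on the raw message works here because most of the messages embed only a namespace or property name; messages that embed per-file details would need normalizing first.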
RE: [VOTE] Release Apache PDFBox 1.8.11
Turns out there are the same exceptions with those combinations of Java versions and OS for 1.8.10.

-Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, January 12, 2016 1:49 PM To: dev@pdfbox.apache.org Subject: RE: [VOTE] Release Apache PDFBox 1.8.11

Ah, ok. Thank you. With the following on Linux, all is well:

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

The test failures were with:

Linux:
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

and Windows:
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

-Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Tuesday, January 12, 2016 1:33 PM To: dev@pdfbox.apache.org Subject: Re: [VOTE] Release Apache PDFBox 1.8.11

Hmmm, everything works fine for me after a fresh checkout, at least on Linux. Maybe some issue with the JDK? Which one are you using for your tests? I ran into some problems (test failures during rendering) whenever using the OpenJDK which comes with Fedora by default. Those disappear once I switch to the Oracle JDK.

BR, Andreas

On 12.01.2016 at 18:58, Allison, Timothy B. wrote:
> All,
>
> Is this user error? I'm getting 3 test exceptions in both Windows and Linux in the preflight module after I did an svn checkout from:
> http://svn.apache.org/repos/asf/pdfbox/tags/1.8.11/
>
> Revision: 1724292
> Node Kind: directory
> Schedule: normal
> Last Changed Author: lehmi
> Last Changed Rev: 1724120
> Last Changed Date: 2016-01-11 14:26:57 -0500 (Mon, 11 Jan 2016)
>
> In RHEL:
> Failed tests:
>   testAllInfoSynhcronized(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation): expected:<0> but was:<2>
>   testBadPrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation): null expected:<7.[4.]2> but was:<7.[]2>
>   testdoublePrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation)
>
> Tests run: 72, Failures: 3, Errors: 0, Skipped: 0
>
> In Windows:
> "C:\Program Files\Java\jdk1.8\bin\java" true
> System property 'pdfa.invalid' not defined, will not run TestValidaDirectory
> TestIsartorValidationFromClasspath2.initializeParameters(): No input files found
> System property 'pdfa.valid' not defined, will not run TestValidaDirectory
>
> junit.framework.AssertionFailedError:
> Expected :0
> Actual :2
>
>   at junit.framework.Assert.fail(Assert.java:47)
>   at junit.framework.Assert.failNotEquals(Assert.java:283)
>   at junit.framework.Assert.assertEquals(Assert.java:64)
>   at junit.framework.Assert.assertEquals(Assert.java:195)
>   at junit.framework.Assert.assertEquals(Assert.java:201)
>   at org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation.testAllInfoSynhcronized(TestSynchronizedMetadataValidation.java:422)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>   at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(Ru
RE: [VOTE] Release Apache PDFBox 1.8.11
Ah, ok. Thank you. With the following on linux, all is well: java version "1.8.0_66" Java(TM) SE Runtime Environment (build 1.8.0_66-b17) Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode) The test failures were with: Linux java version "1.7.0_75" OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13) OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode) and Windows: java version "1.8.0_66" Java(TM) SE Runtime Environment (build 1.8.0_66-b17) Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode) -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Tuesday, January 12, 2016 1:33 PM To: dev@pdfbox.apache.org Subject: Re: [VOTE] Release Apache PDFBox 1.8.11 Hmmm, everything works fine for me after a fresh checkout, at least on linux. Maybe some issue with the jdk? Which one are you using for your tests? I ran into some problems (test failures during rendering) whenever using the openjdk which comes with fedora by default. Those disappear once I switch to oracle jdk. BR, Andreas Am 12.01.2016 um 18:58 schrieb Allison, Timothy B.: > All, > > Is this user error? 
I'm getting 3 test exceptions in both Windows and > Linux in the preflight module after I did an svn checkout from: > http://svn.apache.org/repos/asf/pdfbox/tags/1.8.11/ > > Revision: 1724292 > Node Kind: directory > Schedule: normal > Last Changed Author: lehmi > Last Changed Rev: 1724120 > Last Changed Date: 2016-01-11 14:26:57 -0500 (Mon, 11 Jan 2016) > > > In RHEL: > Failed tests: > testAllInfoSynhcronized(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation): > expected:<0> but was:<2> > > testBadPrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation): > null expected:<7.[4.]2> but was:<7.[]2> > > testdoublePrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynch > ronizedMetadataValidation) > > Tests run: 72, Failures: 3, Errors: 0, Skipped: 0 > > In Windows: > "C:\Program Files\Java\jdk1.8\bin\java" true System property > 'pdfa.invalid' not defined, will not run TestValidaDirectory > TestIsartorValidationFromClasspath2.initializeParameters(): No input > files found System property 'pdfa.valid' not defined, will not run > TestValidaDirectory > > junit.framework.AssertionFailedError: > Expected :0 > Actual :2 > > > > > at junit.framework.Assert.fail(Assert.java:47) > at junit.framework.Assert.failNotEquals(Assert.java:283) > at junit.framework.Assert.assertEquals(Assert.java:64) > at junit.framework.Assert.assertEquals(Assert.java:195) > at junit.framework.Assert.assertEquals(Assert.java:201) > at > org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation.testAllInfoSynhcronized(TestSynchronizedMetadataValidation.java:422) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) > at org.junit.runners.ParentRunner.run(ParentRunner.java:236) > at org.junit.runners.Suite.runChild(Suite.java:128) > at org.junit.runners.Suite.runChild(Suite.java:24) > at org.junit.runners.ParentRunner$3.run(Pare
RE: [VOTE] Release Apache PDFBox 1.8.11
All, Is this user error? I'm getting 3 test exceptions in both Windows and Linux in the preflight module after I did an svn checkout from: http://svn.apache.org/repos/asf/pdfbox/tags/1.8.11/ Revision: 1724292 Node Kind: directory Schedule: normal Last Changed Author: lehmi Last Changed Rev: 1724120 Last Changed Date: 2016-01-11 14:26:57 -0500 (Mon, 11 Jan 2016) In RHEL: Failed tests: testAllInfoSynhcronized(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation): expected:<0> but was:<2> testBadPrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation): null expected:<7.[4.]2> but was:<7.[]2> testdoublePrefixSchemas(org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation) Tests run: 72, Failures: 3, Errors: 0, Skipped: 0 In Windows: "C:\Program Files\Java\jdk1.8\bin\java" true System property 'pdfa.invalid' not defined, will not run TestValidaDirectory TestIsartorValidationFromClasspath2.initializeParameters(): No input files found System property 'pdfa.valid' not defined, will not run TestValidaDirectory junit.framework.AssertionFailedError: Expected :0 Actual :2 at junit.framework.Assert.fail(Assert.java:47) at junit.framework.Assert.failNotEquals(Assert.java:283) at junit.framework.Assert.assertEquals(Assert.java:64) at junit.framework.Assert.assertEquals(Assert.java:195) at junit.framework.Assert.assertEquals(Assert.java:201) at org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation.testAllInfoSynhcronized(TestSynchronizedMetadataValidation.java:422) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.junit.runners.Suite.runChild(Suite.java:128) at org.junit.runners.Suite.runChild(Suite.java:24) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.junit.runner.JUnitCore.run(JUnitCore.java:157) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:234) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) junit.framework.ComparisonFailure: null Expected :7.4.2 Actual :7.2 at junit.framework.Assert.assertEquals(Assert.java:81) at junit.framework.Assert.assertEquals(Assert.java:87) at org.apache.pdfbox.preflight.metadata.TestSynchronizedMetadataValidation.testBadPrefixSchemas(TestSynchronizedMetadataValidation.java:499) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.
comparison of 1.8.10 and 2.0 trunk
All, Apologies for the delay. I finally finished the comparison of text extracted from 100k pdfs with 1.8.10 and 2.0 trunk (pdfbox-2.0.0-20151022.051152-1783). The reports are available here [0]. I botched the commit message... I haven't had a chance to review the results. The eval code is still in development and there might be bugs! To view the docs, prepend: h t t p : slash slash one six two . two four two . two two eight . one seven four/docs/ ... just don't let any of the scrapers read that. ;) The docs include all those within our corpus that had a rtl word (when extracted with 1.8.10 :)) and then I took a random selection to fill out ~100k pdfs from common crawl and govdocs1. Let me know if you have any questions. Cheers, Tim [0] https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip
RE: Subclassing BaseParser?
Nope, not missing anything...that did it, of course. Sorry. Seems like more overhead than we need for this use, but that works. Will go with that. Thank you! -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Monday, October 05, 2015 3:07 PM To: dev@pdfbox.apache.org Subject: Re: Subclassing BaseParser? John did that one and he's very sensitive on making stuff public. What prevents you from extending COSParser as in that example code I posted at that time? Or am I missing something, i.e. was this for something different? Tilman On 05.10.2015 at 13:25, Allison, Timothy B. wrote: > > [switching to dev because this is entering into dev land] > > Y, I did and do have it working for the 1.8.x branch. I either had it > working for the 2.0 branch before the change to SequentialSource was > made, or there's a chance that I never got around to integrating it > into our dev wrapper for 2.0. :( Happy to be back working on 2.0, though! > > Is there any chance of making SequentialSource and its friends public > or possibly offering a RandomAccessRead constructor for BaseParser? > Or, is there another cleaner solution to allow subclassing of > BaseParser outside of o.a.p.pdfparser? > > Plan D: move the "fixing" of metadata strings that are improperly > PDFEncoded into PDFBox. > > Thank you! > > Best, > > Tim > > *From:*Tilman Hausherr [mailto:thaush...@t-online.de] > *Sent:* Sunday, October 04, 2015 8:34 AM > *To:* us...@pdfbox.apache.org > *Subject:* Re: Subclassing BaseParser? > > On 03.10.2015 at 21:13, Allison, Timothy B. wrote: > > All, > >I'm probably suffering from the same failure that led to > (https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14640370&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640370), > but is it possible to subclass BaseParser outside of the oap.pdfparser > package? 
> >The actual subclassing of BaseParser is no problem, but what can I > substitute for SequentialSource, given that it and RandomAccessSource are > package-private? > > > But later in that issue, you wrote that "all is well", so I didn't > bother. But it is true that currently, BaseParser can only be extended > within its package, due to RandomAccessSource and SequentialSource. > There's even a netbeans warning because of that. > > > - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
RE: Subclassing BaseParser?
[switching to dev because this is entering into dev land] Y, I did and do have it working for the 1.8.x branch. I either had it working for the 2.0 branch before the change to SequentialSource was made, or there's a chance that I never got around to integrating it into our dev wrapper for 2.0. :( Happy to be back working on 2.0, though! Is there any chance of making SequentialSource and its friends public or possibly offering a RandomAccessRead constructor for BaseParser? Or, is there another cleaner solution to allow subclassing of BaseParser outside of o.a.p.pdfparser? Plan D: move the "fixing" of metadata strings that are improperly PDFEncoded into PDFBox. Thank you! Best, Tim From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Sunday, October 04, 2015 8:34 AM To: us...@pdfbox.apache.org Subject: Re: Subclassing BaseParser? On 03.10.2015 at 21:13, Allison, Timothy B. wrote: All, I'm probably suffering from the same failure that led to (https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14640370&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640370), but is it possible to subclass BaseParser outside of the oap.pdfparser package? The actual subclassing of BaseParser is no problem, but what can I substitute for SequentialSource, given that it and RandomAccessSource are package-private? But later in that issue, you wrote that "all is well", so I didn't bother. But it is true that currently, BaseParser can only be extended within its package, due to RandomAccessSource and SequentialSource. There's even a netbeans warning because of that.
RE: help debugging integration of PDFBox 2.0.0 trunk
>>Xmx doesn't limit native memory, so if there's a leak associated with AWT, >>ImageIO C libraries, or some other JNI library, the process can grow without >>limit. Such a leak could be due to a bug, or us not calling close() somewhere. Got it. Ok. Is there anything I can do to help figure out what's going on?
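To rule out the "not calling close() somewhere" case, one option is to wrap every closeable object in try-with-resources so that close() runs even when parsing throws. The sketch below uses a hypothetical stand-in class, `NativeBackedResource`, with an open-instance counter; it is not PDFBox code. The same pattern should apply to `PDDocument`, which implements `Closeable`.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CloseDemo {

    // Hypothetical stand-in for an object that holds native (non-heap) memory.
    static final class NativeBackedResource implements AutoCloseable {
        // Counts instances that were opened but not yet closed.
        static final AtomicInteger OPEN = new AtomicInteger();

        NativeBackedResource() {
            OPEN.incrementAndGet();
        }

        void use(boolean fail) {
            if (fail) {
                throw new IllegalStateException("simulated parse failure");
            }
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    // try-with-resources guarantees close() on both normal and exceptional exit.
    static void process(boolean fail) {
        try (NativeBackedResource resource = new NativeBackedResource()) {
            resource.use(fail);
        }
    }

    public static void main(String[] args) {
        process(false);
        try {
            process(true);
        } catch (IllegalStateException expected) {
            // The resource was still closed before the exception propagated.
        }
        System.out.println("still open: " + NativeBackedResource.OPEN.get());
    }
}
```

A counter like `OPEN` is only a debugging aid; if it grows over a long batch run, some code path is skipping close().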
RE: help debugging integration of PDFBox 2.0.0 trunk
With ~125k files, there were 10 restarts: 7x with exit code=137, 2x with exit code=1, and 1x with exit code=253 (a timeout for: 26.pdf). A restart happens roughly every 8-10 minutes.
502907 2015-07-20 17:13:24,420 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Must restart process (exitValue=137 numRestarts=0 receivedRestartMessage=false)
986787 2015-07-20 17:21:28,300 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Must restart process (exitValue=253 numRestarts=1 receivedRestartMessage=false)
1574818 2015-07-20 17:31:16,331 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Must restart process (exitValue=137 numRestarts=2 receivedRestartMessage=false)
2040741 2015-07-20 17:39:02,254 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Must restart process (exitValue=137 numRestarts=3 receivedRestartMessage=false)
2545702 2015-07-20 17:47:27,215 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Must restart process (exitValue=137 numRestarts=4 receivedRestartMessage=false)
3084672 2015-07-20 17:56:26,185 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Must restart process (exitValue=137 numRestarts=5 receivedRestartMessage=false)
3571616 2015-07-20 18:04:33,129 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Must restart process (exitValue=1 numRestarts=6 receivedRestartMessage=false)
4021342 2015-07-20 18:12:02,855 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Must restart process (exitValue=1 numRestarts=7 receivedRestartMessage=false)
4503161 2015-07-20 18:20:04,674 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Must restart process (exitValue=137 numRestarts=8 receivedRestartMessage=false)
4958976 2015-07-20 18:27:40,489 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Must restart process (exitValue=137 numRestarts=9 receivedRestartMessage=false)
5437962 2015-07-20 18:35:39,475 [main] WARN org.apache.tika.batch.BatchProcessDriverCLI - Hit the maximum number of process
restarts. Driver is shutting down now. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, July 20, 2015 3:18 PM To: dev@pdfbox.apache.org Subject: RE: help debugging integration of PDFBox 2.0.0 trunk Y, sorry, Tilman. I'm not running into problems with 1.8.9 and straight text extraction, though. Following Timo's recommendation...looks like a memory issue. Let me know if I should post the full file or move to a more recent version of Java. :) # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (malloc) failed to allocate 403177472 bytes for committing reserved memory. # Possible reasons: # The system is out of physical RAM or swap space ... # Out of Memory Error (os_linux.cpp:2798), pid=14958, tid=140419564971776 ... vm_info: OpenJDK 64-Bit Server VM (24.75-b04) for linux-amd64 JRE (1.7.0_75-b13), built on Jan 16 2015 09:15:47 by "mockbuild" with gcc 4.8.2 20140120 (Red Hat 4.8.2-16) -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Monday, July 20, 2015 1:28 PM To: dev@pdfbox.apache.org Subject: Re: help debugging integration of PDFBox 2.0.0 trunk Am 20.07.2015 um 18:12 schrieb Allison, Timothy B.: > All, >While integrating 2.0.0 trunk into Tika and running against govdocs1, I'm > finding two issues that are difficult to reproduce. > > Background: > Tika-batch has a parent process that kicks off a Tika processor in a child > process, if that dies unexpectedly, the parent kicks it off again. I'm > running with 10 consumer/parser threads and -Xmx5g on an (8 cpu/8GB vm); RHEL > 7, Linux cloud-server-02 3.10.0-123.20.1.el7.x86_64 #1 SMP Wed Jan 21 > 09:45:55 EST 2015 x86_64 x86_64 x86_64 GNU/Linux) > > Two problems: > > 1) The child process exits with value 1. I'm catching Throwable around > the primary execution call in the child process and logging it; nothing shows > up in the log files from that part of the code. 
From the parser log files (at > trace), I can tell which 10 files were being processed at the time, but I'm > not seeing any other information about what caused the exit. When I run > against just those 10 files, all is ok. > > 2) The OS is killing the child far more often than it does with 1.8.9 > (exit code 137). > > For the second problem, I'll wait until the optimizations to the caching are > completed before I start worrying about that. However, do you have any > recommendations on how to figure out what's going on with 1)? I'm also having some problem with that system... with my test software, I have observed that java uses more and more space, despite it being told not to use more than a certain amount with -Xmx. After some time, the "process killer" kills the a
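The exit values in the log above can be read with the common Unix convention that a value above 128 means "terminated by signal (value - 128)", so 137 corresponds to signal 9 (SIGKILL), which is what a kernel out-of-memory kill would produce. The helper below is a rough sketch of that decoding. `ExitCodes` is a made-up name, and the treatment of 253 as an application-defined code is taken from the message above rather than from any Tika documentation.

```java
final class ExitCodes {

    // Decodes a child exit value using the 128 + signal convention.
    // Only the range 129..192 is treated as "killed by signal" in this sketch,
    // so application-defined codes such as 253 are reported unchanged.
    static String describe(int exitValue) {
        if (exitValue == 0) {
            return "normal exit";
        }
        if (exitValue > 128 && exitValue <= 128 + 64) {
            int signal = exitValue - 128;
            String name;
            if (signal == 9) {
                name = "SIGKILL";
            } else if (signal == 15) {
                name = "SIGTERM";
            } else {
                name = "signal " + signal;
            }
            return "killed by " + name;
        }
        return "application exit code " + exitValue;
    }

    public static void main(String[] args) {
        for (int code : new int[] {137, 253, 1}) {
            System.out.println(code + " -> " + describe(code));
        }
    }
}
```

Under this reading, the 137 exits point at the OS killing the process, while the exits with value 1 come from inside the JVM and need a different investigation.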
RE: help debugging integration of PDFBox 2.0.0 trunk
Y, sorry, Tilman. I'm not running into problems with 1.8.9 and straight text extraction, though. Following Timo's recommendation...looks like a memory issue. Let me know if I should post the full file or move to a more recent version of Java. :) # There is insufficient memory for the Java Runtime Environment to continue. # Native memory allocation (malloc) failed to allocate 403177472 bytes for committing reserved memory. # Possible reasons: # The system is out of physical RAM or swap space ... # Out of Memory Error (os_linux.cpp:2798), pid=14958, tid=140419564971776 ... vm_info: OpenJDK 64-Bit Server VM (24.75-b04) for linux-amd64 JRE (1.7.0_75-b13), built on Jan 16 2015 09:15:47 by "mockbuild" with gcc 4.8.2 20140120 (Red Hat 4.8.2-16) -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Monday, July 20, 2015 1:28 PM To: dev@pdfbox.apache.org Subject: Re: help debugging integration of PDFBox 2.0.0 trunk Am 20.07.2015 um 18:12 schrieb Allison, Timothy B.: > All, >While integrating 2.0.0 trunk into Tika and running against govdocs1, I'm > finding two issues that are difficult to reproduce. > > Background: > Tika-batch has a parent process that kicks off a Tika processor in a child > process, if that dies unexpectedly, the parent kicks it off again. I'm > running with 10 consumer/parser threads and -Xmx5g on an (8 cpu/8GB vm); RHEL > 7, Linux cloud-server-02 3.10.0-123.20.1.el7.x86_64 #1 SMP Wed Jan 21 > 09:45:55 EST 2015 x86_64 x86_64 x86_64 GNU/Linux) > > Two problems: > > 1) The child process exits with value 1. I'm catching Throwable around > the primary execution call in the child process and logging it; nothing shows > up in the log files from that part of the code. From the parser log files (at > trace), I can tell which 10 files were being processed at the time, but I'm > not seeing any other information about what caused the exit. When I run > against just those 10 files, all is ok. 
> > 2) The OS is killing the child far more often than it does with 1.8.9 > (exit code 137). > > For the second problem, I'll wait until the optimizations to the caching are > completed before I start worrying about that. However, do you have any > recommendations on how to figure out what's going on with 1)? I'm also having some problem with that system... with my test software, I have observed that java uses more and more space, despite it being told not to use more than a certain amount with -Xmx. After some time, the "process killer" kills the application. Seems something changed in java memory management: http://karunsubramanian.com/websphere/one-important-change-in-memory-management-in-java-8/ I did some investigation on this a few months ago, but gave up out of frustration. Tilman
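The parent/child arrangement described in the background (a parent that restarts the child whenever it dies, up to a maximum number of restarts) can be sketched roughly as follows. This is not Tika-batch's actual code: `RestartingDriver`, `ChildLauncher`, and the exact stop condition are assumptions for illustration. Only `ProcessBuilder` and its methods are real JDK API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class RestartingDriver {

    // Hypothetical abstraction: runs the child once and returns its exit value.
    interface ChildLauncher {
        int runOnce();
    }

    // Re-runs the child until it exits with 0 or maxRestarts restarts were used.
    // Returns every exit value seen, in order, so they can be tallied afterwards.
    static List<Integer> supervise(ChildLauncher launcher, int maxRestarts) {
        List<Integer> exitValues = new ArrayList<>();
        int numRestarts = 0;
        while (true) {
            int exitValue = launcher.runOnce();
            exitValues.add(exitValue);
            if (exitValue == 0) {
                return exitValues;
            }
            if (numRestarts >= maxRestarts) {
                System.err.println("Hit the maximum number of restarts; shutting down.");
                return exitValues;
            }
            System.err.println("Must restart process (exitValue=" + exitValue
                    + " numRestarts=" + numRestarts + ")");
            numRestarts++;
        }
    }

    // Launcher backed by a real OS process; the command is supplied by the caller.
    static ChildLauncher forCommand(List<String> command) {
        return () -> {
            try {
                return new ProcessBuilder(command).inheritIO().start().waitFor();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for child", e);
            }
        };
    }
}
```

Keeping the launcher behind an interface makes it possible to replay a recorded sequence of exit values (for example 137, 1, 0) without starting real processes.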