RE: PDFBox 2.0.9 release?

2018-03-15 Thread Allison, Timothy B.
> PDFBOX-4153 is solved. How about cutting the release next Monday? +1 and thank you! Tim

RE: PDFBox 2.0.9 release?

2018-03-12 Thread Allison, Timothy B.
Reports are available here: http://162.242.228.174/reports/pdfbox-2.0.9-pre-rc1_reports_2.tar.bz2

RE: PDFBox 2.0.9 release?

2018-03-12 Thread Allison, Timothy B.
> ok => Tim, please start again Will start now.

RE: PDFBox 2.0.9 release?

2018-03-09 Thread Allison, Timothy B.
I'm happy to run the regression tests again when all final changes for 2.0.9-RC1 are made. I'm really excited to be able to include jbig2. We'll start the Tika release process for 1.18 as soon as PDFBox 2.0.9 is available. Thank you, all! Cheers, Tim

RE: PDFBox 2.0.9 release?

2018-03-09 Thread Allison, Timothy B.
, 2018 3:52 PM To: dev@pdfbox.apache.org Subject: Re: PDFBox 2.0.9 release? Am 08.03.2018 um 21:35 schrieb Allison, Timothy B.: > I've kicked off an initial run with 2.0.9-SNAPSHOT on the regression corpus. > While I had some time, I wanted to see if there were any early indicators of >

RE: PDFBox 2.0.9 release?

2018-03-08 Thread Allison, Timothy B.
is really, truly ready for rc1. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, March 7, 2018 8:03 AM To: dev@pdfbox.apache.org Subject: RE: PDFBox 2.0.9 release? Argh. Sorry for my delay. Y. I have time, and I'm happy to help Tilman if he'd

RE: PDFBox 2.0.9 release?

2018-03-07 Thread Allison, Timothy B.
Argh. Sorry for my delay. Y. I have time, and I'm happy to help Tilman if he'd prefer to lead the regression testing process again. Cheers, Tim -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Monday, March 5, 2018 1:28 PM To:

RE: Running tika-eval on the Rackspace vm

2017-11-07 Thread Allison, Timothy B.
that was on the server. >> >> I'm thinking of creating a new "A" with 2.0.4 with current tika trunk >> and then compare with the "B" I did. >> >> Tilman >> >> >> Am 03.11.2017 um 22:14 schrieb Tilman Hausherr: >>> Am 03.11.20

RE: Running tika-eval on the Rackspace vm

2017-11-03 Thread Allison, Timothy B.
Tilman, Thank you for the toe-stubbing. I'm sorry that it wasn't easier... I created a new user with collab permissions and ran through the process. You are right about the privileges on the tmp directory... POI needs a tmp directory to write xlsx. I created a tmp directory in /work/eval

RE: Running tika-eval on the Rackspace vm

2017-11-01 Thread Allison, Timothy B.
Sorry. Fixed. -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, October 31, 2017 6:08 PM To: dev@pdfbox.apache.org Subject: Re: Running tika-eval on the Rackspace vm Am 31.10.2017 um 20:53 schrieb Allison, Timothy B.: >> It's not possible to

RE: Running tika-eval on the Rackspace vm

2017-10-31 Thread Allison, Timothy B.
> It's not possible to rename / remove the files / directories mentioned in > part 1 due to not having the permissions. Gah. Sorry. Tilman, I added you to "collab" and chgrp to collab on /work /data2/docs /data3/batch_runs and /data4/batch_runs. > The directory is named batch-apps, not

RE: Running tika-eval on the Rackspace vm

2017-10-31 Thread Allison, Timothy B.
version - is this the "good" version, so I could simply > download tika-app and put it there? Or just build tika with a specific > PDFBox version? > > Tilman > > Am 23.10.2017 um 20:54 schrieb Allison, Timothy B.: >> All, >> >> If anyone would l

RE: [VOTE] Release Apache PDFBox 2.0.8

2017-10-30 Thread Allison, Timothy B.
+1 Thank you, Andreas, Tilman, and team! Cheers, Tim -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Monday, October 30, 2017 3:57 PM To: dev@pdfbox.apache.org Subject: Re: [VOTE] Release Apache PDFBox 2.0.8 Am 30.10.2017 um 19:47 schrieb Andreas

RE: 2.0.8?

2017-10-27 Thread Allison, Timothy B.
cycles, > you might kick off another run. > > Andreas > > Am 23.10.2017 um 20:11 schrieb Allison, Timothy B.: >> Reports here: >> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take4.tar.gz >> >> I haven't looked yet. >> >> -Original

RE: 2.0.8?

2017-10-26 Thread Allison, Timothy B.
expect any new regressions, but if you have some cycles, you might kick off another run. Andreas Am 23.10.2017 um 20:11 schrieb Allison, Timothy B.: > Reports here: > http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take4.tar.gz > > I haven't looked yet. > > -Original M

Running tika-eval on the Rackspace vm

2017-10-23 Thread Allison, Timothy B.
All, If anyone would like to join the fun in running tika-eval on the Rackspace vm, I posted this: https://wiki.apache.org/tika/TikaEvalOnVM . You’ll need access to the vm, of course, but I’m happy to grant that to anyone who wants to chip in and help with regression tests. There are some

RE: 2.0.8?

2017-10-23 Thread Allison, Timothy B.
open regression in 2.0.8, Tilman's test run hasn't shown any regression. Please re-run your tests again to see if we can proceed with 2.0.8, I'd really like to push it out. TIA again, Andreas Am 08.10.2017 um 16:11 schrieb Andreas Lehmkuehler: > Am 03.10.2017 um 15:38 schrieb Allison, Timoth

RE: 2.0.8?

2017-10-23 Thread Allison, Timothy B.
-run your tests again to see if we can proceed with 2.0.8, I'd really like to push it out. TIA again, Andreas Am 08.10.2017 um 16:11 schrieb Andreas Lehmkuehler: > Am 03.10.2017 um 15:38 schrieb Allison, Timothy B.: >> >>> And yes, we need another regressions run if possible

RE: 2.0.8?

2017-10-10 Thread Allison, Timothy B.
> However, PDFBox 2.0.8-SNAPSHOT has more 0s, 1s, 2s and 3s... > > The TOP_10_MORE_IN_B column in the contents report shows that there are 15 > more 0's, 15 more 1's, 11 more 2's, etc. > > 0: 15 | 1: 15 | 2: 11 | 20: 5 | 3: 2 | 4: 2 > Yeah but where do they come from? Not from the pure text

RE: 2.0.8?

2017-10-10 Thread Allison, Timothy B.
: 5 | 3: 2 | 4: 2 -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, October 10, 2017 11:47 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.8? Am 09.10.2017 um 22:26 schrieb Allison, Timothy B.: > Thank you, Andreas, for fixing the slow parse on corr

RE: 2.0.8?

2017-10-10 Thread Allison, Timothy B.
] Sent: Tuesday, October 10, 2017 11:47 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.8? Am 09.10.2017 um 22:26 schrieb Allison, Timothy B.: > Thank you, Andreas, for fixing the slow parse on corrupt file so quickly! > > Reports are here: > http://162.242.228.174/reports/pdfbox_2_0_7_Vs

RE: 2.0.8?

2017-10-09 Thread Allison, Timothy B.
----- From: Allison, Timothy B. Sent: Monday, October 9, 2017 4:26 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.8? Thank you, Andreas, for fixing the slow parse on corrupt file so quickly! Reports are here: http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz -Original Message

RE: 2.0.8?

2017-10-09 Thread Allison, Timothy B.
Thank you, Andreas, for fixing the slow parse on corrupt file so quickly! Reports are here: http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, October 9, 2017 8:02 AM To: dev

RE: 2.0.8?

2017-10-09 Thread Allison, Timothy B.
Starting process now. -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Sunday, October 8, 2017 10:12 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.8? Am 03.10.2017 um 15:38 schrieb Allison, Timothy B.: > >> And yes, we need another regres

RE: 2.0.8?

2017-10-03 Thread Allison, Timothy B.
>And yes, we need another regressions run if possible Sounds good. Will do once I hear that we're good to go. Thank you! - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail:

RE: 2.0.8?

2017-10-03 Thread Allison, Timothy B.
>>Let me know when we're ready for another round. >I've already started ... RC2? No need for another regression run? Thank you again!

Re: 2.0.8?

2017-10-03 Thread Allison, Timothy B.
All, Again, my apologies for post-useful/late results! Ugh... Thank you, Andreas and Tilman! Let me know when we're ready for another round. Cheers, Tim -Original Message- From: Andreas Lehmkühler (JIRA) [mailto:j...@apache.org] Sent: Tuesday,

RE: 2.0.8?

2017-10-02 Thread Allison, Timothy B.
> Re 308576.pdf: the text extraction has a huge loss, but a manual check shows > it is identical. However that file has the NPE from PDActionURI.getURI(), > could it be that this results in an abort of text extraction? Same for 569017.pdf. Likely. There are two "per file pair contents" files.

RE: 2.0.8?

2017-10-02 Thread Allison, Timothy B.
Sorry all for taking longer than expected! File under "this information would have been useful..." ☹ -Original Message----- From: Allison, Timothy B. Sent: Monday, October 2, 2017 3:59 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.8? Reports are here: http://162.242.228.1

RE: 2.0.8?

2017-10-02 Thread Allison, Timothy B.
Reports are here: http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take2.tar.gz Looks like some new NPEs. I'll take a look at the metadata diffs. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, October 2, 2017 9:24 AM To: dev@pdfbox.apache.org

RE: 2.0.8?

2017-10-02 Thread Allison, Timothy B.
Sounds good. I kicked off the eval process yesterday, but because of a bug in our config-file reader and/or user error in modifying the config file, I wound up with 500k pdfs parsed by our EmptyParser... no results. I restarted the eval process just now. I should have results in 6 hours.

RE: 2.0.8?

2017-09-25 Thread Allison, Timothy B.
> I'd go for postponing in order to fix that regression - what about setting > the date to next Monday? +1 I’m happy pushing it out later if the fix happens >= Friday and we want to run the full regression tests again. Thank you, Andreas!

RE: 2.0.8?

2017-09-18 Thread Allison, Timothy B.
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8-SNAPSHOT_reports.tar.gz is now available. I haven't yet had a chance to look at either... -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, September 18, 2017 12:51 PM To: dev@pdfbox.apache.org

RE: 2.0.8?

2017-09-18 Thread Allison, Timothy B.
anything pending which should be included? > > How about cutting the release in a week or two from now? > > @Tim please run a test 2.0.7 vs. 2.0.8 if possible > > Andreas > > Am 11.09.2017 um 23:24 schrieb Allison, Timothy B.: >>> I hope there aren't any new

RE: 2.0.8?

2017-09-12 Thread Allison, Timothy B.
> because I'm ill but I expect to be my old self later this week. I'm sorry to hear it! I hope that you are feeling better soon! > I'd also like to have a test from version 2.0.4 compared to trunk because > 2.0.5 was the version where the tests weren't done, the problems were fixed in > 2.0.6

2.0.8?

2017-09-11 Thread Allison, Timothy B.
>I hope there aren't any new regressions. Happy to help find them! :) On a related note, do we have a sense of the schedule for PDFBox 2.0.8? I'd like to include it in Tika's last Java 7 release...end of Sept, middle of Oct., or whenever 2.0.8 is out. :) -Original Message- From:

RE: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively

2017-08-15 Thread Allison, Timothy B.
-terminal interactive form fields not handled recursively Hi Tim, > Am 15.08.2017 um 17:31 schrieb Allison, Timothy B. <talli...@mitre.org>: > > All, > I can't tell if the triggering file is corrupt or how we want to handle it > on the PDFBox side. The problem is

FW: [jira] [Commented] (TIKA-2442) Non-terminal interactive form fields not handled recursively

2017-08-15 Thread Allison, Timothy B.
All, I can't tell if the triggering file is corrupt or how we want to handle it on the PDFBox side. The problem is that the parent node is a PDTextField -- a PDTerminalField -- so we don't/can't look for children, even though it actually does have pointers in Kids. The output from

FW: Tika content detection and crawled "remote" content

2017-07-05 Thread Allison, Timothy B.
All, > If anyone is interested in using the detected MIME types or anything else > from Common Crawl - I'm happy to help! The URL index [4] contains now a new > field "mime-detected" which makes it easy to search or grep for confusion > pairs. This is an amazing step forward for sampling PDF

RE: tika-eval

2017-05-22 Thread Allison, Timothy B.
Ha. I hadn't realized the video was available until this post. Thank you! > And here is the talk about it Tim gave at ApacheCon > > https://youtu.be/vRPTPMwI53k?list=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp > > I've enjoyed it (the video). So did I! Tilman

RE: [VOTE] Release Apache PDFBox 2.0.6

2017-05-12 Thread Allison, Timothy B.
+1 Thank you! -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Friday, May 12, 2017 12:13 PM To: dev@pdfbox.apache.org Subject: [VOTE] Release Apache PDFBox 2.0.6 Hi, a candidate for the PDFBox 2.0.6 release is available at:

RE: 2.0.6 release ?

2017-05-12 Thread Allison, Timothy B.
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz Looks good to me on a very cursory look.

RE: 2.0.6 release ?

2017-05-11 Thread Allison, Timothy B.
> It isn't that secret as Tim posted it somewhere in this thread :) I've added throttling to httpd (I think) so we should be ok, and y, the address is out in the open now. Let me know if I should kick off another run. Thank you, all!

RE: 2.0.6 release ?

2017-05-10 Thread Allison, Timothy B.
Haven't had a chance to look. Reports are here: http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz

RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
I won't have results immediately. :) -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 4:13 PM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 09.05.2017 um 22:03 schrieb Allison, Timothy B.: > UGH. I'm so wrong.

RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
. But after removing it, it still works with the three files... so the question is, can this parameter ever be null, or not? Tilman Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.: > Should we return false if the node is null in PDPageTree#isPageTreeNode (66 > new NPE exceptions)
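The null-guard question quoted above can be illustrated with a minimal sketch. This is not PDFBox's actual `PDPageTree` code: the class name `PageTreeNodeSketch`, the use of a plain `Map` in place of a COS dictionary, and the "Type is Pages or has Kids" rule are all assumptions made only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a page-tree check; NOT PDFBox's real PDPageTree.
final class PageTreeNodeSketch {

    // Returns false for a null node instead of dereferencing it (which would throw an NPE).
    static boolean isPageTreeNode(Map<String, Object> node) {
        if (node == null) {
            return false;
        }
        // Assumed rule for this sketch: a node counts as a page-tree node
        // if its Type is "Pages" or it carries a "Kids" entry.
        return "Pages".equals(node.get("Type")) || node.containsKey("Kids");
    }

    public static void main(String[] args) {
        Map<String, Object> pages = new HashMap<>();
        pages.put("Type", "Pages");

        Map<String, Object> page = new HashMap<>();
        page.put("Type", "Page");

        System.out.println(isPageTreeNode(null));   // guarded: no exception
        System.out.println(isPageTreeNode(pages));
        System.out.println(isPageTreeNode(page));
    }
}
```

Whether the parameter can ever legitimately be null is exactly the open question in the thread; the guard only changes what happens if it is.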

RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
With lots of empty pages... -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, May 9, 2017 3:57 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.6 release ? Doh. AR can't open it. Sorry. Chrome appears to be able to open it. -Original Message

RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 3:20 PM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.: >> I've fixed all remaining regression tickets (in the end it was >> exactl

RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
>I've fixed all remaining regression tickets (in the end it was exactly 1) Great! Thank you! Let me know when I should kick off another eval. - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands,

RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
Added a page count comparison report under "content/": http://162.242.228.174/reports/reports_pdfbox_2_0_6c.tar.gz -Original Message----- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, May 9, 2017 2:39 PM To: dev@pdfbox.apache.org Subject: RE: 2.0.6 relea

RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
] Sent: Tuesday, May 9, 2017 10:07 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 09.05.2017 um 02:43 schrieb Allison, Timothy B.: > Tilman's initial recommendation Can you do me another favor? Have a column with the size in any table that is about individual files. I th

RE: 2.0.6 release ?

2017-05-09 Thread Allison, Timothy B.
Y. Will do. Meetings beckon, so it will take a few hours. :( -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, May 9, 2017 10:07 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 09.05.2017 um 02:43 schrieb Allison, Timothy B

RE: 2.0.6 release ?

2017-05-08 Thread Allison, Timothy B.
Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Monday, May 8, 2017 10:01 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.: > Happy to. Will kick off now? Yes Tilman > > -Original Message- > From: Tilman Haus

low priority: proxy settings and unit tests?

2017-05-08 Thread Allison, Timothy B.
All, Apologies for this one... Is there an easy way to set proxy information for the unit tests that get an InputStream via URL without changing any source code or project poms? In Intellij, I can modify the program arguments for each one, but then, of course, maven doesn't pick up that

RE: 2.0.6 release ?

2017-05-08 Thread Allison, Timothy B.
Happy to. Will kick off now? -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Saturday, May 6, 2017 10:02 AM To: dev@pdfbox.apache.org Subject: Re: 2.0.6 release ? Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler: > Am 02.05.2017 um 12:42 schrieb Andreas

jai-imageio-core -- BSD-3 with nuclear clause

2017-04-27 Thread Allison, Timothy B.
PDFBox colleagues, On TIKA-2338, we're considering incorporating jai-imageio-core into Tika (removing the "provided" scope) because the authors on github claim that they've removed the non-ASL 2.0 parts out of jai-imageio-core. We noticed, though, that this is BSD-3 with the nuclear

RE: [VOTE] Release Apache PDFBox 2.0.5

2017-03-16 Thread Allison, Timothy B.
tions.xlsx, then looking only > at govdocs there, all are similar or better. > > Tilman > > Am 15.03.2017 um 00:03 schrieb Allison, Timothy B.: >> +1 >> >> I ran a comparison with 2.0.5-rc1 and (I think) 2.0.4 against ~500k >> files from our regression corpus. >

RE: [VOTE] Release Apache PDFBox 2.0.5

2017-03-14 Thread Allison, Timothy B.
+1 I ran a comparison with 2.0.5-rc1 and (I think) 2.0.4 against ~500k files from our regression corpus. I haven't had a chance to do much digging, but I wanted to share what I had as soon as I had it. Reports are here:

tika-eval

2017-02-17 Thread Allison, Timothy B.
All, I finally got around to adding tika-eval[1] to Apache Tika. If you have any interest in comparing the output of different tools/versions/parameters on text extraction, give it a try. You don't need to use Tika or format the output in a specific format; plain UTF-8 text will work.

RE: [VOTE] Release Apache PDFBox 2.0.4

2016-12-12 Thread Allison, Timothy B.
+1 Comparisons available here: http://162.242.228.174/reports/reports_pdfbox_2_0_3_vs_2_0_4-rc1.tar.bz2 No new exceptions, a few fixed exceptions, better content extraction. Thank you, all! -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Monday, December

RE: New releases

2016-12-12 Thread Allison, Timothy B.
Or, turns out the 12th...ugh. I just kicked off the regression tests. Should have results within 8 hours. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, November 29, 2016 3:36 PM To: dev@pdfbox.apache.org Subject: RE: New releases +1 I

FW: ApacheCon Miami is coming in May.

2016-11-30 Thread Allison, Timothy B.
> ApacheCon and Apache Big Data will be held at the Intercontinental in Miami, > Florida, May 16-18, 2017 I plan to attend. Who's in? Any interest in collaborating on a talk or submitting your own? Cheers, Tim -Original Message- From: Rich Bowen [mailto:rbo...@apache.org]

RE: New releases

2016-11-29 Thread Allison, Timothy B.
+1 I should have time to run the regression tests against 2.0.x the week of the 5th. -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, November 29, 2016 2:21 AM To: dev@pdfbox.apache.org Subject: Re: New releases Am 28.11.2016 um 21:38 schrieb

Apache Tika's public regression corpus

2016-10-05 Thread Allison, Timothy B.
All, I recently blogged about some of the work we're doing with a large scale regression corpus to make Tika, POI and PDFBox more robust and to identify regressions before release. If you'd like to chip in with recommendations, requests or Hadoop/Spark clusters (why not shoot for the stars),

RE: New PDFBox Committer

2016-09-19 Thread Allison, Timothy B.
Thank you, all! I am honored to join your ranks! Cheers, Tim -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Monday, September 19, 2016 7:55 AM To: dev@pdfbox.apache.org Subject: New PDFBox Committer Hi, I'm happy to announce that the PDFBox

RE: PDFBox 2.0.3 TIKA comparison

2016-09-15 Thread Allison, Timothy B.
8 schrieb Allison, Timothy B.: >>> >>> >>> There are some regressions in content extraction, but overall, >>> content extraction looks to have improved quite a bit. Looks like >>> ~2 million more "common English words" via Tilman's methodolo

RE: PDFBox 2.0.3 TIKA comparison

2016-09-15 Thread Allison, Timothy B.
Perfect. Thank you! -Original Message- From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] Sent: Thursday, September 15, 2016 8:31 AM To: dev@pdfbox.apache.org Subject: Re: PDFBox 2.0.3 TIKA comparison Am 15.09.2016 um 13:52 schrieb Allison, Timothy B.: >> The one apparent maj

RE: PDFBox 2.0.3 TIKA comparison

2016-09-15 Thread Allison, Timothy B.
If this doesn't look like something you've recently fixed, I can rerun with the actual 2.0.3-rc1 (only on pdfs!) and see if I'm still getting this exception. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, September 15, 2016 7:53 AM To: dev

RE: PDFBox 2.0.3 TIKA comparison

2016-09-15 Thread Allison, Timothy B.
> The one apparent major new exception for PDF files was apparently fixed > before 2.0.3. So, please ignore that one! Wait...if possible, please confirm that you did fix this recently (within the last week or two). I ran pdfbox app's (2.0.3) on a handful of triggering files and didn't get

RE: PDFBox 2.0.3 TIKA comparison

2016-09-14 Thread Allison, Timothy B.
From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Wednesday, September 14, 2016 2:50 PM To: dev@pdfbox.apache.org Subject: Re: PDFBox 2.0.3 TIKA comparison > Am 14.09.2016 um 18:38 schrieb Allison, Timothy B.: >> >> >> There are some regressions in content extrac

RE: PDFBox 2.0.3 TIKA comparison

2016-09-14 Thread Allison, Timothy B.
, September 14, 2016 12:52 PM To: dev@pdfbox.apache.org Subject: Re: PDFBox 2.0.3 TIKA comparison Am 14.09.2016 um 18:38 schrieb Allison, Timothy B.: > https://github.com/tballison/share/blob/master/tika_comparisons/report > s_tika_20160904_dev.zip > > This run was against the full corp

RE: PDFBox 2.0.3?

2016-09-14 Thread Allison, Timothy B.
Tim -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Monday, September 12, 2016 12:58 PM To: dev@pdfbox.apache.org Subject: Re: PDFBox 2.0.3? Am 12.09.2016 um 18:47 schrieb Allison, Timothy B.: > Let me know if/when to run a comparison between 2.

RE: PDFBox 2.0.3?

2016-09-12 Thread Allison, Timothy B.
Let me know if/when to run a comparison between 2.0.3 and 2.0.1 (shipped w/ Tika 1.13). Cheers, Tim

PDFBox 2.0.3?

2016-08-11 Thread Allison, Timothy B.
PDFBox Colleagues, We may be heading towards a release of Tika 1.14 over the next month, maybe early September. Any plans for a PDFBox 2.0.3 release before then? I'm happy to recommend to my Tika-colleagues a delay if you would naturally be releasing somewhere around then. Best,

FW: Apache Tika used to parse the Panama papers!

2016-04-06 Thread Allison, Timothy B.
Looks like quite a few PDFs [0]... Couldn't have done it without you! Cheers, Tim P.S. Tip of the hat to Andreas for rt the link! [0] https://twitter.com/bigdata/status/717346207312392192 -Original Message- From: Mattmann, Chris A (3980)

RE: shading/relocating 1.8.x?

2016-03-29 Thread Allison, Timothy B.
, 2016 7:12 AM To: dev@pdfbox.apache.org Subject: RE: shading/relocating 1.8.x? > "Allison, Timothy B." <talli...@mitre.org> hat am 28. März 2016 um > 21:02 > geschrieben: > > > Oh, wow, so it really might be possible without too much work? I'm > more

RE: shading/relocating 1.8.x?

2016-03-28 Thread Allison, Timothy B.
: shading/relocating 1.8.x? Am 25.03.2016 um 17:39 schrieb John Hewson: > >> On 23 Mar 2016, at 06:20, Allison, Timothy B. <talli...@mitre.org> wrote: >> >> All, >> We've upgraded to 2.0.0 on Tika. Many thanks again! >> One of our users is interest

RE: shading/relocating 1.8.x?

2016-03-28 Thread Allison, Timothy B.
See: https://issues.apache.org/jira/browse/TIKA-1285?focusedCommentId=15214111&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15214111 -Original Message- From: John Hewson [mailto:j...@jahewson.com] Sent: Friday, March 25, 2016 1:03 PM To:

RE: shading/relocating 1.8.x?

2016-03-25 Thread Allison, Timothy B.
Hi John, Normally, I'd agree. And, y, I've been extremely grateful for the effort put into dealing with noisy PDFs in 2.0.0. However, I think that the Tika user requesting this is interested in getting what he can from truncated and truly broken files -- e.g. Common Crawl data which (I

shading/relocating 1.8.x?

2016-03-23 Thread Allison, Timothy B.
All, We've upgraded to 2.0.0 on Tika. Many thanks again! One of our users is interested in continuing to use the classic/SequentialParser, or at least having it available as a back-off parser for corrupt pdfs [0]. Would you be willing to distribute a shaded/relocated 1.8.x app so that we

RE: The Apache® Software Foundation announces Apache PDFBox™ v2.0

2016-03-21 Thread Allison, Timothy B.
Congratulations! And, thank you! Cheers, Tim -Original Message- From: Andreas Lehmkühler [mailto:andr...@lehmi.de] Sent: Monday, March 21, 2016 10:11 AM To: us...@pdfbox.apache.org Subject: Fwd: The Apache® Software Foundation announces Apache PDFBox™ v2.0

RE: roadmap for XMPBox?

2016-03-11 Thread Allison, Timothy B.
@pdfbox.apache.org Subject: Re: roadmap for XMPBox? Hi all As a third option: What about the BSD-licensed Adobe XMP Toolkit? At least verapdf seems to use a fork of it: https://github.com/veraPDF/veraPDF-xmp Cheers, beat Am 07.03.2016 um 19:31 schrieb Allison, Timothy B.: > All, > > >

RE: roadmap for XMPBox?

2016-03-08 Thread Allison, Timothy B.
2.0.0, obviously... :) -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de] Sent: Tuesday, March 08, 2016 12:56 PM To: dev@pdfbox.apache.org Subject: Re: roadmap for XMPBox? Am 08.03.2016 um 18:44 schrieb Allison, Timothy B.: > Got it. Thank you. I wanted t

RE: roadmap for XMPBox?

2016-03-08 Thread Allison, Timothy B.
/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf And no, there are no plans for anything on XMP at this time... Tilman Am 07.03.2016 um 19:31 schrieb Allison, Timothy B.: > All, > > > >When we migrate to PDFBox 2.x over on Tika, I'd much p

roadmap for XMPBox?

2016-03-07 Thread Allison, Timothy B.
All, When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from our current reliance on jempbox to XMPBox. I recently extracted ~70k XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were exceptions on roughly 40% of the XMPs. I’m

RE: [VOTE] Release Apache PDFBox 1.8.11

2016-01-12 Thread Allison, Timothy B.
ou using for your tests? I ran into some problems (test failures during rendering) whenever using the openjdk which comes with fedora by default. Those disappear once I switch to oracle jdk. BR, Andreas Am 12.01.2016 um 18:58 schrieb Allison, Timothy B.: > All, > > Is this user error?

RE: [VOTE] Release Apache PDFBox 1.8.11

2016-01-12 Thread Allison, Timothy B.
All, Is this user error? I'm getting 3 test exceptions in both Windows and Linux in the preflight module after I did an svn checkout from: http://svn.apache.org/repos/asf/pdfbox/tags/1.8.11/ Revision: 1724292 Node Kind: directory Schedule: normal Last Changed Author: lehmi Last Changed Rev:

comparison of 1.8.10 and 2.0 trunk

2015-10-23 Thread Allison, Timothy B.
All, Apologies for the delay. I finally finished the comparison of text extracted from 100k pdfs with 1.8.10 and 2.0 trunk (pdfbox-2.0.0-20151022.051152-1783). The reports are available here [0]. I botched the commit message... I haven't had a chance to review the results. The eval code

RE: Subclassing BaseParser?

2015-10-05 Thread Allison, Timothy B.
13:25 schrieb Allison, Timothy B.: > > [switching to dev because this is entering into dev land] > > Y, I did and do have it working for the 1.8.x branch. I either had it > working for the 2.0 branch before the change to SequentialSource was > made, or there's a chance that

RE: Subclassing BaseParser?

2015-10-05 Thread Allison, Timothy B.
13 schrieb Allison, Timothy B.: All, I'm probably suffering from the same failure that led to (https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14640370&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640370), but is it possible to subclass BasePars

RE: help debugging integration of PDFBox 2.0.0 trunk

2015-07-20 Thread Allison, Timothy B.
it sometimes occurs to me that Java crashed in a native font library. However with 2.x and Java 1.7 I had also crashes in a native Java library. Best, Timo Am 20.07.2015 um 18:12 schrieb Allison, Timothy B.: All, While integrating 2.0.0 trunk into Tika and running against govdocs1, I'm

help debugging integration of PDFBox 2.0.0 trunk

2015-07-20 Thread Allison, Timothy B.
All, While integrating 2.0.0 trunk into Tika and running against govdocs1, I'm finding two issues that are difficult to reproduce. Background: Tika-batch has a parent process that kicks off a Tika processor in a child process, if that dies unexpectedly, the parent kicks it off again. I'm

RE: help debugging integration of PDFBox 2.0.0 trunk

2015-07-20 Thread Allison, Timothy B.
Xmx doesn't limit native memory, so if there's a leak associated with AWT, ImageIO C libraries, or some other JNI library, the process can grow without limit. Such a leak could be due to a bug, or us not calling close() somewhere. Got it. Ok. Is there anything I can do to help figure out
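The remark above about "not calling close() somewhere" can be sketched as follows. `NativeBackedResource` and `CloseSketch` are hypothetical names standing in for anything that holds memory outside the Java heap; the counter only simulates native allocations and nothing here is PDFBox or Tika code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical resource standing in for an object backed by native (non-heap) memory.
final class NativeBackedResource implements AutoCloseable {
    // Counts resources that were opened but not yet closed (simulated, not real native memory).
    static final AtomicInteger OPEN = new AtomicInteger();

    NativeBackedResource() {
        OPEN.incrementAndGet();
    }

    String read() {
        return "data";
    }

    @Override
    public void close() {
        OPEN.decrementAndGet();
    }
}

final class CloseSketch {
    // try-with-resources calls close() even if read() throws,
    // so the simulated native allocation is always released.
    static String readOnce() {
        try (NativeBackedResource r = new NativeBackedResource()) {
            return r.read();
        }
    }

    public static void main(String[] args) {
        readOnce();
        System.out.println("open after use: " + NativeBackedResource.OPEN.get());
    }
}
```

A resource created outside such a block and never closed would keep the counter above zero, which is the heap-invisible growth pattern the message describes.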

RE: help debugging integration of PDFBox 2.0.0 trunk

2015-07-20 Thread Allison, Timothy B.
@pdfbox.apache.org Subject: Re: help debugging integration of PDFBox 2.0.0 trunk Am 20.07.2015 um 18:12 schrieb Allison, Timothy B.: All, While integrating 2.0.0 trunk into Tika and running against govdocs1, I'm finding two issues that are difficult to reproduce. Background: Tika-batch has a parent

RE: help debugging integration of PDFBox 2.0.0 trunk

2015-07-20 Thread Allison, Timothy B.
Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, July 20, 2015 3:18 PM To: dev@pdfbox.apache.org Subject: RE: help debugging integration of PDFBox 2.0.0 trunk Y, sorry, Tilman. I'm not running into problems with 1.8.9 and straight text extraction, though

RE: first stack trace report from pdfbox 2.0.0 trunk

2015-07-15 Thread Allison, Timothy B.
On 14 Jul 2015, at 13:49, Tilman Hausherr thaush...@t-online.de wrote: Am 14.07.2015 um 22:35 schrieb John Hewson: On 14 Jul 2015, at 13:20, Tilman Hausherr thaush...@t-online.de wrote: Am 14.07.2015 um 21:37 schrieb Allison, Timothy B.: Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029

RE: PDFBox 1.8.10 release

2015-07-15 Thread Allison, Timothy B.
Initial run on 1.8.10 is posted here: https://issues.apache.org/jira/browse/TIKA-1588 Results: no surprises That run was done before PDFBOX-2853 was completed. Rerun now or wait for more changes in 1.8.10? -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de]

RE: first stack trace report from pdfbox 2.0.0 trunk

2015-07-14 Thread Allison, Timothy B.
, it applies to 029423 but also to other files. Tilman Am 10.07.2015 um 13:57 schrieb Allison, Timothy B.: All, I just posted the first stacktrace report from my initial partial batch run against govdocs1 here: https://issues.apache.org/jira/secure/attachment/12744700

RE: PDFBox 1.8.10 release

2015-07-10 Thread Allison, Timothy B.
I'll kick off 1.8.9 now so that we have it as comparison when 1.8.10-rc1 is ready. Please ping me on https://issues.apache.org/jira/browse/TIKA-1588 if you don't hear back from me on this list when rc1 is ready. -Original Message- From: Tilman Hausherr [mailto:thaush...@t-online.de]

first stack trace report from pdfbox 2.0.0 trunk

2015-07-10 Thread Allison, Timothy B.
All, I just posted the first stacktrace report from my initial partial batch run against govdocs1 here: https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip Caveats/Notes The run yesterday did not include the fixes that were made in PDFBOX-2370 or
