> PDFBOX-4153 is solved. How about cutting the release next Monday?
+1 and thank you!
Tim
Reports are available here:
http://162.242.228.174/reports/pdfbox-2.0.9-pre-rc1_reports_2.tar.bz2
> ok => Tim, please start again
Will start now.
I'm happy to run the regression tests again when all final changes for
2.0.9-RC1 are made. I'm really excited to be able to include jbig2. We'll
start the Tika release process for 1.18 as soon as PDFBox 2.0.9 is available.
Thank you, all!
Cheers,
Tim
, 2018 3:52 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.9 release?
Am 08.03.2018 um 21:35 schrieb Allison, Timothy B.:
> I've kicked off an initial run with 2.0.9-SNAPSHOT on the regression corpus.
> While I had some time, I wanted to see if there were any early indicators of
>
is really, truly ready for rc1.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, March 7, 2018 8:03 AM
To: dev@pdfbox.apache.org
Subject: RE: PDFBox 2.0.9 release?
Argh. Sorry for my delay. Y. I have time, and I'm happy to help Tilman if
he'd prefer to lead the regression testing process again.
Cheers,
Tim
-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Monday, March 5, 2018 1:28 PM
To:
that was on the server.
>>
>> I'm thinking of creating a new "A" with 2.0.4 with current tika trunk
>> and then compare with the "B" I did.
>>
>> Tilman
>>
>>
>> Am 03.11.2017 um 22:14 schrieb Tilman Hausherr:
>>> Am 03.11.20
Tilman,
Thank you for the toe-stubbing. I'm sorry that it wasn't easier...
I created a new user with collab permissions and ran through the process.
You are right about the privileges on the tmp directory... POI needs a tmp
directory to write xlsx. I created a tmp directory in /work/eval
Sorry. Fixed.
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, October 31, 2017 6:08 PM
To: dev@pdfbox.apache.org
Subject: Re: Running tika-eval on the Rackspace vm
Am 31.10.2017 um 20:53 schrieb Allison, Timothy B.:
> It's not possible to rename / remove the files / directories mentioned in
> part 1 due to not having the permissions.
Gah. Sorry. Tilman, I added you to "collab" and chgrp to collab on /work
/data2/docs /data3/batch_runs and /data4/batch_runs.
> The directory is named batch-apps, not
version - is this the "good" version, so I could simply
> download tika-app and put it there? Or just build tika with a specific
> PDFBox version?
>
> Tilman
>
> Am 23.10.2017 um 20:54 schrieb Allison, Timothy B.:
>> All,
>>
>> If anyone would l
+1
Thank you, Andreas, Tilman, and team!
Cheers,
Tim
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, October 30, 2017 3:57 PM
To: dev@pdfbox.apache.org
Subject: Re: [VOTE] Release Apache PDFBox 2.0.8
Am 30.10.2017 um 19:47 schrieb Andreas
cycles,
> you might kick off another run.
>
> Andreas
>
> Am 23.10.2017 um 20:11 schrieb Allison, Timothy B.:
>> Reports here:
>> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take4.tar.gz
>>
>> I haven't looked yet.
>>
>> -Original
expect any new regressions, but if you have some cycles, you might
kick off another run.
Andreas
Am 23.10.2017 um 20:11 schrieb Allison, Timothy B.:
> Reports here:
> http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take4.tar.gz
>
> I haven't looked yet.
>
> -Original M
All,
If anyone would like to join the fun in running tika-eval on the Rackspace vm,
I posted this: https://wiki.apache.org/tika/TikaEvalOnVM . You’ll need access
to the vm, of course, but I’m happy to grant that to anyone who wants to chip
in and help with regression tests. There are some
open regression in 2.0.8, Tilman's test run hasn't
shown any regression. Please re-run your tests again to see if we can proceed
with 2.0.8, I'd really like to push it out.
TIA again,
Andreas
Am 08.10.2017 um 16:11 schrieb Andreas Lehmkuehler:
> Am 03.10.2017 um 15:38 schrieb Allison, Timothy B.:
>>
>>> And yes, we need another regressions run if possible
> However, PDFBox 2.0.8-SNAPSHOT has more 0s, 1s, 2s and 3s...
>
> The TOP_10_MORE_IN_B column in the contents report shows that there are 15
> more 0s, 15 more 1s, 11 more 2s, etc.
>
> 0: 15 | 1: 15 | 2: 11 | 20: 5 | 3: 2 | 4: 2
>Yeah but where do they come from? Not from the pure text
: 5 | 3: 2 | 4: 2
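For anyone who wants to tally a cell like the TOP_10_MORE_IN_B value quoted above, here is a minimal Python sketch. It assumes the cell is a plain "token: count" list separated by "|", exactly as pasted in this thread. The function name parse_token_counts is made up for illustration and is not part of tika-eval.

```python
def parse_token_counts(cell):
    """Parse 'token: count | token: count' into a dict (insertion-ordered)."""
    counts = {}
    for part in cell.split("|"):
        part = part.strip()
        if not part:
            continue
        # Split on the LAST colon so tokens that contain ':' still parse.
        token, _, count = part.rpartition(":")
        counts[token.strip()] = int(count)
    return counts


# The value quoted in the thread above.
cell = "0: 15 | 1: 15 | 2: 11 | 20: 5 | 3: 2 | 4: 2"
counts = parse_token_counts(cell)

# Total surplus of purely numeric tokens in B.
digit_total = sum(n for tok, n in counts.items() if tok.isdigit())
print(counts)
print(digit_total)
```

Summing the numeric tokens this way gives a quick feel for how much of the difference is digits rather than words.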
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, October 10, 2017 11:47 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?
Am 09.10.2017 um 22:26 schrieb Allison, Timothy B.:
> Thank you, Andreas, for fixing the slow parse on corrupt file so quickly!
>
> Reports are here:
> http://162.242.228.174/reports/pdfbox_2_0_7_Vs
-----
From: Allison, Timothy B.
Sent: Monday, October 9, 2017 4:26 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?
Thank you, Andreas, for fixing the slow parse on corrupt file so quickly!
Reports are here:
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take3.tar.gz
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, October 9, 2017 8:02 AM
To: dev
Starting process now.
-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Sunday, October 8, 2017 10:12 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.8?
Am 03.10.2017 um 15:38 schrieb Allison, Timothy B.:
>
>> And yes, we need another regres
>And yes, we need another regressions run if possible
Sounds good. Will do once I hear that we're good to go. Thank you!
-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail:
>>Let me know when we're ready for another round.
>I've already started ...
RC2? No need for another regression run?
Thank you again!
All,
Again, my apologies for post-useful/late results! Ugh...
Thank you, Andreas and Tilman!
Let me know when we're ready for another round.
Cheers,
Tim
-Original Message-
From: Andreas Lehmkühler (JIRA) [mailto:j...@apache.org]
Sent: Tuesday,
> Re 308576.pdf: the text extraction has a huge loss, but a manual check shows
> it is identical. However that file has the NPE from PDActionURI.getURI(),
> could it be that this results in an abort of text extraction?
Same for 569017.pdf.
Likely. There are two "per file pair contents" files.
Sorry all for taking longer than expected! File under "this information would
have been useful..." ☹
-Original Message-
From: Allison, Timothy B.
Sent: Monday, October 2, 2017 3:59 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.8?
Reports are here:
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8_take2.tar.gz
Looks like some new NPEs. I'll take a look at the metadata diffs.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, October 2, 2017 9:24 AM
To: dev@pdfbox.apache.org
Sounds good.
I kicked off the eval process yesterday, but because of a bug in our
config-file reader and/or user error in modifying the config file, I wound up
with 500k pdfs parsed by our EmptyParser...no results.
I restarted the eval process just now. I should have results in 6 hours.
> I'd go for postponing in order to fix that regression - what about setting
> the date to next Monday?
+1 I’m happy pushing it out later if the fix happens >= Friday and we want to
run the full regression tests again.
Thank you, Andreas!
http://162.242.228.174/reports/pdfbox_2_0_7_Vs_2_0_8-SNAPSHOT_reports.tar.gz
is now available. I haven't yet had a chance to look at either...
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, September 18, 2017 12:51 PM
To: dev@pdfbox.apache.org
anything pending which should be included?
>
> How about cutting the release in a week or two from now?
>
> @Tim please run a test 2.0.7 vs. 2.0.8 if possible
>
> Andreas
>
> Am 11.09.2017 um 23:24 schrieb Allison, Timothy B.:
>>> I hope there aren't any new
> because I'm ill but I expect to be my old self later this week.
I'm sorry to hear it! I hope that you are feeling better soon!
> I'd also like to have a test from version 2.0.4 compared to trunk because
> 2.0.5 was the version where the tests weren't done, the problems were fixed in
> 2.0.6
>I hope there aren't any new regressions.
Happy to help find them! :)
On a related note, do we have a sense of the schedule for PDFBox 2.0.8? I'd
like to include it in Tika's last Java 7 release...end of Sept, middle of Oct.,
or whenever 2.0.8 is out. :)
-Original Message-
From:
-terminal interactive form
fields not handled recursively
Hi Tim,
> Am 15.08.2017 um 17:31 schrieb Allison, Timothy B. <talli...@mitre.org>:
>
> All,
> I can't tell if the triggering file is corrupt or how we want to handle it
> on the PDFBox side. The problem is
All,
I can't tell if the triggering file is corrupt or how we want to handle it on
the PDFBox side. The problem is that the parent node is a PDTextField -- a
PDTerminalField -- so we don't/can't look for children, even though it actually
does have pointers in Kids.
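To make the problem concrete, here is a rough Python sketch that uses plain dicts as stand-ins for the field dictionaries. It is not PDFBox code. The keys "T", "FT" and "Kids" mirror the PDF entry names, and the rule "a kid with a /T entry is a field, not a bare widget" is an assumed heuristic for illustration only.

```python
def walk_strict(node, names=None):
    """Treat any node with a field type (FT) as terminal: never look at its Kids."""
    names = [] if names is None else names
    names.append(node.get("T"))
    if "FT" not in node:
        for kid in node.get("Kids", []):
            walk_strict(kid, names)
    return names


def walk_lenient(node, names=None):
    """Descend into Kids that look like fields, even under a 'terminal' parent."""
    names = [] if names is None else names
    names.append(node.get("T"))
    for kid in node.get("Kids", []):
        if "T" in kid:  # assumed heuristic: widgets carry no partial name
            walk_lenient(kid, names)
    return names


# A text field (nominally terminal) whose Kids are themselves named fields.
parent = {
    "T": "address",
    "FT": "Tx",
    "Kids": [
        {"T": "street", "FT": "Tx"},
        {"T": "city", "FT": "Tx"},
        {"Subtype": "Widget"},
    ],
}

print(walk_strict(parent))
print(walk_lenient(parent))
```

The strict walk only ever sees the parent, which matches the behaviour described above. The lenient walk also reaches the two child fields.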
The output from
All,
> If anyone is interested in using the detected MIME types or anything else
> from Common Crawl - I'm happy to help! The URL index [4] contains now a new
> field "mime-detected" which makes it easy to search or grep for confusion
> pairs.
This is an amazing step forward for sampling PDF
Ha. I hadn't realized the video was available until this post. Thank you!
> And here is the talk about it Tim gave at ApacheCon
>
> https://youtu.be/vRPTPMwI53k?list=PLbzoR-pLrL6pLDCyPxByWQwYTL-JrF5Rp
>
> I've enjoyed it (the video).
So did I!
Tilman
+1
Thank you!
-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Friday, May 12, 2017 12:13 PM
To: dev@pdfbox.apache.org
Subject: [VOTE] Release Apache PDFBox 2.0.6
Hi,
a candidate for the PDFBox 2.0.6 release is available at:
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170512.tar.gz
Looks good to me on a very cursory look.
> It isn't that secret as Tim posted it somewhere in this thread
:)
I've added throttling to httpd (I think) so we should be ok, and y, the address
is out in the open now.
Let me know if I should kick off another run.
Thank you, all!
Haven't had a chance to look. Reports are here:
http://162.242.228.174/reports/reports_pdfbox_2_0_6_20170510.tar.gz
I won't have results immediately. :)
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 4:13 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?
Am 09.05.2017 um 22:03 schrieb Allison, Timothy B.:
> UGH. I'm so wrong.
. But after removing it, it still works
with the three files... so the question is, can this parameter ever be null, or
not?
Tilman
Am 09.05.2017 um 21:34 schrieb Allison, Timothy B.:
> Should we return false if the node is null in PDPageTree#isPageTreeNode (66
> new NPE exceptions)
With lots of empty pages...
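A sketch of the guard being discussed, written in Python with dicts standing in for COS dictionaries. The Type/Kids test is only an assumption about what such a check might look like and is not copied from PDPageTree. The point is the early return for a missing node.

```python
def is_page_tree_node(node):
    """Return False for a missing node instead of raising."""
    if node is None:
        return False
    # Assumed check: a page tree node is typed Pages, or at least has Kids.
    return node.get("Type") == "Pages" or "Kids" in node


print(is_page_tree_node(None))
print(is_page_tree_node({"Type": "Pages"}))
print(is_page_tree_node({"Type": "Page"}))
```

Whether returning false is the right behaviour, or whether a null here should be treated as a corrupt file, is the open question in this thread.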
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 3:57 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 release ?
Doh. AR can't open it. Sorry. Chrome appears to be able to open it.
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 3:20 PM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?
Am 09.05.2017 um 21:01 schrieb Allison, Timothy B.:
>> I've fixed all remaining regression tickets (in the end it was
>> exactl
>I've fixed all remaining regression tickets (in the end it was exactly 1)
Great! Thank you!
Let me know when I should kick off another eval.
Added a page count comparison report under "content/":
http://162.242.228.174/reports/reports_pdfbox_2_0_6c.tar.gz
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 2:39 PM
To: dev@pdfbox.apache.org
Subject: RE: 2.0.6 relea
]
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?
Am 09.05.2017 um 02:43 schrieb Allison, Timothy B.:
> Tilman's initial recommendation
Can you do me another favor? Have a column with the size in any table that is
about individual files. I th
Y. Will do. Meetings beckon, so it will take a few hours. :(
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, May 9, 2017 10:07 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?
Am 09.05.2017 um 02:43 schrieb Allison, Timothy B
Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, May 8, 2017 10:01 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?
Am 08.05.2017 um 15:06 schrieb Allison, Timothy B.:
> Happy to. Will kick off now?
Yes
Tilman
>
> -Original Message-
> From: Tilman Haus
All,
Apologies for this one... Is there an easy way to set proxy information for
the unit tests that get an InputStream via URL without changing any source code
or project poms? In Intellij, I can modify the program arguments for each one,
but then, of course, maven doesn't pick up that
Happy to. Will kick off now?
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, May 6, 2017 10:02 AM
To: dev@pdfbox.apache.org
Subject: Re: 2.0.6 release ?
Am 04.05.2017 um 18:10 schrieb Andreas Lehmkuehler:
> Am 02.05.2017 um 12:42 schrieb Andreas
PDFBox colleagues,
On TIKA-2338, we're considering incorporating jai-imageio-core into Tika
(removing the "provided" scope) because the authors on github claim that
they've removed the non-ASL 2.0 parts out of jai-imageio-core.
We noticed, though, that this is BSD-3 with the nuclear
tions.xlsx, then looking only
> at govdocs there, all are similar or better.
>
> Tilman
>
> Am 15.03.2017 um 00:03 schrieb Allison, Timothy B.:
>> +1
>>
>> I ran a comparison with 2.0.5-rc1 and (I think) 2.0.4 against ~500k
>> files from our regression corpus.
>
+1
I ran a comparison with 2.0.5-rc1 and (I think) 2.0.4 against ~500k files from
our regression corpus.
I haven't had a chance to do much digging, but I wanted to share what I had as
soon as I had it.
Reports are here:
All,
I finally got around to adding tika-eval[1] to Apache Tika. If you have any
interest in comparing the output of different tools/versions/parameters on text
extraction, give it a try. You don't need to use Tika or format the output in
a specific format; plain UTF-8 text will work.
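As a rough illustration of this kind of comparison on plain UTF-8 text, here is a small Python sketch that lists the tokens appearing more often in extract B than in extract A. It is not tika-eval's implementation. The tokenizer (lowercased \w+ runs) and the function names are assumptions made for this example.

```python
import re
from collections import Counter


def tokens(text):
    """Very simple tokenizer: lowercased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def top_more_in_b(text_a, text_b, n=10):
    """Tokens that occur more often in B than in A, largest surplus first."""
    a = Counter(tokens(text_a))
    b = Counter(tokens(text_b))
    surplus = b - a  # Counter subtraction keeps only positive differences
    return surplus.most_common(n)


extract_a = "page 1 of 2"
extract_b = "page 1 of 2 total 1 1"
print(top_more_in_b(extract_a, extract_b))
```

Swapping the arguments gives the "more in A" view of the same pair of extracts.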
+1
Comparisons available here:
http://162.242.228.174/reports/reports_pdfbox_2_0_3_vs_2_0_4-rc1.tar.bz2
No new exceptions, a few fixed exceptions, better content extraction. Thank
you, all!
-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Monday, December
Or, turns out the 12th...ugh. I just kicked off the regression tests. Should
have results within 8 hours.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, November 29, 2016 3:36 PM
To: dev@pdfbox.apache.org
Subject: RE: New releases
+1
I
> ApacheCon and Apache Big Data will be held at the Intercontinental in Miami,
> Florida, May 16-18, 2017
I plan to attend.
Who's in? Any interest in collaborating on a talk or submitting your own?
Cheers,
Tim
-Original Message-
From: Rich Bowen [mailto:rbo...@apache.org]
+1
I should have time to run the regression tests against 2.0.x the week of the
5th.
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, November 29, 2016 2:21 AM
To: dev@pdfbox.apache.org
Subject: Re: New releases
Am 28.11.2016 um 21:38 schrieb
All,
I recently blogged about some of the work we're doing with a large scale
regression corpus to make Tika, POI and PDFBox more robust and to identify
regressions before release. If you'd like to chip in with recommendations,
requests or Hadoop/Spark clusters (why not shoot for the stars),
Thank you, all! I am honored to join your ranks!
Cheers,
Tim
-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Monday, September 19, 2016 7:55 AM
To: dev@pdfbox.apache.org
Subject: New PDFBox Committer
Hi,
I'm happy to announce that the PDFBox
8 schrieb Allison, Timothy B.:
>>>
>>>
>>> There are some regressions in content extraction, but overall,
>>> content extraction looks to have improved quite a bit. Looks like
>>> ~2 million more "common English words" via Tilman's methodolo
Perfect. Thank you!
-Original Message-
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Thursday, September 15, 2016 8:31 AM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison
Am 15.09.2016 um 13:52 schrieb Allison, Timothy B.:
>> The one apparent maj
If this doesn't look like something you've recently fixed, I can rerun with the
actual 2.0.3-rc1 (only on pdfs!) and see if I'm still getting this exception.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, September 15, 2016 7:53 AM
To: dev
> The one apparent major new exception for PDF files was apparently fixed
> before 2.0.3. So, please ignore that one!
Wait...if possible, please confirm that you did fix this recently (within the
last week or two). I ran pdfbox app's (2.0.3) on a handful of triggering files
and didn't get
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Wednesday, September 14, 2016 2:50 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison
> Am 14.09.2016 um 18:38 schrieb Allison, Timothy B.:
>>
>>
>> There are some regressions in content extrac
, September 14, 2016 12:52 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3 TIKA comparison
Am 14.09.2016 um 18:38 schrieb Allison, Timothy B.:
> https://github.com/tballison/share/blob/master/tika_comparisons/report
> s_tika_20160904_dev.zip
>
> This run was against the full corp
Tim
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Monday, September 12, 2016 12:58 PM
To: dev@pdfbox.apache.org
Subject: Re: PDFBox 2.0.3?
Am 12.09.2016 um 18:47 schrieb Allison, Timothy B.:
> Let me know if/when to run a comparison between 2.
Let me know if/when to run a comparison between 2.0.3 and 2.0.1 (shipped w/
Tika 1.13).
Cheers,
Tim
PDFBox Colleagues,
We may be heading towards a release of Tika 1.14 over the next month, maybe
early September. Any plans for a PDFBox 2.0.3 release before then? I'm happy
to recommend to my Tika-colleagues a delay if you would naturally be releasing
somewhere around then.
Best,
Looks like quite a few PDFs [0]...
Couldn't have done it without you!
Cheers,
Tim
P.S. Tip of the hat to Andreas for rt the link!
[0] https://twitter.com/bigdata/status/717346207312392192
-Original Message-
From: Mattmann, Chris A (3980)
, 2016 7:12 AM
To: dev@pdfbox.apache.org
Subject: RE: shading/relocating 1.8.x?
> "Allison, Timothy B." <talli...@mitre.org> hat am 28. März 2016 um
> 21:02
> geschrieben:
>
>
> Oh, wow, so it really might be possible without too much work? I'm
> more
: shading/relocating 1.8.x?
Am 25.03.2016 um 17:39 schrieb John Hewson:
>
>> On 23 Mar 2016, at 06:20, Allison, Timothy B. <talli...@mitre.org> wrote:
>>
>> All,
>> We've upgraded to 2.0.0 on Tika. Many thanks again!
>> One of our users is interest
See:
https://issues.apache.org/jira/browse/TIKA-1285?focusedCommentId=15214111=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15214111
-Original Message-
From: John Hewson [mailto:j...@jahewson.com]
Sent: Friday, March 25, 2016 1:03 PM
To:
Hi John,
Normally, I'd agree. And, y, I've been extremely grateful for the effort put
into dealing with noisy PDFs in 2.0.0. However, I think that the Tika user
requesting this is interested in getting what he can from truncated and truly
broken files -- e.g. Common Crawl data which (I
All,
We've upgraded to 2.0.0 on Tika. Many thanks again!
One of our users is interested in continuing to use the
classic/SequentialParser, or at least having it available as a back-off parser
for corrupt pdfs [0].
Would you be willing to distribute a shaded/relocated 1.8.x app so that we
Congratulations! And, thank you!
Cheers,
Tim
-Original Message-
From: Andreas Lehmkühler [mailto:andr...@lehmi.de]
Sent: Monday, March 21, 2016 10:11 AM
To: us...@pdfbox.apache.org
Subject: Fwd: The Apache® Software Foundation announces Apache PDFBox™ v2.0
@pdfbox.apache.org
Subject: Re: roadmap for XMPBox?
Hi all
As a third option: What about the BSD-licensed Adobe XMP Toolkit? At least
verapdf seems to use a fork of it: https://github.com/veraPDF/veraPDF-xmp
Cheers, beat
Am 07.03.2016 um 19:31 schrieb Allison, Timothy B.:
> All,
>
>
>
2.0.0, obviously... :)
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Tuesday, March 08, 2016 12:56 PM
To: dev@pdfbox.apache.org
Subject: Re: roadmap for XMPBox?
Am 08.03.2016 um 18:44 schrieb Allison, Timothy B.:
> Got it. Thank you. I wanted t
/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
And no, there are no plans for anything on XMP at this time...
Tilman
Am 07.03.2016 um 19:31 schrieb Allison, Timothy B.:
> All,
>
>
>
>When we migrate to PDFBox 2.x over on Tika, I'd much p
All,
When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from
our current reliance on jempbox to XMPBox. I recently extracted ~70k XMPs from
PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were
exceptions on roughly 40% of the XMPs.
I’m
ou using for your tests? I ran
into some problems (test failures during rendering) whenever using the openjdk
which comes with fedora by default. Those disappear once I switch to oracle jdk.
BR,
Andreas
Am 12.01.2016 um 18:58 schrieb Allison, Timothy B.:
> All,
>
> Is this user error?
All,
Is this user error? I'm getting 3 test exceptions in both Windows and Linux in
the preflight module after I did an svn checkout from:
http://svn.apache.org/repos/asf/pdfbox/tags/1.8.11/
Revision: 1724292
Node Kind: directory
Schedule: normal
Last Changed Author: lehmi
Last Changed Rev:
All,
Apologies for the delay. I finally finished the comparison of text extracted
from 100k pdfs with 1.8.10 and 2.0 trunk (pdfbox-2.0.0-20151022.051152-1783).
The reports are available here [0]. I botched the commit message...
I haven't had a chance to review the results. The eval code
13:25 schrieb Allison, Timothy B.:
>
> [switching to dev because this is entering into dev land]
>
> Y, I did and do have it working for the 1.8.x branch. I either had it
> working for the 2.0 branch before the change to SequentialSource was
> made, or there's a chance that
13 schrieb Allison, Timothy B.:
All,
I'm probably suffering from the same failure that led to
(https://issues.apache.org/jira/browse/TIKA-1678?focusedCommentId=14640370=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14640370),
but is it possible to subclass BasePars
it sometimes occurs
to me that Java crashed in a native font library. However with 2.x and
Java 1.7 I also had crashes in a native Java library.
Best,
Timo
Am 20.07.2015 um 18:12 schrieb Allison, Timothy B.:
All,
While integrating 2.0.0 trunk into Tika and running against govdocs1, I'm
finding two issues that are difficult to reproduce.
Background:
Tika-batch has a parent process that kicks off a Tika processor in a child
process; if that dies unexpectedly, the parent kicks it off again. I'm
Xmx doesn't limit native memory, so if there's a leak associated with AWT,
ImageIO C libraries, or some other JNI library, the process can grow without
limit. Such a leak could be due to a bug, or us not calling close() somewhere.
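The parent/child arrangement mentioned in this thread can be sketched roughly as follows in Python. This is not tika-batch's code. The function name run_with_restarts and the max_restarts cap are hypothetical, and the sketch only reacts to the child's exit code. It does not watch native memory, which is the part that -Xmx cannot bound.

```python
import subprocess
import sys


def run_with_restarts(cmd, max_restarts=3):
    """Run cmd; if it exits non-zero, launch it again, up to max_restarts times.

    Returns the total number of launches.
    """
    launches = 0
    while True:
        launches += 1
        result = subprocess.run(cmd)
        if result.returncode == 0 or launches > max_restarts:
            return launches


# Stand-in child processes: one that succeeds, one that always dies.
ok_child = [sys.executable, "-c", "pass"]
dying_child = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(run_with_restarts(ok_child))
print(run_with_restarts(dying_child, max_restarts=2))
```

A parent like this could, in principle, also poll the child's resident memory and restart it past some threshold, but that is beyond this sketch.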
Got it. Ok. Is there anything I can do to help figure out
@pdfbox.apache.org
Subject: Re: help debugging integration of PDFBox 2.0.0 trunk
Am 20.07.2015 um 18:12 schrieb Allison, Timothy B.:
All,
While integrating 2.0.0 trunk into Tika and running against govdocs1, I'm
finding two issues that are difficult to reproduce.
Background:
Tika-batch has a parent
Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, July 20, 2015 3:18 PM
To: dev@pdfbox.apache.org
Subject: RE: help debugging integration of PDFBox 2.0.0 trunk
Y, sorry, Tilman. I'm not running into problems with 1.8.9 and straight text
extraction, though
On 14 Jul 2015, at 13:49, Tilman Hausherr thaush...@t-online.de wrote:
Am 14.07.2015 um 22:35 schrieb John Hewson:
On 14 Jul 2015, at 13:20, Tilman Hausherr thaush...@t-online.de wrote:
Am 14.07.2015 um 21:37 schrieb Allison, Timothy B.:
Interesting, yes: 781/781172.pdf, 490/490376.pdf and 029
Initial run on 1.8.10 is posted here:
https://issues.apache.org/jira/browse/TIKA-1588
Results: no surprises
That run was done before PDFBOX-2853 was completed.
Rerun now or wait for more changes in 1.8.10?
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
, it applies to 029423 but also to other files.
Tilman
Am 10.07.2015 um 13:57 schrieb Allison, Timothy B.:
All,
I just posted the first stacktrace report from my initial partial batch
run against govdocs1 here:
https://issues.apache.org/jira/secure/attachment/12744700
I'll kick off 1.8.9 now so that we have it as comparison when 1.8.10-rc1 is
ready. Please ping me on https://issues.apache.org/jira/browse/TIKA-1588 if
you don't hear back from me on this list when rc1 is ready.
-Original Message-
From: Tilman Hausherr [mailto:thaush...@t-online.de]
All,
I just posted the first stacktrace report from my initial partial batch run
against govdocs1 here:
https://issues.apache.org/jira/secure/attachment/12744700/pdfbox_reports_2_0_0_20150709.zip
Caveats/Notes
The run yesterday did not include the fixes that were made in PDFBOX-2370 or