Re: shading/relocating 1.8.x?
> On 29 Mar 2016, at 04:11, Andreas Lehmkühler wrote:
>
>> "Allison, Timothy B." wrote on 28 March 2016 at 21:02:
>>
>> Oh, wow, so it really might be possible without too much work? I'm more than
>> happy to supply examples. :)
>
> Oops, it isn't as simple as it sounds. If we simply swallow the exception,
> PDFBox most likely runs into an NPE. IMHO we have to implement some sort of
> on-demand parser which is able to handle null values for specific parts of a
> PDF without throwing any exception.

One thought: instead of null it might be possible to return an empty string, empty dictionary, empty array, empty stream, etc. That way we don't have to look for null everywhere.

John

>> Should I open an issue?
>
> Thanks, but I'm going to do that soon, as some other things should be done as
> well.
>
> BR
> Andreas

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
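The empty-object idea in the reply above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: `ParsedObject` and its getters are hypothetical stand-ins invented for this sketch, not PDFBox classes, and the real COS object model may need a different approach.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class EmptyDefaults {

    /**
     * Hypothetical holder for whatever a parser recovered from a damaged file.
     * Any field may be null if that part of the file was missing or truncated.
     * This is NOT the PDFBox COS API.
     */
    static final class ParsedObject {
        private final String text;
        private final Map<String, String> dictionary;
        private final List<String> array;
        private final byte[] stream;

        ParsedObject(String text, Map<String, String> dictionary,
                     List<String> array, byte[] stream) {
            this.text = text;
            this.dictionary = dictionary;
            this.array = array;
            this.stream = stream;
        }

        // Each getter substitutes an empty value for null, so callers
        // never need a null check.
        String getText() {
            return text != null ? text : "";
        }

        Map<String, String> getDictionary() {
            return dictionary != null ? dictionary : Collections.<String, String>emptyMap();
        }

        List<String> getArray() {
            return array != null ? array : Collections.<String>emptyList();
        }

        byte[] getStream() {
            return stream != null ? stream : new byte[0];
        }
    }

    public static void main(String[] args) {
        // Simulate an object whose parts were all lost to truncation.
        ParsedObject truncated = new ParsedObject(null, null, null, null);
        int recovered = truncated.getText().length()
                + truncated.getDictionary().size()
                + truncated.getArray().size()
                + truncated.getStream().length;
        System.out.println("recovered items: " + recovered);
    }
}
```

The trade-off is that callers can no longer distinguish "absent" from "present but empty", so a real implementation might also want a flag recording that a default was substituted.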
RE: shading/relocating 1.8.x?
Got it. That's what I had assumed. I'll hold off on opening truncated-file issue(s) on PDFBox's JIRA... I opened TIKA-1912 to track this on our side.

Thank you, again!

Best,

Tim

-----Original Message-----
From: Andreas Lehmkühler [mailto:andr...@lehmi.de]
Sent: Tuesday, March 29, 2016 7:12 AM
To: dev@pdfbox.apache.org
Subject: RE: shading/relocating 1.8.x?

> "Allison, Timothy B." wrote on 28 March 2016 at 21:02:
>
> Oh, wow, so it really might be possible without too much work? I'm
> more than happy to supply examples. :)

Oops, it isn't as simple as it sounds. If we simply swallow the exception, PDFBox most likely runs into an NPE. IMHO we have to implement some sort of on-demand parser which is able to handle null values for specific parts of a PDF without throwing any exception.

> Should I open an issue?

Thanks, but I'm going to do that soon, as some other things should be done as well.

BR
Andreas
RE: shading/relocating 1.8.x?
> "Allison, Timothy B." wrote on 28 March 2016 at 21:02:
>
> Oh, wow, so it really might be possible without too much work? I'm more than
> happy to supply examples. :)

Oops, it isn't as simple as it sounds. If we simply swallow the exception, PDFBox most likely runs into an NPE. IMHO we have to implement some sort of on-demand parser which is able to handle null values for specific parts of a PDF without throwing any exception.

> Should I open an issue?

Thanks, but I'm going to do that soon, as some other things should be done as well.

BR
Andreas
RE: shading/relocating 1.8.x?
Oh, wow, so it really might be possible without too much work? I'm more than happy to supply examples. :)

Should I open an issue?

-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de]
Sent: Monday, March 28, 2016 10:58 AM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?

The biggest issue here is the truncated stream or dictionary. The current version simply throws an exception when running into such cases. We have to implement some algorithm to ignore such incomplete parts of a PDF if possible.

BR
Andreas
Re: shading/relocating 1.8.x?
On 25.03.2016 at 17:39, John Hewson wrote:
>> On 23 Mar 2016, at 06:20, Allison, Timothy B. wrote:
>>
>> All,
>> We've upgraded to 2.0.0 on Tika. Many thanks again!
>> One of our users is interested in continuing to use the
>> classic/SequentialParser, or at least having it available as a back-off
>> parser for corrupt PDFs [0].
>
> Using the old parser really isn't a good idea; it's known to be pretty
> broken. I think that we would be much better off making sure the new parser
> can handle truncated files. We already do a lot of repair in the new parser,
> so this doesn't seem like too much work? Maybe Andreas can comment further?

The biggest issue here is the truncated stream or dictionary. The current version simply throws an exception when running into such cases. We have to implement some algorithm to ignore such incomplete parts of a PDF if possible.

BR
Andreas
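To make "ignore such incomplete parts" concrete, here is a toy Java sketch of the difference between a strict parse, which throws on a truncated dictionary, and a lenient one, which keeps the entries that were read completely. Everything in it is an assumption for illustration: the whitespace-separated `<< /Key value >>` syntax is a drastic simplification, and `LenientDictionaryParser` is a hypothetical name, not a PDFBox class.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class LenientDictionaryParser {

    /**
     * Parses a toy dictionary of the form "<< /Key value /Key value >>".
     * In strict mode a missing ">>" or a dangling key raises an IOException.
     * In lenient mode the complete key/value pairs read so far are returned
     * and the incomplete tail is dropped.
     */
    static Map<String, String> parse(String input, boolean lenient) throws IOException {
        String[] tokens = input.trim().split("\\s+");
        if (!tokens[0].equals("<<")) {
            throw new IOException("expected '<<' at start of dictionary");
        }
        Map<String, String> entries = new LinkedHashMap<>();
        int i = 1;
        while (i < tokens.length) {
            if (tokens[i].equals(">>")) {
                return entries; // properly terminated dictionary
            }
            // A key with no value (end of input or premature ">>") is incomplete.
            if (i + 1 >= tokens.length || tokens[i + 1].equals(">>")) {
                break;
            }
            entries.put(tokens[i], tokens[i + 1]);
            i += 2;
        }
        if (lenient) {
            return entries; // keep what was fully read, ignore the rest
        }
        throw new IOException("truncated or incomplete dictionary");
    }

    public static void main(String[] args) throws IOException {
        String truncated = "<< /Type /Page /Length 12 /Fil";
        System.out.println(parse(truncated, true).keySet());
    }
}
```

Note the limit of this toy: a value cut off in the middle of a token cannot be detected here, which is one reason the real problem is harder than it looks.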
RE: shading/relocating 1.8.x?
See: https://issues.apache.org/jira/browse/TIKA-1285?focusedCommentId=15214111&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15214111

-----Original Message-----
From: John Hewson [mailto:j...@jahewson.com]
Sent: Friday, March 25, 2016 1:03 PM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?

> On 25 Mar 2016, at 09:44, Tilman Hausherr wrote:
>
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
>
> https://issues.apache.org/jira/browse/PDFBOX-3265

Great! Does anyone else have some others?

John
RE: shading/relocating 1.8.x?
Hi John,

Normally, I'd agree. And, yes, I've been extremely grateful for the effort put into dealing with noisy PDFs in 2.0.0.

However, I think that the Tika user requesting this is interested in getting what he can from truncated and truly broken files, e.g. Common Crawl data, which (I think) truncates files at 1MB, or files whose download was interrupted. My basic rule for opening an issue is: if AR or another PDF parser can't parse it, I'm not going to ask for help. I wouldn't want to direct all of your efforts to dealing with the edge cases of truncated files.

If the old PDFParser is able to get something out because it parsed sequentially, then it would be neat to be able to have that available with very little effort. In Tika, we envision allowing users to configure combinations of parsers for a given file; this would be the perfect case for the back-off-on-exception strategy: if there's an exception with 2.0.0, try again with 1.8.x.

I'll try shading/relocating next week, and see whether that works as expected.

Thank you, all, again!

Cheers,

Tim
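The back-off-on-exception strategy described above could look roughly like this in Java. It is a sketch under assumptions, not Tika's design: `TextExtractor` and `extractWithBackOff` are hypothetical names, and the two lambdas merely stand in for a 2.0.0-based parser and a relocated 1.8.x-based one.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class BackOffParser {

    /** Hypothetical minimal parser contract; not the Tika Parser interface. */
    interface TextExtractor {
        String extract(byte[] pdf) throws Exception;
    }

    /**
     * Tries each extractor in order and returns the first successful result.
     * If all fail, the first exception is rethrown with the later ones
     * attached as suppressed exceptions.
     */
    static String extractWithBackOff(byte[] pdf, List<TextExtractor> chain) throws Exception {
        Exception first = null;
        for (TextExtractor extractor : chain) {
            try {
                return extractor.extract(pdf);
            } catch (Exception e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first == null) {
            throw new IllegalArgumentException("empty parser chain");
        }
        throw first;
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins: the primary parser rejects the truncated file,
        // the fallback recovers some text from it.
        TextExtractor primary = pdf -> { throw new IOException("truncated file"); };
        TextExtractor fallback = pdf -> "partial text";
        System.out.println(extractWithBackOff(new byte[0], Arrays.asList(primary, fallback)));
        // prints "partial text"
    }
}
```

Keeping the primary parser's exception as the one that is rethrown means the error a user sees still reflects the preferred parser, with the fallback's failure available for debugging.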
Re: shading/relocating 1.8.x?
> On 25 Mar 2016, at 09:44, Tilman Hausherr wrote:
>
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
>
> https://issues.apache.org/jira/browse/PDFBOX-3265

Great! Does anyone else have some others?

John
Re: shading/relocating 1.8.x?
On 25.03.2016 at 17:39, John Hewson wrote:
> Do we have some JIRA issues which identify some of these cases?

https://issues.apache.org/jira/browse/PDFBOX-3265
Re: shading/relocating 1.8.x?
> On 23 Mar 2016, at 06:20, Allison, Timothy B. wrote:
>
> All,
> We've upgraded to 2.0.0 on Tika. Many thanks again!
> One of our users is interested in continuing to use the
> classic/SequentialParser, or at least having it available as a back-off
> parser for corrupt PDFs [0].

Using the old parser really isn't a good idea; it's known to be pretty broken. I think that we would be much better off making sure the new parser can handle truncated files. We already do a lot of repair in the new parser, so this doesn't seem like too much work? Maybe Andreas can comment further?

Do we have some JIRA issues which identify some of these cases?

John

> Would you be willing to distribute a shaded/relocated 1.8.x app so that we
> could load both 1.8.x and 2.0.0 in the same JVM without collisions? Or, is
> there a better solution?

I wouldn't recommend doing that, because you're going to be stuck with using 1.8 for everything, not just parsing, at least as far as corrupt/truncated files are concerned.

John

> Thank you!
>
> Cheers,
>
> Tim
>
> [0]
> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360