Re: shading/relocating 1.8.x?

2016-04-07 Thread John Hewson

> On 29 Mar 2016, at 04:11, Andreas Lehmkühler  wrote:
> 
>> "Allison, Timothy B." <talli...@mitre.org> wrote on 28 March 2016 at 21:02:
>> 
>> 
>> Oh, wow, so it really might be possible without too much work?  I'm more than
>> happy to supply examples. :) 
> Oops, it isn't as simple as it sounds. If we simply swallow the exception,
> PDFBox will most likely run into an NPE. IMHO we have to implement some sort
> of on-demand parser that can handle null values for specific parts of a PDF
> without throwing any exception.

One thought: instead of null it might be possible to return an empty string,
empty dictionary, empty array, empty stream, etc. That way we don’t have to
look for null everywhere.
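
A minimal sketch of the empty-instead-of-null idea, for illustration only. SketchDictionary and its methods are hypothetical stand-ins and are not PDFBox's COS classes; the point is only that lookups can be chained without null checks.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a parsed PDF dictionary; NOT PDFBox's COSDictionary.
final class SketchDictionary {
    // Shared immutable "empty" instance returned in place of null.
    static final SketchDictionary EMPTY =
            new SketchDictionary(Collections.<String, Object>emptyMap());

    private final Map<String, Object> entries;

    SketchDictionary(Map<String, Object> entries) {
        // Defensive copy so later changes to the caller's map are not visible here.
        this.entries = Collections.unmodifiableMap(new LinkedHashMap<String, Object>(entries));
    }

    // Missing or wrongly typed string entries yield "" rather than null.
    String getString(String key) {
        Object value = entries.get(key);
        return value instanceof String ? (String) value : "";
    }

    // Missing or wrongly typed sub-dictionaries yield EMPTY rather than null.
    SketchDictionary getDictionary(String key) {
        Object value = entries.get(key);
        return value instanceof SketchDictionary ? (SketchDictionary) value : EMPTY;
    }

    boolean isEmpty() {
        return entries.isEmpty();
    }
}

final class EmptyObjectDemo {
    public static void main(String[] args) {
        Map<String, Object> info = new LinkedHashMap<String, Object>();
        info.put("Title", "Report");
        Map<String, Object> root = new LinkedHashMap<String, Object>();
        root.put("Info", new SketchDictionary(info));
        SketchDictionary trailer = new SketchDictionary(root);

        // Entry that is present.
        System.out.println(trailer.getDictionary("Info").getString("Title"));
        // Chained lookup through a missing entry: no null check and no NPE.
        System.out.println(trailer.getDictionary("Encrypt").getString("Filter").isEmpty());
    }
}
```

The trade-off is that callers can no longer tell "absent" from "present but empty" unless a separate check (here, isEmpty()) is offered.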

— John

> 
>> Should I open an issue?
> Thanks, but I'm going to do that soon, as some other things should be done as
> well.
> 
> BR
> Andreas
>> 
>> 
>> -----Original Message-----
>> From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
>> Sent: Monday, March 28, 2016 10:58 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: shading/relocating 1.8.x?
>> 
>> On 25.03.2016 at 17:39, John Hewson wrote:
>>> 
>>>> On 23 Mar 2016, at 06:20, Allison, Timothy B.  wrote:
>>>> 
>>>> All,
>>>>  We've upgraded to 2.0.0 on Tika.  Many thanks again!
>>>>  One of our users is interested in continuing to use the
>>>> classic/SequentialParser, or at least having it available as a back-off
>>>> parser for corrupt pdfs [0].
>>> 
>>> Using the old parser really isn’t a good idea, it’s known to be pretty
>>> broken. I think that we would be much better off making sure the new parser
>>> can handle truncated files. We already do a lot of repair in the new parser,
>>> so this doesn’t seem like too much work? Maybe Andreas can comment further?
>> The biggest issue here is a truncated stream or dictionary. The current
>> version simply throws an exception when it runs into such cases. We
>> have to implement some algorithm to ignore such incomplete parts of a PDF
>> where possible.
>> 
>> BR
>> Andreas
>> 
>>> 
>>> Do we have some JIRA issues which identify some of these cases?
>>> 
>>> — John
>>> 
>>>>  Would you be willing to distribute a shaded/relocated 1.8.x app so that
>>>> we could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or,
>>>> is there a better solution?
>>> 
>>> I wouldn’t recommend doing that, because you’re going to be stuck with using
>>> 1.8 for everything, not just parsing, at least as far as corrupt/truncated
>>> files are concerned.
>>> 
>>> — John
>>> 
>>>>  Thank you!
>>>> 
>>>>  Cheers,
>>>> 
>>>> Tim
>>>> 
>>>> [0]
>>>> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
>>>> 
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
>>>> For additional commands, e-mail: dev-h...@pdfbox.apache.org
>>>> 


RE: shading/relocating 1.8.x?

2016-03-29 Thread Allison, Timothy B.
Got it.  That's what I had assumed.

I'll hold off on opening truncated file issue(s) on PDFBox's JIRA...  I opened 
TIKA-1912 to track this on our side.

Thank you, again!

Best,

  Tim

-----Original Message-----
From: Andreas Lehmkühler [mailto:andr...@lehmi.de] 
Sent: Tuesday, March 29, 2016 7:12 AM
To: dev@pdfbox.apache.org
Subject: RE: shading/relocating 1.8.x?

> "Allison, Timothy B." wrote on 28 March 2016 at 21:02:
> 
> 
> Oh, wow, so it really might be possible without too much work?  I'm 
> more than happy to supply examples. :)
Oops, it isn't as simple as it sounds. If we simply swallow the exception,
PDFBox will most likely run into an NPE. IMHO we have to implement some sort
of on-demand parser that can handle null values for specific parts of a PDF
without throwing any exception.

> Should I open an issue?
Thanks, but I'm going to do that soon, as some other things should be done as 
well.

BR
Andreas



RE: shading/relocating 1.8.x?

2016-03-29 Thread Andreas Lehmkühler
> "Allison, Timothy B." wrote on 28 March 2016 at 21:02:
> 
> 
> Oh, wow, so it really might be possible without too much work?  I'm more than
> happy to supply examples. :) 
Oops, it isn't as simple as it sounds. If we simply swallow the exception,
PDFBox will most likely run into an NPE. IMHO we have to implement some sort
of on-demand parser that can handle null values for specific parts of a PDF
without throwing any exception.

> Should I open an issue?
Thanks, but I'm going to do that soon, as some other things should be done as
well.

BR
Andreas
> 
> 
> -----Original Message-----
> From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
> Sent: Monday, March 28, 2016 10:58 AM
> To: dev@pdfbox.apache.org
> Subject: Re: shading/relocating 1.8.x?
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
> >
> >> On 23 Mar 2016, at 06:20, Allison, Timothy B.  wrote:
> >>
> >> All,
> >>   We've upgraded to 2.0.0 on Tika.  Many thanks again!
> >>   One of our users is interested in continuing to use the
> >> classic/SequentialParser, or at least having it available as a back-off
> >> parser for corrupt pdfs [0].
> >
> > Using the old parser really isn’t a good idea, it’s known to be pretty
> > broken. I think that we would be much better off making sure the new parser
> > can handle truncated files. We already do a lot of repair in the new parser,
> > so this doesn’t seem like too much work? Maybe Andreas can comment further?
> The biggest issue here is a truncated stream or dictionary. The current
> version simply throws an exception when it runs into such cases. We
> have to implement some algorithm to ignore such incomplete parts of a PDF
> where possible.
> 
> BR
> Andreas
> 
> >
> > Do we have some JIRA issues which identify some of these cases?
> >
> > — John
> >
> >>   Would you be willing to distribute a shaded/relocated 1.8.x app so that
> >> we could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or,
> >> is there a better solution?
> >
> > I wouldn’t recommend doing that, because you’re going to be stuck with using
> > 1.8 for everything, not just parsing, at least as far as corrupt/truncated
> > files are concerned.
> >
> > — John
> >
> >>   Thank you!
> >>
> >>   Cheers,
> >>
> >>  Tim
> >>
> >> [0]
> >> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
> >>



RE: shading/relocating 1.8.x?

2016-03-28 Thread Allison, Timothy B.
Oh, wow, so it really might be possible without too much work?  I'm more than 
happy to supply examples. :) 

Should I open an issue?


-----Original Message-----
From: Andreas Lehmkuehler [mailto:andr...@lehmi.de] 
Sent: Monday, March 28, 2016 10:58 AM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?

On 25.03.2016 at 17:39, John Hewson wrote:
>
>> On 23 Mar 2016, at 06:20, Allison, Timothy B.  wrote:
>>
>> All,
>>   We've upgraded to 2.0.0 on Tika.  Many thanks again!
>>   One of our users is interested in continuing to use the 
>> classic/SequentialParser, or at least having it available as a back-off 
>> parser for corrupt pdfs [0].
>
> Using the old parser really isn’t a good idea, it’s known to be pretty 
> broken. I think that we would be much better off making sure the new parser 
> can handle truncated files. We already do a lot of repair in the new parser, 
> so this doesn’t seem like too much work? Maybe Andreas can comment further?
The biggest issue here is a truncated stream or dictionary. The current 
version simply throws an exception when it runs into such cases. We 
have to implement some algorithm to ignore such incomplete parts of a PDF 
where possible.

BR
Andreas

>
> Do we have some JIRA issues which identify some of these cases?
>
> — John
>
>>   Would you be willing to distribute a shaded/relocated 1.8.x app so that we 
>> could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or, is 
>> there a better solution?
>
> I wouldn’t recommend doing that, because you’re going to be stuck with using 
> 1.8 for everything, not just parsing, at least as far as corrupt/truncated 
> files are concerned.
>
> — John
>
>>   Thank you!
>>
>>   Cheers,
>>
>>  Tim
>>
>> [0] 
>> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
>>



Re: shading/relocating 1.8.x?

2016-03-28 Thread Andreas Lehmkuehler

On 25.03.2016 at 17:39, John Hewson wrote:
>
>> On 23 Mar 2016, at 06:20, Allison, Timothy B.  wrote:
>>
>> All,
>>   We've upgraded to 2.0.0 on Tika.  Many thanks again!
>>   One of our users is interested in continuing to use the 
>> classic/SequentialParser, or at least having it available as a back-off 
>> parser for corrupt pdfs [0].
>
> Using the old parser really isn’t a good idea, it’s known to be pretty 
> broken. I think that we would be much better off making sure the new parser 
> can handle truncated files. We already do a lot of repair in the new parser, 
> so this doesn’t seem like too much work? Maybe Andreas can comment further?
The biggest issue here is a truncated stream or dictionary. The current 
version simply throws an exception when it runs into such cases. We 
have to implement some algorithm to ignore such incomplete parts of a PDF 
where possible.
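
As a rough illustration of "ignore incomplete parts" versus "throw", here is a small Java sketch. It parses a toy, whitespace-separated dictionary syntax and is not PDFBox code; the class name, the warning texts, and the simplified syntax are all assumptions made for the example. On truncated input it keeps the complete entries it has already read and records a warning instead of throwing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy parser for a simplified dictionary syntax such as
// "<< /Type /Page /Rotate 90 >>". Hypothetical; real PDF syntax is far richer.
final class LenientDictionarySketch {
    final Map<String, String> entries = new LinkedHashMap<String, String>();
    final List<String> warnings = new ArrayList<String>();

    static LenientDictionarySketch parse(String input) {
        LenientDictionarySketch result = new LenientDictionarySketch();
        // split() always returns at least one element, so tokens[0] is safe.
        String[] tokens = input.trim().split("\\s+");
        if (!tokens[0].equals("<<")) {
            result.warnings.add("no dictionary start marker; nothing recovered");
            return result;
        }
        int i = 1;
        while (i < tokens.length) {
            if (tokens[i].equals(">>")) {
                return result; // dictionary is complete
            }
            if (i + 1 >= tokens.length || tokens[i + 1].equals(">>")) {
                // Dangling key: drop it, keep everything read so far.
                result.warnings.add("key " + tokens[i] + " has no value; entry dropped");
                return result;
            }
            result.entries.put(tokens[i], tokens[i + 1]);
            i += 2;
        }
        // Ran out of input before the closing marker.
        result.warnings.add("input ended before '>>'; kept "
                + result.entries.size() + " entries");
        return result;
    }

    public static void main(String[] args) {
        LenientDictionarySketch truncated = parse("<< /Type /Page /Rotate");
        System.out.println(truncated.entries);
        System.out.println(truncated.warnings);
    }
}
```

Note that a token-level check like this cannot detect a value that was itself cut in half; a real implementation would need more context to decide what is safe to keep.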


BR
Andreas

>
> Do we have some JIRA issues which identify some of these cases?
>
> — John
>
>>   Would you be willing to distribute a shaded/relocated 1.8.x app so that we 
>> could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or, is 
>> there a better solution?
>
> I wouldn’t recommend doing that, because you’re going to be stuck with using 
> 1.8 for everything, not just parsing, at least as far as corrupt/truncated 
> files are concerned.
>
> — John
>
>>   Thank you!
>>
>>   Cheers,
>>
>>  Tim
>>
>> [0] 
>> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360




RE: shading/relocating 1.8.x?

2016-03-28 Thread Allison, Timothy B.
See:

https://issues.apache.org/jira/browse/TIKA-1285?focusedCommentId=15214111&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15214111
 

-----Original Message-----
From: John Hewson [mailto:j...@jahewson.com] 
Sent: Friday, March 25, 2016 1:03 PM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?


> On 25 Mar 2016, at 09:44, Tilman Hausherr  wrote:
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
> 
> https://issues.apache.org/jira/browse/PDFBOX-3265
> 

Great! Does anyone else have some others?

— John




RE: shading/relocating 1.8.x?

2016-03-25 Thread Allison, Timothy B.
Hi John,

  Normally, I'd agree.  And, yes, I've been extremely grateful for the effort put 
into dealing with noisy PDFs in 2.0.0.  However, I think that the Tika user 
requesting this is interested in getting what he can from truncated and truly 
broken files -- e.g. Common Crawl data, which (I think) truncates files at 1MB, 
or files whose download was interrupted.  My basic rule for opening an 
issue is: if AR or another PDF parser can't parse it, I'm not going to ask for 
help.
 
   I wouldn't want to direct your efforts toward dealing with the edge cases 
of truncated files.  If the old PDFParser is able to get something out because 
it parsed sequentially, then it would be neat to have that available 
with very little effort.  In Tika, we envision allowing users to configure 
combinations of parsers for a given file; this would be the perfect case for 
the back-off-on-exception strategy -- if there's an exception with 2.0.0, try 
again with 1.8.x.
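
To make the back-off-on-exception idea concrete, here is a minimal Java sketch. TextExtractor and BackOffExtractor are hypothetical names invented for the example and are not Tika's or PDFBox's API; the two extractors in the demo only stand in for "2.0.0" and "1.8.x".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical extraction interface; not Tika's Parser API.
interface TextExtractor {
    String extract(byte[] pdfBytes) throws Exception;
}

// Tries each extractor in order and returns the first successful result.
final class BackOffExtractor implements TextExtractor {
    private final List<TextExtractor> chain;

    BackOffExtractor(TextExtractor... chain) {
        this.chain = Arrays.asList(chain);
    }

    @Override
    public String extract(byte[] pdfBytes) throws Exception {
        Exception first = null;
        for (TextExtractor extractor : chain) {
            try {
                return extractor.extract(pdfBytes);
            } catch (Exception e) {
                // Remember the first failure; attach later ones as suppressed.
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first == null) {
            throw new IllegalStateException("no extractors configured");
        }
        throw first;
    }
}

final class BackOffDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the strict parser: fails on a truncated file.
        TextExtractor strict = bytes -> {
            throw new java.io.IOException("unexpected end of file");
        };
        // Stand-in for the sequential parser: returns whatever it could read.
        TextExtractor sequential = bytes -> "partial text";

        System.out.println(new BackOffExtractor(strict, sequential).extract(new byte[0]));
    }
}
```

If every extractor fails, the caller sees the first exception with the later ones attached as suppressed, so no failure detail is lost.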

  I'll try shading/relocating next week, and see whether that works as expected.

  Thank you, all, again!

  Cheers,

Tim


-----Original Message-----
From: John Hewson [mailto:j...@jahewson.com] 
Sent: Friday, March 25, 2016 1:03 PM
To: dev@pdfbox.apache.org
Subject: Re: shading/relocating 1.8.x?


> On 25 Mar 2016, at 09:44, Tilman Hausherr  wrote:
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
> 
> https://issues.apache.org/jira/browse/PDFBOX-3265
> 

Great! Does anyone else have some others?

— John




Re: shading/relocating 1.8.x?

2016-03-25 Thread John Hewson

> On 25 Mar 2016, at 09:44, Tilman Hausherr  wrote:
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
> 
> https://issues.apache.org/jira/browse/PDFBOX-3265
> 

Great! Does anyone else have some others?

— John




Re: shading/relocating 1.8.x?

2016-03-25 Thread Tilman Hausherr

On 25.03.2016 at 17:39, John Hewson wrote:
> Do we have some JIRA issues which identify some of these cases?

https://issues.apache.org/jira/browse/PDFBOX-3265




Re: shading/relocating 1.8.x?

2016-03-25 Thread John Hewson

> On 23 Mar 2016, at 06:20, Allison, Timothy B.  wrote:
> 
> All,
>  We've upgraded to 2.0.0 on Tika.  Many thanks again!
>  One of our users is interested in continuing to use the 
> classic/SequentialParser, or at least having it available as a back-off 
> parser for corrupt pdfs [0].

Using the old parser really isn’t a good idea, it’s known to be pretty broken. 
I think that we would be much better off making sure the new parser can handle 
truncated files. We already do a lot of repair in the new parser, so this 
doesn’t seem like too much work? Maybe Andreas can comment further?

Do we have some JIRA issues which identify some of these cases?

— John

>  Would you be willing to distribute a shaded/relocated 1.8.x app so that we 
> could load both 1.8.x and 2.0.0 in the same jvm without collisions?  Or, is 
> there a better solution?

I wouldn’t recommend doing that, because you’re going to be stuck with using 
1.8 for everything, not just parsing, at least as far as corrupt/truncated 
files are concerned.

— John

>  Thank you!
> 
>  Cheers,
> 
> Tim
> 
> [0] 
> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
> 