Re: IOException should be TikaException?

2014-07-01 Thread Jukka Zitting
Hi, On Tue, Jul 1, 2014 at 2:25 PM, Daniel Gibby wrote: > I'm wondering if perhaps PDFParser doesn't use the TaggedInputStream, but > AutoDetectParser does? The TikaInputStream class used by AutoDetectParser extends TaggedInputStream, but currently AutoDetectParser does not leverage that functio

Re: IOException should be TikaException?

2014-07-01 Thread Daniel Gibby
I'm also wondering whether the AutoDetectParser would handle these IOExceptions differently than the PDFParser. Does AutoDetectParser just hand off to the appropriate class, such as PDFParser? In other words, would my problem be solved by using AutoDetectParser instead of going straight to PDFPa

Re: IOException should be TikaException?

2014-07-01 Thread Daniel Gibby
On 7/1/2014 12:04 PM, Jukka Zitting wrote: The TaggedInputStream class [1] was designed for such cases where we want to distinguish between IOExceptions thrown by the underlying InputStream and those thrown by the library processing the stream. It can be us
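
The tagging idea described above can be illustrated with a small sketch that uses only `java.io`. The classes `TaggingStream` and `TaggingDemo`, the `parse` method, and the `%PDF-` header check are hypothetical names made up for this illustration. This is not Tika's `TaggedInputStream`, only an assumption about how such a wrapper might work: exceptions raised by the underlying stream are wrapped with a tag, so the caller can tell them apart from exceptions raised by the code that consumes the stream.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the tagging idea; not Tika's actual class.
class TaggingStream extends FilterInputStream {

    // Wrapper exception that remembers which stream produced it.
    static final class TaggedException extends IOException {
        final Object tag;

        TaggedException(IOException cause, Object tag) {
            super(cause.getMessage(), cause);
            this.tag = tag;
        }
    }

    TaggingStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        try {
            return super.read();
        } catch (IOException e) {
            throw new TaggedException(e, this);
        }
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        try {
            return super.read(b, off, len);
        } catch (IOException e) {
            throw new TaggedException(e, this);
        }
    }

    // True only if the exception was raised by this stream's underlying source.
    boolean isCauseOf(IOException e) {
        return e instanceof TaggedException && ((TaggedException) e).tag == this;
    }
}

class TaggingDemo {
    // A source that always fails, simulating a broken disk or network read.
    static InputStream brokenSource() {
        return new InputStream() {
            @Override
            public int read() throws IOException {
                throw new IOException("disk error");
            }
        };
    }

    // Hypothetical consumer that, like the parser in this thread, reports
    // malformed input as a plain IOException.
    static void parse(InputStream in) throws IOException {
        byte[] header = new byte[5];
        int n = in.read(header, 0, header.length);
        String text = n < 0 ? "" : new String(header, 0, n, StandardCharsets.US_ASCII);
        if (!text.equals("%PDF-")) {
            throw new IOException("Header doesn't contain versioninfo");
        }
    }

    // Classifies a failure by asking the tagging wrapper whether it caused it.
    static String classify(InputStream raw) {
        TaggingStream tagged = new TaggingStream(raw);
        try {
            parse(tagged);
            return "ok";
        } catch (IOException e) {
            return tagged.isCauseOf(e) ? "stream failure" : "parse failure";
        }
    }
}
```

With this sketch, `TaggingDemo.classify(TaggingDemo.brokenSource())` reports a stream failure, while a readable stream with a bad header reports a parse failure, even though both surface as `IOException`.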

Re: IOException should be TikaException?

2014-07-01 Thread Jukka Zitting
Hi, On Tue, Jul 1, 2014 at 1:51 PM, Nick Burch wrote: > On Fri, 27 Jun 2014, Daniel Gibby wrote: >> Shouldn't this be a TikaException of some type, or at least something >> other than just an IOException? > > One option might be to catch the IOException in the Tika code, then re-throw > it as a T
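
The catch-and-re-throw option quoted above might look roughly like the following sketch. `DocumentParseException` is a hypothetical stand-in for a library-specific exception (it is not `TikaException`), and `thirdPartyParse` is an invented placeholder for the underlying parser call; neither is taken from Tika or PDFBox.

```java
import java.io.IOException;

// Hypothetical library-specific exception; a stand-in, not TikaException.
class DocumentParseException extends Exception {
    DocumentParseException(String message, Throwable cause) {
        super(message, cause);
    }
}

class RethrowDemo {
    // Invented placeholder for a third-party parser that reports
    // malformed input as a plain IOException.
    static void thirdPartyParse(byte[] data) throws IOException {
        if (data.length < 5 || data[0] != '%') {
            throw new IOException("Header doesn't contain versioninfo");
        }
    }

    // Wraps the IOException so callers see a parse-specific exception,
    // keeping the original as the cause for diagnostics.
    static void parse(byte[] data) throws DocumentParseException {
        try {
            thirdPartyParse(data);
        } catch (IOException e) {
            throw new DocumentParseException("Unable to parse document", e);
        }
    }
}
```

A blanket catch like this also wraps genuine I/O failures, so it only helps if the wrapped call cannot fail for reasons unrelated to the document's content.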

Re: IOException should be TikaException?

2014-07-01 Thread Daniel Gibby
I'll send a note over to the PDFBox list and ask what they think. Thanks, Daniel On 7/1/2014 11:51 AM, Nick Burch wrote: On Fri, 27 Jun 2014, Daniel Gibby wrote: java.io.IOException: Error: Header doesn't contain versioninfo at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.

Re: IOException should be TikaException?

2014-07-01 Thread Nick Burch
On Fri, 27 Jun 2014, Daniel Gibby wrote: java.io.IOException: Error: Header doesn't contain versioninfo at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:335) at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:177) at org.apache.pdfbox.pdmodel.PDDocument.load

RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Good to hear. Let us know if you have any other questions or when you run into surprises. From: yeshwanth kumar [mailto:yeshwant...@gmail.com] Sent: Tuesday, July 01, 2014 10:23 AM To: Allison, Timothy B. Subject: Re: Stack Overflow Question hi tim, i forgot to change the BodyContentHandler to

RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Hmmm…. When I use the ToXMLHandler on the test doc submitted with TIKA-1329, I see this: embed4.zip embed4.txt embed_4 That’s a text file inside of a zip file that is itself embedded. I could see doing some parsing on the XML to scrape out contents and grab the file name from the ele

RE: Stack Overflow Question

2014-07-01 Thread Allison, Timothy B.
Did you try the ToXMLHandler? From: yeshwanth kumar [mailto:yeshwant...@gmail.com] Sent: Monday, June 30, 2014 4:50 PM To: Allison, Timothy B. Subject: Re: Stack Overflow Question hi tim, i tried in all possible ways, instead of reading entire zip file i parsed individual zipentries, but even th