Did you try the ToXMLHandler?

From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 4:50 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

i tried in all possible ways, instead of reading entire zip file i parsed individual zipentries, but even then i faced exceptions such as

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@37ba3e33
Caused by: java.io.IOException: Invalid header signature; read 0x725020706968736E, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

org.apache.tika.exception.TikaException: Unable to unpack document stream

org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@6f0ee75a

org.apache.tika.exception.TikaException: Error creating OOXML extractor

any suggestions regarding these issues,

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 2:00 AM, yeshwanth kumar <yeshwant...@gmail.com> wrote:

hi tim,

thanks, for sharing the resources but i am unable to figure out how to implement it in my code, what i didn't understand is the flow and recursive steps, when i ran the RecursiveMetadataParser it still giving the same kind of output as filenames combined with content of the files, i am totally confused.

On Tue, Jul 1, 2014 at 1:29 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

Or use the ToXMLHandler and parse the XML?

From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, June 30, 2014 3:55 PM
To: yeshwanth kumar
Cc: user@tika.apache.org
Subject: RE: Stack Overflow Question

Might want to look into RecursiveMetadata Parser
http://wiki.apache.org/tika/RecursiveMetadata

Or
https://issues.apache.org/jira/i#browse/TIKA-1329?issueKey=TIKA-1329&serverRenderedViewIssue=true

From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 3:24 PM
To: Allison, Timothy B.
Subject: Re: Stack Overflow Question

hi tim,

thanks for quick reply, i changed the contenthandler to bodyContentHandler i got exception for maximum word limit, i used -1 in the bodycontenthandler constructor, now its another problem, filenames and content are present in string returned from handler.tostring() how can i map a fileName to its content.

thanks,
yeshwanth

On Tue, Jul 1, 2014 at 12:35 AM, Allison, Timothy B. <talli...@mitre.org> wrote:

DefaultHandler is effectively a NullHandler; it doesn't store or do anything. Try BodyContentHandler or ToXMLHandler or maybe WriteoutHandler.

If you want to write out each embedded file as a binary, try subclassing EmbeddedResourceHandler.

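The point about DefaultHandler can be seen with the JDK alone, without any Tika classes. The sketch below is only an illustration: HandlerDemo, TextCollector and parseWith are hypothetical names made up for this example, and TextCollector is not Tika's BodyContentHandler, just a handler that keeps the character events it receives. DefaultHandler ignores every event and does not override toString(), which is why logging it prints something like org.xml.sax.helpers.DefaultHandler@5bd8e367.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerDemo {

    // Hypothetical handler for illustration: stores the text it is given.
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Runs a plain JDK SAX parse and returns whatever the handler reports.
    static String parseWith(DefaultHandler handler, String xml) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><p>hello</p><p>world</p></doc>";
        // DefaultHandler discards all events; this prints the class name and a hash.
        System.out.println(parseWith(new DefaultHandler(), xml));
        // The collecting handler prints the concatenated character data.
        System.out.println(parseWith(new TextCollector(), xml));
    }
}
```

A content-collecting handler handed to parser.parse(...) plays the same role as TextCollector here: the parser pushes events into it, and toString() returns what was kept.
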
QUOTE:

i am using Apache Tika 1.5 for parsing the contents present in a zip file, here's my sample code

    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    ContentHandler handler = new DefaultHandler();
    Metadata metadata = new Metadata();
    InputStream stream = null;
    try {
        stream = TikaInputStream.get(new File(zipFilePath));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    try {
        parser.parse(stream, handler, metadata, context);
        logger.info("Content:\t" + handler.toString());
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    } finally {
        try {
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

in the logger statement all i see is org.xml.sax.helpers.DefaultHandler@5bd8e367

i am missing something, unable to figure it out, looking for some help

-----Original Message-----
From: yeshwanth kumar [mailto:yeshwant...@gmail.com]
Sent: Monday, June 30, 2014 1:28 PM
To: d...@tika.apache.org
Subject: Stack Overflow Question

Unable tp read zipfile using Apache Tika
http://stackoverflow.com/q/24495504/1899893?sem=2
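
For the per-entry approach and the "Invalid header signature" exception reported at the top of the thread, a JDK-only diagnostic may help narrow things down. The expected value 0xE11AB1A1E011CFD0, read as a little-endian long, corresponds to the bytes D0 CF 11 E0 A1 B1 1A E1; the value actually read, 0x725020706968736E, decodes the same way to the ASCII text "nship Pr", which suggests the bytes handed to the Office parser were not the start of an OLE2 file. The sketch below maps each entry name to its bytes and checks the first eight bytes against that signature. ZipEntrySniffer, readEntries, looksLikeOle2 and sampleZip are hypothetical names for illustration, and the in-memory sample archive is made up; this is not how Tika itself detects types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipEntrySniffer {

    // First eight bytes of an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the data begins with the OLE2 signature.
    static boolean looksLikeOle2(byte[] data) {
        if (data.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (data[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads every non-directory entry fully, keyed by entry name.
    static Map<String, byte[]> readEntries(InputStream in) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }

    // Made-up sample archive: one mislabelled text file, one entry with the OLE2 signature.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("notes.doc"));
            zip.write("plain text, not OLE2".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("legacy.doc"));
            zip.write(OLE2_MAGIC);
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> entries = readEntries(new ByteArrayInputStream(sampleZip()));
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length
                    + " bytes, OLE2 signature: " + looksLikeOle2(e.getValue()));
        }
    }
}
```

Because each entry is copied into its own byte array, every entry can then be wrapped in a fresh ByteArrayInputStream and parsed independently, which keeps the file name and its content paired and avoids one parse consuming or closing the shared zip stream.
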