[
https://issues.apache.org/jira/browse/TIKA-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613508#comment-13613508
]
Nick Burch commented on TIKA-1097:
----------------------------------
This would be best split into multiple bug reports, one for each class of
problem. If you have several documents that all trigger the same stacktrace,
I'd suggest you open a new bug for each distinct stacktrace, and upload an
(ideally small) sample file that shows the problem.

As it stands, without stacktraces and without files, there's not a lot we can
do. Additionally, the problem with the PDFs is unlikely to be the same as the
one affecting your .docs, which is why they'd be best as independent bugs.
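
For reference, here is a minimal sketch of one way to capture a full
stacktrace for a bug report. Printing an exception with System.out.println(e)
only shows its class name and message, so the underlying cause is lost. The
class and method names below (StackTraceReport, fullStackTrace, rootCause) are
made up for illustration and are not part of Tika, and the exception built in
main is only a stand-in for whatever parser.parse() actually throws.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

public class StackTraceReport {

    /** Renders the full stack trace of t as text, including any "Caused by:" chain. */
    public static String fullStackTrace(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        t.printStackTrace(out);
        out.flush();
        return buffer.toString();
    }

    /** Walks the cause chain down to the innermost exception. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a wrapped parser failure; in real code this
        // would be the exception caught around parser.parse(...).
        Exception wrapped = new RuntimeException(
                "wrapper exception",
                new IOException("hypothetical underlying failure"));

        System.err.println("Root cause: " + rootCause(wrapped));
        System.err.println(fullStackTrace(wrapped));
    }
}
```

Grouping failing files by root cause rather than by the top-level message
should make it easier to see which ones belong in the same bug report.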
> not able to parse pdfs/docs/ppts using 1.1 and 1.3 tika parser
> ----------------------------------------------------------------
>
> Key: TIKA-1097
> URL: https://issues.apache.org/jira/browse/TIKA-1097
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.1, 1.3
> Environment: linux redhat
> Reporter: Qian Diao
> Fix For: 1.1, 1.3
>
>
> Hi,
> I ran into parsing problems when using Tika 1.1: some pdfs, docs and ppts
> were not getting parsed.
> So I tried 1.3, but some pdfs/docs/ppts still cannot be parsed.
> my code (Test.java):
> import java.io.File;
> import java.io.InputStream;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.html.BoilerpipeContentHandler;
> import org.apache.tika.sax.BodyContentHandler;
> import org.apache.tika.parser.html.HtmlParser;
> import de.l3s.boilerpipe.extractors.ArticleExtractor;
>
> public class Test {
>     private static final String validBoilerpipeFilenameRegEx =
>             ".*(\\.)(htm|html|shtml|php|asp|aspx)$";
>
>     public String parseFile(File inFile) {
>         if (inFile == null || !inFile.isFile() || !inFile.canRead())
>             return null;
>
>         InputStream is = null;
>         String outputText = "";
>         try {
>             // Open input stream
>             is = new FileInputStream(inFile);
>             // Prepare parser
>             BodyContentHandler contenthandler = new BodyContentHandler(-1);
>             Metadata metadata = new Metadata();
>             metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
>             ParseContext pc = new ParseContext();
>             // Call parse with boilerpipe if valid boilerpipe extension;
>             // otherwise, call regular parse.
>             if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
>                 Parser parser = new AutoDetectParser();
>                 parser.parse(is, contenthandler, metadata, pc);
>             }
>             else {
>                 Parser parser = new HtmlParser();
>                 BoilerpipeContentHandler bh =
>                         new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
>                 parser.parse(is, bh, metadata, pc);
>             }
>             // Prepare text for write
>             outputText = contenthandler.toString();
>         } catch (Exception e) {
>             System.out.println(e);
>             return null;
>         } finally {
>             try {
>                 if (is != null)
>                     is.close();
>             } catch (Exception e) {}
>         }
>
>         return outputText;
>     }
> }
> ======
> output:
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
> org.apache.tika.parser.pdf.PDFParser@3a6ac461
> url_4080_ETS11_TAGMatrix_rev070111.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.PDFParser@2b03be0
> url_2275_Paper26Pages253-269.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.PDFParser@4f9a32e0
> url_5889_viz.96.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.PDFParser@4e513d61
> url_1556_sensys_awoo03.pdf
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> url_1763_approx-alg-notes.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.PDFParser@426295eb
> url_5300_sudoku2.pdf?referrer=webcluster&
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.PDFParser@7c2e1f1f
> url_1441_ChoosingYourFirstCSCourse2011-FINAL.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.PDFParser@7eda18ac
> url_4272_20080218121324_723.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@6f0ffb38
> url_2491_2106_crime_scene.doc
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@4cedf389
> url_5227_Romano-Library%20Research%20Series%20-%20March%2029%202007%20FINAL(small).ppt
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@6126f827
> url_5250_linked%20list.ppt
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.PDFParser@3749eb9f
> url_2011_undergrad-brochure.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@3a289d2e
> url_5709_final_presentation_bak.ppt
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
> org.apache.tika.parser.pdf.PDFParser@5ddc0e7a
> url_5319_2011_2012_advising_guidelines.pdf
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
> org.apache.tika.parser.pdf.PDFParser@7dc5ddc9
> url_3502_TheEvolvingRoleTech.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@4963f7a1
> url_2403_class_presentation_Btree.ppt
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.pdf.PDFParser@7ba85d38
> url_4040_fukunaga_jair07_bin.pdf
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@6a8046f4
> url_2472_COP3530OverheadsF99.doc
> Thanks,
> Qian
>
>