[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
    [ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367950#comment-14367950 ]

Andreas Lehmkühler commented on TIKA-1098:
------------------------------------------

The parser stumbles upon a malformed annotation
{code}
1386 0 obj
<<
/Type /Annot
/Border[0 0 0]/H/N/C[.5 .5 .5]
/Rect [307.1979 10.1075 314.1718 19.5der[0 0 0]/H/N/C[.5 .5 .5]
/Rect [276.3138 10.1075 283.2876 19.572]
/Subtype /Link
/A << /S /GoTo /D (Navigation36) >>
>>
endobj
{code}
The first rectangle is the problem.

> not able to parse pdfs/docs/ppts using 1.1 tika parser
> ------------------------------------------------------
>
>                 Key: TIKA-1098
>                 URL: https://issues.apache.org/jira/browse/TIKA-1098
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.1
>         Environment: linux redhat
>            Reporter: Qian Diao
>         Attachments: url_1763_approx-alg-notes.pdf
>
>
> Hi,
> I got some parsing problems when using Tika 1.1 for the attached pdf file.
> my code (Test.java):
> import java.io.File;
> import java.io.InputStream;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.html.BoilerpipeContentHandler;
> import org.apache.tika.sax.BodyContentHandler;
> import org.apache.tika.parser.html.HtmlParser;
> import de.l3s.boilerpipe.extractors.ArticleExtractor;
> public class Test {
>     private static final String validBoilerpipeFilenameRegEx =
>         ".*(\\.)(htm|html|shtml|php|asp|aspx)$";
>     public String parseFile(File inFile) {
>         if (inFile == null || !inFile.isFile() || !inFile.canRead())
>             return null;
>
>         InputStream is = null;
>         String outputText = "";
>         try {
>             // Open input stream
>             is = new FileInputStream(inFile);
>             // Prepare parser
>             BodyContentHandler contenthandler = new BodyContentHandler(-1);
>             Metadata metadata = new Metadata();
>             metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
>             ParseContext pc = new ParseContext();
>             // Call parse with boilerpipe if valid boilerpipe extension; otherwise, call regular parse.
>             if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
>                 Parser parser = new AutoDetectParser();
>                 parser.parse(is, contenthandler, metadata, pc);
>             }
>             else {
>                 Parser parser = new HtmlParser();
>                 BoilerpipeContentHandler bh = new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
>                 parser.parse(is, bh, metadata, pc);
>             }
>             // Prepare text for write
>             outputText = contenthandler.toString();
>         } catch (Exception e) {
>             System.out.println(e);
>             return null;
>         } finally {
>             try {
>                 if (is != null)
>                     is.close();
>             } catch (Exception e) {}
>         }
>
>         return outputText;
>     }
> =output
> org.apache.tika.exception.TikaException: Unable to extract PDF content url_1763_approx-alg-notes.pdf

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
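
A rough way to spot this kind of damage in a dumped object is to check that each /Rect array holds exactly four numbers. The sketch below is only an illustration under that assumption: the class RectCheck and the method findBadRects are hypothetical names, it relies on plain regular expressions from the Java standard library, and it is not how PDFBox parses or repairs arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RectCheck {
    // Matches "/Rect [" and captures everything up to the first closing bracket.
    private static final Pattern RECT = Pattern.compile("/Rect\\s*\\[([^\\]]*)\\]");
    // A plain PDF number: optional sign, digits with optional fraction, or a bare fraction.
    private static final Pattern NUMBER = Pattern.compile("[+-]?(\\d+(\\.\\d*)?|\\.\\d+)");

    /** Returns the contents of every /Rect array that is not exactly four numbers. */
    public static List<String> findBadRects(String pdfObject) {
        List<String> bad = new ArrayList<>();
        Matcher m = RECT.matcher(pdfObject);
        while (m.find()) {
            String body = m.group(1).trim();
            String[] tokens = body.isEmpty() ? new String[0] : body.split("\\s+");
            boolean ok = tokens.length == 4;
            for (String token : tokens) {
                ok &= NUMBER.matcher(token).matches();
            }
            if (!ok) {
                bad.add(body);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // The annotation quoted in the comment above, joined onto one line.
        String obj = "<< /Type /Annot /Border[0 0 0]/H/N/C[.5 .5 .5] "
                + "/Rect [307.1979 10.1075 314.1718 19.5der[0 0 0]/H/N/C[.5 .5 .5] "
                + "/Rect [276.3138 10.1075 283.2876 19.572] /Subtype /Link >>";
        for (String rect : findBadRects(obj)) {
            System.out.println("Suspicious /Rect contents: " + rect);
        }
    }
}
```

Run against the object above, only the first rectangle is reported, because its fourth entry runs into a stray copy of the /Border text before the bracket closes; the second rectangle passes.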
[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
    [ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361461#comment-14361461 ]

Tyler Palsulich commented on TIKA-1098:
---------------------------------------

Tika still can't parse this file. I tried with PDFBox 1.8.9 SNAPSHOT, but hit the following exception:

{code}
➜  trunk java -jar ~/Downloads/pdfbox.jar ExtractText ~/Downloads/test.pdf
Mar 13, 2015 9:14:33 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 2390 is wrong. Fall back to reading stream until 'endstream'.
Mar 13, 2015 9:14:33 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference
ExtractText failed with the following exception:
java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 364863
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1362)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066)
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:249)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:356)
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1264)
	at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:641)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1239)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1129)
	at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:212)
	at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
	at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
{code}

Does anyone recognize this error? Or should I open a new issue with PDFBox?

[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
    [ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181630#comment-14181630 ]

Andreas Lehmkühler commented on TIKA-1098:
------------------------------------------

I've finally solved PDFBOX-1273. The fix will be part of the upcoming versions 1.8.8 and 2.0.0.

Thanks for your patience :-)

[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
    [ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615793#comment-13615793 ]

Michael McCandless commented on TIKA-1098:
------------------------------------------

Hmm, PDFBox is hitting that exception when Tika calls .getAnnotations.

You might be able to work around this if you call PDFParser.setExtractAnnotationText(false)? Then Tika shouldn't call .getAnnotations...

It looks like PDFBOX-1273 is the same issue.

[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
    [ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615781#comment-13615781 ]

Qian Diao commented on TIKA-1098:
---------------------------------

Here is the stacktrace:

org.apache.tika.exception.TikaException: Unable to extract PDF content url_1763_approx-alg-notes.pdf
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:80)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:140)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at com.cisco.nc.autovocab.Test.parseFile(Test.java:36)
	at com.cisco.nc.autovocab.Test.main(Test.java:70)
Caused by: java.io.IOException: Error: Unknown annotation type null
	at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:165)
	at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:797)
	at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:142)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
	at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:63)
	... 6 more

[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
    [ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614932#comment-13614932 ]

Nick Burch commented on TIKA-1098:
----------------------------------

Could you please post the full stacktrace, so we can work out where the problem is coming from?

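
One way to capture the full trace from code like the reporter's, which only prints the top-level exception with System.out.println(e), is to render the whole cause chain into a string. A minimal sketch using only the Java standard library follows; StackTraces, fullStackTrace and rootCause are hypothetical names, and the exceptions built in main are stand-ins, not real Tika or PDFBox output.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

public class StackTraces {
    /** Renders the exception, its frames, and every "Caused by:" section as one string. */
    public static String fullStackTrace(Throwable t) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        t.printStackTrace(pw);
        pw.flush();
        return sw.toString();
    }

    /** Follows getCause() down to the innermost exception. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in exceptions that mimic one exception wrapping another.
        Exception inner = new IOException("Error: Unknown annotation type null");
        Exception outer = new Exception("Unable to extract PDF content", inner);
        System.out.println(fullStackTrace(outer));
        System.out.println("Root cause: " + rootCause(outer));
    }
}
```

In a catch block like the one in Test.java, calling e.printStackTrace() directly, or logging fullStackTrace(e), would show the nested cause that a bare println(e) leaves out.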