[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser‏‏

2015-03-18 Thread Andreas Lehmkühler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14367950#comment-14367950
 ] 

Andreas Lehmkühler commented on TIKA-1098:
--

The parser stumbles over a malformed annotation:
{code}
1386 0 obj <<
/Type /Annot
/Border[0 0 0]/H/N/C[.5 .5 .5]
/Rect [307.1979 10.1075 314.1718 19.5der[0 0 0]/H/N/C[.5 .5 .5]
/Rect [276.3138 10.1075 283.2876 19.572]
/Subtype /Link
/A << /S /GoTo /D (Navigation36) >>
>> endobj
{code}
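The first rectangle is the problem: the array opened after the first /Rect is cut off at "19.5" and never closed, and it runs straight into what looks like the tail of a /Border entry.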
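
A rough way to see why that rectangle trips a parser is to check how the delimiters in the quoted object pair up. The sketch below is only an illustration in plain Java, not PDFBox's parsing logic: the class name `AnnotationDelimiterCheck` and the method `firstMismatch` are made up for this example, and it ignores PDF strings, comments and streams.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class AnnotationDelimiterCheck {

    /** Dictionary body of object 1386 as quoted above ("1386 0 obj" and "endobj" left out). */
    static final String ANNOTATION = String.join("\n",
            "<<",
            "/Type /Annot",
            "/Border[0 0 0]/H/N/C[.5 .5 .5]",
            "/Rect [307.1979 10.1075 314.1718 19.5der[0 0 0]/H/N/C[.5 .5 .5]",
            "/Rect [276.3138 10.1075 283.2876 19.572]",
            "/Subtype /Link",
            "/A << /S /GoTo /D (Navigation36) >>",
            ">>");

    /**
     * Returns the offset of the first closing delimiter that does not match the
     * innermost open one, s.length() if something is still open at the end,
     * or -1 if every '[' / ']' and '<<' / '>>' pairs up.
     */
    static int firstMismatch(String s) {
        Deque<Character> open = new ArrayDeque<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean doubled = i + 1 < s.length() && s.charAt(i + 1) == c;
            if (c == '[') {
                open.push('[');
            } else if (c == '<' && doubled) {
                open.push('<');
                i++; // skip the second '<'
            } else if (c == ']') {
                if (open.isEmpty() || open.pop() != '[') {
                    return i;
                }
            } else if (c == '>' && doubled) {
                if (open.isEmpty() || open.pop() != '<') {
                    return i;
                }
                i++; // skip the second '>'
            }
        }
        return open.isEmpty() ? -1 : s.length();
    }

    public static void main(String[] args) {
        int offset = firstMismatch(ANNOTATION);
        System.out.println("first mismatch at offset " + offset);
        if (offset >= 0 && offset < ANNOTATION.length()) {
            System.out.println("text at that offset: " + ANNOTATION.substring(offset));
        }
    }
}
```

With the quoted object, the mismatch is reported at the final `>>`: the `[` opened by the first /Rect is still the innermost open delimiter when the dictionary tries to close.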

> not able to parse pdfs/docs/ppts using 1.1 tika parser‏‏
> 
>
> Key: TIKA-1098
> URL: https://issues.apache.org/jira/browse/TIKA-1098
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.1
> Environment: linux redhat
> Reporter: Qian Diao
> Attachments: url_1763_approx-alg-notes.pdf
>
>
> Hi,
> I got some parsing problems when using Tika 1.1 for the attached pdf file.
> my code (Test.java):
> import java.io.File;
> import java.io.InputStream;
> import java.io.FileInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.parser.ParseContext;
> import org.apache.tika.parser.Parser;
> import org.apache.tika.parser.html.BoilerpipeContentHandler;
> import org.apache.tika.sax.BodyContentHandler;
> import org.apache.tika.parser.html.HtmlParser;
> import de.l3s.boilerpipe.extractors.ArticleExtractor;
>
> public class Test {
>     private static final String validBoilerpipeFilenameRegEx =
>             ".*(\\.)(htm|html|shtml|php|asp|aspx)$";
>
>     public String parseFile(File inFile) {
>         if (inFile == null || !inFile.isFile() || !inFile.canRead())
>             return null;
>
>         InputStream is = null;
>         String outputText = "";
>         try {
>             // Open input stream
>             is = new FileInputStream(inFile);
>             // Prepare parser
>             BodyContentHandler contenthandler = new BodyContentHandler(-1);
>             Metadata metadata = new Metadata();
>             metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
>             ParseContext pc = new ParseContext();
>             // Call parse with boilerpipe if valid boilerpipe extension;
>             // otherwise, call regular parse.
>             if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
>                 Parser parser = new AutoDetectParser();
>                 parser.parse(is, contenthandler, metadata, pc);
>             }
>             else {
>                 Parser parser = new HtmlParser();
>                 BoilerpipeContentHandler bh = new
>                         BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
>                 parser.parse(is, bh, metadata, pc);
>             }
>             // Prepare text for write
>             outputText = contenthandler.toString();
>         } catch (Exception e) {
>             System.out.println(e);
>             return null;
>         } finally {
>             try {
>                 if (is != null)
>                     is.close();
>             } catch (Exception e) {}
>         }
>
>         return outputText;
>     }
> =output
> org.apache.tika.exception.TikaException: Unable to extract PDF content
> url_1763_approx-alg-notes.pdf
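
For readers following the reporter's code: the file-name pattern decides which parser branch runs. The sketch below reuses the same regular expression in plain Java so it can be tried without Tika on the classpath. The class name `ExtensionDispatch`, the helper `usesBoilerpipe` and the two extra file names are illustrative assumptions, not part of the report.

```java
import java.util.regex.Pattern;

public class ExtensionDispatch {

    // Same pattern as validBoilerpipeFilenameRegEx in the reporter's Test.java.
    static final Pattern HTML_LIKE =
            Pattern.compile(".*(\\.)(htm|html|shtml|php|asp|aspx)$");

    /** Mirrors the if/else in parseFile: true means the HtmlParser + boilerpipe branch. */
    static boolean usesBoilerpipe(String fileName) {
        return HTML_LIKE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "url_1763_approx-alg-notes.pdf", // the attached file
            "index.html",                    // hypothetical example
            "INDEX.HTML"                     // hypothetical example; the pattern is case-sensitive
        };
        for (String name : names) {
            String branch = usesBoilerpipe(name) ? "HtmlParser + boilerpipe" : "AutoDetectParser";
            System.out.println(name + " -> " + branch);
        }
    }
}
```

The attached PDF does not match the pattern, so it goes through the AutoDetectParser branch, which agrees with the AutoDetectParser.parse frame in the stack trace posted in this thread.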



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser‏‏

2015-03-13 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361461#comment-14361461
 ] 

Tyler Palsulich commented on TIKA-1098:
---

Tika still can't parse this file. I tried with PDFBox 1.8.9 SNAPSHOT, but hit 
the following exception:
{code}
➜  trunk  java -jar ~/Downloads/pdfbox.jar ExtractText ~/Downloads/test.pdf
Mar 13, 2015 9:14:33 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 2390 is wrong. Fall back to reading stream until 'endstream'.
Mar 13, 2015 9:14:33 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference
ExtractText failed with the following exception:
java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 364863
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1362)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:249)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:356)
    at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1264)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:641)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1239)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1129)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:212)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
{code}

Does anyone recognize this error? Or, should I open a new issue with PDFBox?


[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser‏‏

2014-10-23 Thread Andreas Lehmkühler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181630#comment-14181630
 ] 

Andreas Lehmkühler commented on TIKA-1098:
--

I've finally solved PDFBOX-1273. The fix will be part of the upcoming versions 
1.8.8 and 2.0.0.

Thanks for your patience :-)



[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser‏‏

2013-03-27 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615793#comment-13615793
 ] 

Michael McCandless commented on TIKA-1098:
--

Hmm, PDFBox is hitting that exception when Tika calls .getAnnotations.

You might be able to work around this if you call 
PDFParser.setExtractAnnotationText(false)?  Then Tika shouldn't call 
.getAnnotations...

It looks like PDFBOX-1273 is the same issue.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser‏‏

2013-03-27 Thread Qian Diao (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615781#comment-13615781
 ] 

Qian Diao commented on TIKA-1098:
-

Here is the stack trace:

org.apache.tika.exception.TikaException: Unable to extract PDF content
url_1763_approx-alg-notes.pdf
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:80)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:140)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.cisco.nc.autovocab.Test.parseFile(Test.java:36)
    at com.cisco.nc.autovocab.Test.main(Test.java:70)
Caused by: java.io.IOException: Error: Unknown annotation type null
    at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation.createAnnotation(PDAnnotation.java:165)
    at org.apache.pdfbox.pdmodel.PDPage.getAnnotations(PDPage.java:797)
    at org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:142)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:444)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:63)
    ... 6 more



[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser‏‏

2013-03-26 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13614932#comment-13614932
 ] 

Nick Burch commented on TIKA-1098:
--

Could you please post the full stacktrace, so we can work out where the problem 
is coming from?
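
For anyone unsure how to get the full trace: the reporter's catch block calls System.out.println(e), which prints only the outermost exception and its message. A minimal sketch of the difference, in plain Java with a generic Exception standing in for TikaException (the class name `CauseChain` and the helper `describe` are made up for this example):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CauseChain {

    /** Collects the exception and each nested cause, outermost first. */
    static List<String> describe(Throwable t) {
        List<String> lines = new ArrayList<>();
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            lines.add(cur.getClass().getName() + ": " + cur.getMessage());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for a wrapped parser failure, so the sketch needs no Tika jar.
        // The inner message is a placeholder, not real parser output.
        Exception wrapped = new Exception("Unable to extract PDF content",
                new IOException("underlying parser error"));

        // What the reporter's catch block does: outermost exception only.
        System.out.println(wrapped);

        // Every layer of the cause chain, one line each.
        describe(wrapped).forEach(System.out::println);

        // Full stack trace including the "Caused by:" section (written to stderr).
        wrapped.printStackTrace();
    }
}
```

Replacing System.out.println(e) with e.printStackTrace() in the catch block should be enough to produce the kind of trace requested here.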
