[jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note
[ https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040519#comment-14040519 ] Hong-Thai Nguyen commented on TIKA-1350: Richard Johnson (author of java-libpst) is trying to deploy the new version 0.8.1 to Maven Central (ref. https://issues.sonatype.org/browse/OSSRH-8965?focusedCommentId=260254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-260254). When this work is done, we can upgrade the Tika dependency to 0.8.1 to get the fix. OutlookPSTParser: Unknown message type: IPM.Note Key: TIKA-1350 URL: https://issues.apache.org/jira/browse/TIKA-1350 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.7 Reporter: Jonathan Evans Labels: libpst, parser, pst Fix For: 1.7 Original Estimate: 0.2h Remaining Estimate: 0.2h When parsing some emails in a PST file I get the error "Unknown message type: IPM.Note", which prevents them from being parsed. This is because of an extra null byte at the end of the message class string. This has been fixed in version 0.8.1 of java-libpst, so a version bump is all that is required. https://github.com/rjohnsondev/java-libpst/issues/14 I would attempt to do this myself but I am unsure how to open a pull request with SVN. -- This message was sent by Atlassian JIRA (v6.2#6252)
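To illustrate the failure mode described in TIKA-1350, here is a minimal sketch of why a trailing null byte makes the message class comparison fail, and how trimming it restores the match. The class and method names (`MessageClassCleaner`, `stripTrailingNulls`) are hypothetical and are not taken from java-libpst or Tika; the actual fix in java-libpst 0.8.1 may be implemented differently.

```java
// Hypothetical helper, not part of java-libpst or Tika.
final class MessageClassCleaner {

    /** Removes any NUL characters from the end of a message class string. */
    static String stripTrailingNulls(String messageClass) {
        int end = messageClass.length();
        while (end > 0 && messageClass.charAt(end - 1) == '\0') {
            end--;
        }
        return messageClass.substring(0, end);
    }

    public static void main(String[] args) {
        String raw = "IPM.Note\0"; // message class with an extra null byte
        // The raw value does not equal "IPM.Note", so it looks like an unknown type.
        System.out.println("IPM.Note".equals(raw));
        // After trimming, the comparison succeeds.
        System.out.println("IPM.Note".equals(stripTrailingNulls(raw)));
    }
}
```

Because the null byte is invisible in most log output, the error text still reads "IPM.Note", which is why the message looks like it names a type that should already be supported.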
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14040623#comment-14040623 ] William Palmer commented on TIKA-1232: -- I am currently out of the office and will be back on Thursday 26th June 2014. Any FOI requests should be sent to foi-enquir...@bl.uk. Add PDF version to PDFParser output --- Key: TIKA-1232 URL: https://issues.apache.org/jira/browse/TIKA-1232 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Environment: JDK6 Reporter: William Palmer Assignee: Tim Allison Priority: Minor Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch, testComment.pdf I'd like to identify the PDF version of files, this is not currently reported by the PDFParser although the information is available via PDFBox. 
I have attached a patch that adds the format version to the Metadata object. However, I am not familiar enough with the Tika source to know if an alternative metadata key should be used, or this new one added. Comments welcome. -- This message was sent by Atlassian JIRA (v6.2#6252)
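As a rough illustration of what "the PDF version" refers to, the sketch below reads the version from the `%PDF-x.y` header at the start of a file. This is deliberately not the approach in the attached patch, which obtains the version through PDFBox; it uses only the JDK. The class name `PdfHeaderVersion` is hypothetical, and the header value can be overridden by a version entry in the document catalog, so treat the result as an approximation.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the version from the "%PDF-x.y" file header only.
final class PdfHeaderVersion {

    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d+\\.\\d+)");

    /** Returns the header version (e.g. "1.4"), or null if no header is found. */
    static String fromHeader(byte[] firstBytes) {
        // Only look at the start of the file; the header is expected near offset 0.
        int len = Math.min(firstBytes.length, 1024);
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String head = new String(firstBytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n1 0 obj\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(fromHeader(sample));
    }
}
```

In a parser, a value obtained this way (or from PDFBox) would then be stored under whichever metadata key the project settles on, which is the open question raised in the issue.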
Re: [jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note
What Tika version will have the PST support? On Mon, Jun 23, 2014 at 4:23 AM, Hong-Thai Nguyen (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040519#comment-14040519 ] Hong-Thai Nguyen commented on TIKA-1350: Richard Johnson (author of java-libpst) is trying to deploy the new version 0.8.1 to Maven Central (ref. https://issues.sonatype.org/browse/OSSRH-8965?focusedCommentId=260254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-260254). When this work is done, we can upgrade the Tika dependency to 0.8.1 to get the fix.
[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output
[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14040816#comment-14040816 ] Tyler Palsulich commented on TIKA-1232: --- Hey [~talli...@mitre.org]. A couple -- TIKA-758 will need an update (I put up a patch a few days ago corresponding to an older version upgrade, too) and the workaround in TIKA-1325 can be removed (since PDFBOX-2122 is resolved). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox
[ https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14040819#comment-14040819 ] Tyler Palsulich commented on TIKA-617: -- Hi, Are you still having this issue? Do you have the/a PDF which caused this exception? Thanks! Tyler Series of exceptions from PDFBox Key: TIKA-617 URL: https://issues.apache.org/jira/browse/TIKA-617 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.10 Reporter: Erik Hetzner Hi, I am getting the following exception from PDFBox. Thank you! (If I should file these upstream at PDFBox first, please let me know.) {noformat} $ java -jar tika-app-1.0-SNAPSHOT.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf /dev/null ERROR - Stop reading corrupt stream INFO - unsupported/disabled operation: f24.481 INFO - unsupported/disabled operation: ree)n. WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44) at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251) at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225) at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442) at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322) at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197) at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91) INFO - unsupported/disabled operation: i- INFO - unsupported/disabled operation: R4% INFO - unsupported/disabled operation: ) INFO - unsupported/disabled operation: Re.8 INFO - unsupported/disabled operation: e. INFO - unsupported/disabled operation: FE)- WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44) at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251) at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225) at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442) at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322) at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302) at 
org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91) INFO - unsupported/disabled operation: R3% INFO - unsupported/disabled operation: T Exception in thread main org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) at
[jira] [Commented] (TIKA-1187) java.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/TIKA-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14040830#comment-14040830 ] Tyler Palsulich commented on TIKA-1187: --- Hi [~Guffi], Did you ever get this issue resolved? Do you still have the problematic file? Thanks, Tyler java.lang.OutOfMemoryError: Java heap space --- Key: TIKA-1187 URL: https://issues.apache.org/jira/browse/TIKA-1187 Project: Tika Issue Type: Bug Components: general Affects Versions: 1.3 Environment: Ubuntu Reporter: GURFAN Priority: Critical Original Estimate: 612h Remaining Estimate: 612h Hi, While parsing the content we are getting below exception in parse method. The file which we are parsing is 1 mb. TIKA JAR: tika-core-1.3.jar File size: 1 MB. Parser parser = new AutoDetectParser(); parser.parse(is, handler, metaData, new ParseContext()); java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2734) at java.util.ArrayList.ensureCapacity(ArrayList.java:167) at java.util.ArrayList.add(ArrayList.java:351) at org.apache.fontbox.ttf.GlyfCompositeDescript.(GlyfCompositeDescript.java:60) at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:63) at org.apache.fontbox.ttf.GlyphTable.initData(GlyphTable.java:71) at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:163) at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:61) at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:90) at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:26) at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:66) at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:26) at org.apache.tika.parser.font.TrueTypeParser.parse(TrueTypeParser.java:65) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at com.impetus.vajra.parser.tika.TikaParser.processContent(TikaParser.java:96) at com.impetus.vajra.storm.helper.TextAnalyserBoltHelper.execute(TextAnalyserBoltHelper.java:283) at com.impetus.vajra.storm.TextAnalyserBolt.execute(TextAnalyserBolt.java:182) at backtype.storm.daemon.executor$fn__4050$tuple_action_fn__4052.invoke(executor.clj:566) at backtype.storm.daemon.executor$mk_task_receiver$fn__3976.invoke(executor.clj:345) at backtype.storm.disruptor$clojure_handler$reify__1606.onEvent(disruptor.clj:43) at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:84) at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:58) at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:62) at backtype.storm.daemon.executor$fn__4050$fn__4059$fn__4106.invoke(executor.clj:658) at backtype.storm.util$async_loop$fn__465.invoke(util.clj:377) at clojure.lang.AFn.run(AFn.java:24) at java.lang.Thread.run(Thread.java:662) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TIKA-1352) Upgrade to PDFBox 1.8.6
Tim Allison created TIKA-1352: - Summary: Upgrade to PDFBox 1.8.6 Key: TIKA-1352 URL: https://issues.apache.org/jira/browse/TIKA-1352 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1352) Upgrade to PDFBox 1.8.6
[ https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1352: -- Attachment: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip Diffs btwn PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130 workarounds removed) on a random selection of 10k pdf files in govdocs1. Both runs used the older sequential parser. The table file is a tab-delimited UTF-16LE file. This is a first go at the initial/raw output of comparison code for TIKA-1302. Much more work remains. The ZipBomb exceptions show that we cannot yet remove PDFBox-1130 workarounds. Other than that, we should probably look at the few hundred files that have token overlap of 98%. To view the original files from gov docs (e.g. 765470), navigate to: http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf Upgrade to PDFBox 1.8.6 --- Key: TIKA-1352 URL: https://issues.apache.org/jira/browse/TIKA-1352 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip -- This message was sent by Atlassian JIRA (v6.2#6252)
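For anyone who wants to load the attached table programmatically, here is a minimal sketch of reading a tab-delimited UTF-16LE file with the JDK only. The class name `Utf16TsvReader` and the column values in the demo are hypothetical; the real column layout of the attachment is not described in this message.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for tab-delimited UTF-16LE data.
final class Utf16TsvReader {

    /** Decodes UTF-16LE bytes and splits them into rows of tab-separated cells. */
    static List<String[]> parse(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_16LE);
        // The UTF-16LE decoder keeps a leading byte order mark as U+FEFF, so drop it.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\r?\n")) {
            if (!line.isEmpty()) {
                // Limit -1 keeps trailing empty cells instead of discarding them.
                rows.add(line.split("\t", -1));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample content; the column names are assumptions.
        byte[] sample = "\uFEFFfile\toverlap\n765470.pdf\t0.98\n"
                .getBytes(StandardCharsets.UTF_16LE);
        for (String[] row : parse(sample)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Opening the file with a default (often UTF-8) reader would show a NUL after every ASCII character, so naming the charset explicitly is the important part.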
[jira] [Updated] (TIKA-1352) Upgrade to PDFBox 1.8.6
[ https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1352: -- Description: This is to track moving to PDFBox 1.8.6. Upgrade to PDFBox 1.8.6 --- Key: TIKA-1352 URL: https://issues.apache.org/jira/browse/TIKA-1352 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip This is to track moving to PDFBox 1.8.6. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: [jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note
On Mon, 23 Jun 2014, kevin slote wrote: What tika version will have the pst support? See TIKA-623 - PST support is already in trunk, and will be included in Tika 1.6 when that gets released Nick
[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox
[ https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14040968#comment-14040968 ] Erik Hetzner commented on TIKA-617: --- The URL containing the PDF is listed in the above comment. Trying it with 1.5 gives different errors and generates an incomplete XML file: {noformat} java -jar tika-app-1.5.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf /dev/null ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException Exception in thread main org.apache.tika.exception.TikaException: Unable to extract PDF content at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:122) at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112) Caused by: java.io.IOException at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138) at 
org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:336) at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:248) at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:183) at org.apache.pdfbox.pdfparser.PDFStreamParser.init(PDFStreamParser.java:107) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251) at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235) at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215) at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456) at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381) at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340) at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:106) ... 7 more Caused by: java.util.zip.DataFormatException: invalid distance too far back at java.util.zip.Inflater.inflateBytes(Native Method) at java.util.zip.Inflater.inflate(Inflater.java:259) at java.util.zip.Inflater.inflate(Inflater.java:280) at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169) at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98) ... 18 more {noformat}
Re: [jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note
Do you already have this issue fixed? I encountered something similar to this and already worked it out. On Mon, Jun 23, 2014 at 1:03 PM, Nick Burch apa...@gagravarr.org wrote: On Mon, 23 Jun 2014, kevin slote wrote: What tika version will have the pst support? See TIKA-623 - PST support is already in trunk, and will be included in Tika 1.6 when that gets released Nick
[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox
[ https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041025#comment-14041025 ] Tim Allison commented on TIKA-617: -- Confirmed still a problem with both classic (sequential) and newer NonSequentialParser in Tika trunk with PDFBox 1.8.6. Please open an issue in PDFBox if you haven't done so already. Thank you! Found same issue here (although Adobe couldn't read this one either without serious problems): http://digitalcorpora.org/corp/nps/files/govdocs1/898/898385.pdf
[jira] [Comment Edited] (TIKA-1352) Upgrade to PDFBox 1.8.6
[ https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14040934#comment-14040934 ] Tim Allison edited comment on TIKA-1352 at 6/23/14 5:39 PM: Diffs btwn PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130 workarounds removed) on a random selection of 10k pdf files in govdocs1. Both runs used the older sequential parser. The table file is a tab-delimited UTF-16LE file. This is a first go at the initial/raw output of comparison code for TIKA-1302. Much more work remains. The ZipBomb exceptions are caused by my incorrect first attempt to remove PDFBox-1130 workarounds. These will go away. Other than that, we should probably look at the few hundred files that have token overlap of 98%. To view the original files from gov docs (e.g. 765470), navigate to: http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf was (Author: talli...@mitre.org): Diffs btwn PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130 workarounds removed) on a random selection of 10k pdf files in govdocs1. Both runs used the older sequential parser. The table file is a tab-delimited UTF-16LE file. This is a first go at the initial/raw output of comparison code for TIKA-1302. Much more work remains. The ZipBomb exceptions show that we cannot yet remove PDFBox-1130 workarounds. Other than that, we should probably look at the few hundred files that have token overlap of 98%. To view the original files from gov docs (e.g. 765470), navigate to: http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf Upgrade to PDFBox 1.8.6 --- Key: TIKA-1352 URL: https://issues.apache.org/jira/browse/TIKA-1352 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip This is to track moving to PDFBox 1.8.6. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata
Steve R created TIKA-1353: - Summary: OpenDocumentParser doesn't correctly process metadata Key: TIKA-1353 URL: https://issues.apache.org/jira/browse/TIKA-1353 Project: Tika Issue Type: Bug Components: metadata, parser Affects Versions: 1.5 Reporter: Steve R

When using OpenDocumentParser, the metadata isn't set correctly. When using it to write an HTML file, the only metadata that it knows about is the content type, because it is set ahead of time. The problem is that when iterating over the zip contents, meta.xml isn't processed before content.xml. The metadata set on the parse object is correct after parse() returns; however, the resulting HTML file is missing all of the metadata. Changing the code to be

    boolean parsedMetaData = false;
    boolean delayLoadContent = false;
    while (entry != null) {
        ...
        } else if (entry.getName().equals("meta.xml")) {
            meta.parse(zip, new DefaultHandler(), metadata, context);
            parsedMetaData = true;
            if (delayLoadContent) {
                if (content instanceof OpenDocumentContentParser) {
                    ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
                } else {
                    // Foreign content parser was set:
                    content.parse(zip, handler, metadata, context);
                }
            }
        } else if (entry.getName().endsWith("content.xml")) {
            if (!parsedMetaData) {
                delayLoadContent = true;
            } else {
                if (content instanceof OpenDocumentContentParser) {
                    ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
                } else {
                    // Foreign content parser was set:
                    content.parse(zip, handler, metadata, context);
                }
            }
        }

works as expected. -- This message was sent by Atlassian JIRA (v6.2#6252)
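The ordering problem in TIKA-1353 can be sketched independently of Tika. The example below assumes the archive is read as a forward-only zip stream, so if content.xml arrives before meta.xml its bytes are buffered and only handled once meta.xml has been seen. All names here (`MetaFirstOrdering`, `process`, `zipOf`) are hypothetical, and the "parsing" is a stand-in that just records the order; it is not the OpenDocumentParser implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: handle meta.xml before content.xml regardless of zip order.
final class MetaFirstOrdering {

    /** Returns entry names in the order they were handled. */
    static List<String> process(byte[] zipBytes) {
        List<String> handled = new ArrayList<>();
        String delayedName = null;    // content entry held back until meta.xml is seen
        byte[] delayedBytes = null;   // its buffered bytes (a real parser would parse these)
        boolean metaParsed = false;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                String name = e.getName();
                if (name.equals("meta.xml")) {
                    readAll(zip);                 // stand-in for parsing the metadata
                    handled.add(name);
                    metaParsed = true;
                    if (delayedBytes != null) {   // now handle the buffered content
                        handled.add(delayedName);
                        delayedBytes = null;
                    }
                } else if (name.endsWith("content.xml")) {
                    if (metaParsed) {
                        readAll(zip);             // stand-in for parsing the content
                        handled.add(name);
                    } else {
                        delayedName = name;
                        delayedBytes = readAll(zip);
                    }
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        if (delayedBytes != null) {
            handled.add(delayedName);             // archive had no meta.xml at all
        }
        return handled;
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Demo helper: builds an in-memory zip with the given entry names, in order. */
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String name : names) {
                out.putNextEntry(new ZipEntry(name));
                out.write(("<" + name + "/>").getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(process(zipOf("mimetype", "content.xml", "meta.xml")));
    }
}
```

Buffering is used because a zip stream cannot be rewound to an earlier entry once a later one has been reached. The trade-off is memory: content.xml is held in full until meta.xml arrives, which may matter for large documents.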
Review Request 22892: New parser for ENVI header files
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/22892/ --- Review request for tika. Bugs: TIKA-1274 https://issues.apache.org/jira/browse/TIKA-1274 Repository: tika Description --- New parser for ENVI header files. Note, this is a parser for header files that will have an associated, separate data file. This parser will not extract content from the data file. Diffs - trunk/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java PRE-CREATION trunk/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/envi_test_header.hdr PRE-CREATION Diff: https://reviews.apache.org/r/22892/diff/ Testing --- Text parsing test completed with file envi_test_header.hdr. Thanks, Ann Burgess
Re: Review Request 22892: New parser for ENVI header files
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/22892/#review46459 --- Looks great Annie, with the package updates I think I can commit this. - Chris Mattmann
Re: Review Request 22892: New parser for ENVI header files
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/22892/#review46457 --- trunk/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java https://reviews.apache.org/r/22892/#comment81848 org.apache.tika.parser.envi trunk/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java https://reviews.apache.org/r/22892/#comment81849 org.apache.tika.parser.envi - Chris Mattmann On June 23, 2014, 9:43 p.m., Ann Burgess wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/22892/ --- (Updated June 23, 2014, 9:43 p.m.) Review request for tika. Bugs: TIKA-1274 https://issues.apache.org/jira/browse/TIKA-1274 Repository: tika Description --- New parser for ENVI header files. Note, this is a parser for header files that will have an associated, separate data file. This parser will not extract content from the data file. Diffs - trunk/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java PRE-CREATION trunk/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/envi_test_header.hdr PRE-CREATION Diff: https://reviews.apache.org/r/22892/diff/ Testing --- Text parsing test completed with file envi_test_header.hdr. Thanks, Ann Burgess
Re: Review Request 22892: New parser for ENVI header files
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/22892/ --- (Updated June 23, 2014, 10:01 p.m.) Review request for tika. Bugs: TIKA-1274 https://issues.apache.org/jira/browse/TIKA-1274 Repository: tika Description --- New parser for ENVI header files. Note, this is a parser for header files that will have an associated, separate data file. This parser will not extract content from the data file. Diffs (updated) - trunk/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java PRE-CREATION trunk/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/envi_test_header.hdr PRE-CREATION Diff: https://reviews.apache.org/r/22892/diff/ Testing --- Text parsing test completed with file envi_test_header.hdr. Thanks, Ann Burgess
Re: Review Request 22892: New parser for ENVI header files
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/22892/ --- (Updated June 23, 2014, 11:14 p.m.) Review request for tika. Bugs: TIKA-1274 https://issues.apache.org/jira/browse/TIKA-1274 Repository: tika Description --- New parser for ENVI header files. Note, this is a parser for header files that will have an associated, separate data file. This parser will not extract content from the data file. Diffs (updated) - trunk/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java PRE-CREATION trunk/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java PRE-CREATION trunk/tika-parsers/src/test/resources/test-documents/envi_test_header.hdr PRE-CREATION Diff: https://reviews.apache.org/r/22892/diff/ Testing --- Text parsing test completed with file envi_test_header.hdr. Thanks, Ann Burgess
[jira] [Resolved] (TIKA-1352) Upgrade to PDFBox 1.8.6
[ https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-1352. --- Resolution: Fixed Fix Version/s: 1.6 It turns out that we still need the PDFBOX-1130 workarounds because of the newly discovered PDFBOX-2160. Other than that, the results are quite similar with 1.8.5 and 1.8.6. We can remove quite a bit of clutter in the test cases because of improvements in 1.8.6, and we can unignore another test. Many thanks to our colleagues at PDFBox for all of these mods, especially [~tilman], [~jahewson] and [~lehmi]! I included some of the code cleanups recommended by [~tpalsulich] over on TIKA-758. Thank you, Tyler! Finally, [~lfcnassif], if you have a chance to run trunk against your test set, please let us know if there are any surprises. I'd prefer not to have a rerun of TIKA-1233. Fixed r1604989. Upgrade to PDFBox 1.8.6 --- Key: TIKA-1352 URL: https://issues.apache.org/jira/browse/TIKA-1352 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.6 Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip This is to track moving to PDFBox 1.8.6. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-758) Address TODOs when we upgrade to next PDFBox release
[ https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041569#comment-14041569 ] Tim Allison commented on TIKA-758: -- [~tpalsulich], thank you for the patch. It turns out that we need to retain this workaround until PDFBOX-2160's fix makes it into the next release of PDFBox (1.8.7?). I included your code cleanups in the mods for TIKA-1352. Thank you! Address TODOs when we upgrade to next PDFBox release Key: TIKA-758 URL: https://issues.apache.org/jira/browse/TIKA-758 Project: Tika Issue Type: Improvement Components: parser Reporter: Michael McCandless Attachments: TIKA-758.Palsulich.061714.patch Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in the code when we next upgrade PDFBox. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1352) Upgrade to PDFBox 1.8.6
[ https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041615#comment-14041615 ] Hudson commented on TIKA-1352: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #62 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/62/]) TIKA-1352 update CHANGES.txt (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1604995) * /tika/trunk/CHANGES.txt TIKA-1352 upgrade to PDFBox 1.8.6 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1604989) * /tika/trunk/tika-parsers/pom.xml * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java Upgrade to PDFBox 1.8.6 --- Key: TIKA-1352 URL: https://issues.apache.org/jira/browse/TIKA-1352 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.6 Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip This is to track moving to PDFBox 1.8.6. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1352) Upgrade to PDFBox 1.8.6
[ https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041632#comment-14041632 ] Hudson commented on TIKA-1352: -- SUCCESS: Integrated in tika-trunk-jdk1.6 #62 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/62/]) TIKA-1352 update CHANGES.txt (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1604995) * /tika/trunk/CHANGES.txt TIKA-1352 upgrade to PDFBox 1.8.6 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1604989) * /tika/trunk/tika-parsers/pom.xml * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java Upgrade to PDFBox 1.8.6 --- Key: TIKA-1352 URL: https://issues.apache.org/jira/browse/TIKA-1352 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Fix For: 1.6 Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip This is to track moving to PDFBox 1.8.6. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata
[ https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041639#comment-14041639 ]

Steve R commented on TIKA-1353:
---

Ignore my suggested code example; it clearly doesn't work. My question is now this: why is the following code commented out? It seems to work.

    /*
     * ZipFile zipFile;
     * if (stream instanceof TikaInputStream) {
     *     TikaInputStream tis = (TikaInputStream) stream;
     *     Object container = ((TikaInputStream) stream).getOpenContainer();
     *     if (container instanceof ZipFile) {
     *         zipFile = (ZipFile) container;
     *     } else if (tis.hasFile()) {
     *         zipFile = new ZipFile(tis.getFile());
     *     }
     * }
     */
    // TODO: if incoming IS is a TIS with a file
    // associated, we should open ZipFile so we can
    // visit metadata, mimetype first; today we lose
    // all the metadata if meta.xml is hit after
    // content.xml in the stream. Then we can still
    // read-once for the content.xml.

OpenDocumentParser doesn't correctly process metadata
-
Key: TIKA-1353
URL: https://issues.apache.org/jira/browse/TIKA-1353
Project: Tika
Issue Type: Bug
Components: metadata, parser
Affects Versions: 1.5
Reporter: Steve R
Original Estimate: 24h
Remaining Estimate: 24h

When using OpenDocumentParser, the metadata isn't set correctly. When using it to write an HTML file, the only metadata that it knows about is the content type, because that is set ahead of time. The problem is that when iterating over the zip contents, meta.xml isn't processed before content.xml. The metadata set on the parse object is correct after parse() returns; however, the resulting HTML file is missing all of the metadata. Changing the code to be

    boolean parsedMetaData = false;
    boolean delayLoadContent = false;
    while (entry != null) {
        ...
        } else if (entry.getName().equals("meta.xml")) {
            meta.parse(zip, new DefaultHandler(), metadata, context);
            parsedMetaData = true;
            if (delayLoadContent) {
                if (content instanceof OpenDocumentContentParser) {
                    ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
                } else {
                    // Foreign content parser was set:
                    content.parse(zip, handler, metadata, context);
                }
            }
        } else if (entry.getName().endsWith("content.xml")) {
            if (!parsedMetaData) {
                delayLoadContent = true;
            } else {
                if (content instanceof OpenDocumentContentParser) {
                    ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
                } else {
                    // Foreign content parser was set:
                    content.parse(zip, handler, metadata, context);
                }
            }
        }

works as expected.

-- This message was sent by Atlassian JIRA (v6.2#6252)
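The TODO quoted in the comment above suggests opening a ZipFile when a file is available, so that meta.xml can be visited before content.xml no matter where each sits in the archive. The following is a minimal sketch of that idea using only java.util.zip; it is not Tika's implementation. The class name OdfEntryOrder and its methods are hypothetical, and the two-entry archive is a stand-in for a real OpenDocument file.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika: shows random access by entry name.
public class OdfEntryOrder {

    // Reads meta.xml first, then content.xml, regardless of archive order.
    public static List<String> readMetaThenContent(File file) throws IOException {
        List<String> visited = new ArrayList<>();
        try (ZipFile zip = new ZipFile(file)) {
            for (String name : new String[] {"meta.xml", "content.xml"}) {
                ZipEntry entry = zip.getEntry(name); // null if the entry is absent
                if (entry == null) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    visited.add(name + "=" + readAll(in));
                }
            }
        }
        return visited;
    }

    // Writes a test archive with content.xml stored BEFORE meta.xml,
    // which is the ordering that loses metadata in a streaming parse.
    public static File writeSampleZip() throws IOException {
        File file = File.createTempFile("sample", ".odt");
        file.deleteOnExit();
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(file))) {
            putEntry(out, "content.xml", "<content/>");
            putEntry(out, "meta.xml", "<meta/>");
        }
        return file;
    }

    private static void putEntry(ZipOutputStream out, String name, String body)
            throws IOException {
        out.putNextEntry(new ZipEntry(name));
        out.write(body.getBytes(StandardCharsets.UTF_8));
        out.closeEntry();
    }

    private static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readMetaThenContent(writeSampleZip()));
    }
}
```

Unlike the delayLoadContent approach in the issue description, this needs a file on disk (the tis.hasFile() case in the commented-out code); a pure stream would still have to be read in archive order or buffered.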
[jira] [Comment Edited] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata
[ https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041639#comment-14041639 ] Steve R edited comment on TIKA-1353 at 6/24/14 3:09 AM: Ignore my suggested code example, it clearly doesn't work. My question is now this, why is the following code commented out? It seems to work. /* ZipFile zipFile; if (stream instanceof TikaInputStream) { TikaInputStream tis = (TikaInputStream) stream; Object container = ((TikaInputStream) stream).getOpenContainer(); if (container instanceof ZipFile) { zipFile = (ZipFile) container; else if (tis.hasFile()) { zipFile = new ZipFile(tis.getFile()); } } */ // TODO: if incoming IS is a TIS with a file // associated, we should open ZipFile so we can // visit metadata, mimetype first; today we lose // all the metadata if meta.xml is hit after // content.xml in the stream. Then we can still // read-once for the content.xml. was (Author: svramusi): Ignore my suggested code example, it clearly doesn't work. My question is now this, why is the following code commented out? It seems to work. /* * ZipFile zipFile; if (stream instanceof TikaInputStream) { TikaInputStream tis = (TikaInputStream) stream; * Object container = ((TikaInputStream) stream).getOpenContainer(); if (container instanceof ZipFile) { zipFile * = (ZipFile) container; } else if (tis.hasFile()) { zipFile = new ZipFile(tis.getFile()); } } */ // TODO: if incoming IS is a TIS with a file // associated, we should open ZipFile so we can // visit metadata, mimetype first; today we lose // all the metadata if meta.xml is hit after // content.xml in the stream. Then we can still // read-once for the content.xml. 
[jira] [Comment Edited] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata
[ https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041639#comment-14041639 ] Steve R edited comment on TIKA-1353 at 6/24/14 3:09 AM: Ignore my suggested code example, it clearly doesn't work. My question is now this, why is the following code commented out? It seems to work. /* ZipFile zipFile; if (stream instanceof TikaInputStream) { TikaInputStream tis = (TikaInputStream) stream; Object container = ((TikaInputStream) stream).getOpenContainer(); if (container instanceof ZipFile) { zipFile = (ZipFile) container; } else if (tis.hasFile()) { zipFile = new ZipFile(tis.getFile()); } } */ // TODO: if incoming IS is a TIS with a file // associated, we should open ZipFile so we can // visit metadata, mimetype first; today we lose // all the metadata if meta.xml is hit after // content.xml in the stream. Then we can still // read-once for the content.xml. was (Author: svramusi): Ignore my suggested code example, it clearly doesn't work. My question is now this, why is the following code commented out? It seems to work. /* ZipFile zipFile; if (stream instanceof TikaInputStream) { TikaInputStream tis = (TikaInputStream) stream; Object container = ((TikaInputStream) stream).getOpenContainer(); if (container instanceof ZipFile) { zipFile = (ZipFile) container; else if (tis.hasFile()) { zipFile = new ZipFile(tis.getFile()); } } */ // TODO: if incoming IS is a TIS with a file // associated, we should open ZipFile so we can // visit metadata, mimetype first; today we lose // all the metadata if meta.xml is hit after // content.xml in the stream. Then we can still // read-once for the content.xml. 