[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata
[ https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15169968#comment-15169968 ]

Jeremy B. Merrill commented on TIKA-1865:
-----------------------------------------

My heart wants to say yes, but my calendar says no. :) Or at least not any time super soon. You're right that this is a ticket that's interesting to me, though.

I did just get my own dump of real-life .msg files (not shareable, unfortunately) and I've noticed how senders' email addresses seem to get lost, which is a pain... Is this just a feature that is not yet implemented? Or is there an underlying reason why? (Funnily enough, it matches the behavior of Outlook printouts, which give you only the sender's alias, not their address -- including, most annoyingly for me, in the dumps of Hillary Clinton's emails that the State Dept. has been releasing.)

Do we know if all the various email formats include the sender's email address, so it'd be theoretically accessible to Tika somehow? What even are all the formats for emails that Tika handles? Outlook (PST/MSG), .eml/rfc822, mbox, anything else?

> Save sender email address in Outlook MSG metadata
> -------------------------------------------------
>
>                 Key: TIKA-1865
>                 URL: https://issues.apache.org/jira/browse/TIKA-1865
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Windows 7 x64, jre 1.8.0_60 x64
>            Reporter: Luis Filipe Nassif
>
> Sender email address is lost when extracting metadata from Outlook msg files.
> Currently only sender name is extracted. That is important information to
> extract for search engines.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
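To make the alias-versus-address distinction concrete for the .eml/rfc822 case only, here is a minimal sketch using Python's standard `email` package. It says nothing about how MSG files store the sender or about how Tika handles either format; the sample message, name and address are made up for illustration.

```python
from email import message_from_string
from email.utils import parseaddr
from typing import Tuple

# Hypothetical message; the name and address are invented for this example.
RAW = (
    "From: Jane Doe <jane.doe@example.org>\n"
    "To: someone@example.org\n"
    "Subject: test\n"
    "\n"
    "body\n"
)


def sender_parts(raw_message: str) -> Tuple[str, str]:
    """Return (display_name, address) taken from the From header."""
    msg = message_from_string(raw_message)
    # parseaddr splits "Name <addr>" into its two parts.
    return parseaddr(msg.get("From", ""))


name, address = sender_parts(RAW)
print(name)     # the alias, which is all an Outlook printout shows
print(address)  # the address, which is what a search engine would want
```

In an rfc822 message both parts live in the same header, so a parser that keeps only the display name is discarding information that is present in the file.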
[jira] [Updated] (TIKA-1771) lower magic priority xhtml magic priority to ensure emails detected as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy B. Merrill updated TIKA-1771:
------------------------------------
    Description: 
Emails I have (happy to share if you want) contain XHTML, as one part of a multipart email. Prior to this pull request, the priority on the application/xhtml+xml magic detector was 50, equal to the priority on the message/rfc822 detector. Because of the relative position of the two detectors in tika-mimetypes.xml, the emails were incorrectly detected as XHTML documents.

With this PR, by downgrading the priority of application/xhtml+xml to 40, the more-sensitive email magic detectors take precedence, causing the emails to be properly detected as message/rfc822.

I have not run this thru the govdocs tester or anything other than my own documents, so, full disclosure, this could cause false negative xhtml-detections elsewhere.

I should note this occurs on trunk, from Github, up-to-date as of Tuesday-ish.

  was:
Emails I have (happy to share if you want) contain XHTML, as one part of a multipart email. Prior to this pull request, the priority on the application/xhtml+xml magic detector was 50, equal to the priority on the message/rfc822 detector. Because of the relative position of the two detectors in tika-mimetypes.xml, the emails were incorrectly detected as XHTML documents.

With this PR, by downgrading the priority of application/xhtml+xml to 40, the more-sensitive email magic detectors take precedence, causing the emails to be properly detected as message/rfc822.

I have not run this thru the govdocs tester or anything other than my own documents, so, full disclosure, this could cause false negative xhtml-detections elsewhere.

> lower magic priority xhtml magic priority to ensure emails detected as message/rfc822
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-1771
>                 URL: https://issues.apache.org/jira/browse/TIKA-1771
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector
>            Reporter: Jeremy B. Merrill
>            Priority: Critical
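The priority change described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Tika's detector: the `MagicRule` class, the `detect` function, the byte patterns and the tie-break rule (first listed wins on equal priority) are all hypothetical stand-ins for whatever tika-mimetypes.xml and Tika's matching code actually do.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class MagicRule:
    mime_type: str
    priority: int
    needles: Tuple[bytes, ...]  # byte strings looked for near the start of the data


def detect(data: bytes, rules: Sequence[MagicRule], window: int = 4096) -> Optional[str]:
    """Return the MIME type of the highest-priority rule that matches."""
    head = data[:window]
    best: Optional[MagicRule] = None
    for rule in rules:
        if not any(needle in head for needle in rule.needles):
            continue
        # Strictly greater: when priorities tie, the rule listed first wins
        # (an assumption of this sketch, standing in for file position).
        if best is None or rule.priority > best.priority:
            best = rule
    return best.mime_type if best else None


# Hypothetical multipart email whose body contains an XHTML part.
EMAIL = (
    b"Received: from mail.example.org\r\n"
    b"Content-Type: multipart/alternative; boundary=x\r\n"
    b"\r\n"
    b"--x\r\n"
    b"Content-Type: application/xhtml+xml\r\n"
    b"\r\n"
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>hi</body></html>\r\n'
    b"--x--\r\n"
)

XHTML_NEEDLES = (b"<html xmlns=",)              # illustrative pattern only
EMAIL_NEEDLES = (b"Received:", b"Message-ID:")  # illustrative patterns only

before = [
    MagicRule("application/xhtml+xml", 50, XHTML_NEEDLES),
    MagicRule("message/rfc822", 50, EMAIL_NEEDLES),
]
after = [
    MagicRule("application/xhtml+xml", 40, XHTML_NEEDLES),
    MagicRule("message/rfc822", 50, EMAIL_NEEDLES),
]

print(detect(EMAIL, before))  # tie at 50: the XHTML rule, listed first, wins
print(detect(EMAIL, after))   # XHTML lowered to 40: the email rule wins
```

The same model also shows the risk flagged in the description: any file that matches both rule sets will now be classified as an email, which is where false-negative XHTML detections could come from.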
[jira] [Created] (TIKA-1771) lower magic priority xhtml magic priority to ensure emails detected as message/rfc822
Jeremy B. Merrill created TIKA-1771:
---------------------------------------

             Summary: lower magic priority xhtml magic priority to ensure emails detected as message/rfc822
                 Key: TIKA-1771
                 URL: https://issues.apache.org/jira/browse/TIKA-1771
             Project: Tika
          Issue Type: Improvement
          Components: detector
            Reporter: Jeremy B. Merrill
            Priority: Critical

Emails I have (happy to share if you want) contain XHTML, as one part of a multipart email. Prior to this pull request, the priority on the application/xhtml+xml magic detector was 50, equal to the priority on the message/rfc822 detector. Because of the relative position of the two detectors in tika-mimetypes.xml, the emails were incorrectly detected as XHTML documents.

With this PR, by downgrading the priority of application/xhtml+xml to 40, the more-sensitive email magic detectors take precedence, causing the emails to be properly detected as message/rfc822.

I have not run this thru the govdocs tester or anything other than my own documents, so, full disclosure, this could cause false negative xhtml-detections elsewhere.
[jira] [Commented] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619117#comment-14619117 ]

Jeremy B. Merrill commented on TIKA-1602:
-----------------------------------------

Looks like the possible values are:

```
Status: O
Status:
Status: U
Status: O
Status: R
Status: RO
Status: U
Status: U
```

> Detecting standards-non-compliant emails as message/rfc822
> ----------------------------------------------------------
>
>                 Key: TIKA-1602
>                 URL: https://issues.apache.org/jira/browse/TIKA-1602
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>            Reporter: Jeremy B. Merrill
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.10
>         Attachments: 036491.txt.zip
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Tika does not properly detect certain emails as `message/rfc822` if they're
> slightly standards-non-compliant and begin with `Status: ` as the first header.
> I've added `Status: ` as a magic detection line in tika-mimetypes.xml. This
> solves my problem and does not appear to cause unit test failures. I have not
> yet run the tika-batch tests.
>
> As further information, the emails that are processed incorrectly come from
> dumps directly from various US public officials' mailservers. The dumps, I
> believe since they're not intended to be transmitted over the wire, sometimes
> are slightly non-compliant.
>
> It's important to note that Tika (and the underlying library, James Mime4J) do
> properly *parse* these emails, despite the non-compliant header. The problem is
> getting Tika to *detect* the file as an email so that Mime4J gets chosen to
> parse it.
>
> Pull request on Github at https://github.com/apache/tika/pull/40
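A list of values like the one above can be produced by tallying the first line of every message in a dump. The following is only a sketch of one way to do that; the helper names and the sample messages are hypothetical, and nothing here reflects how the list in the comment was actually generated.

```python
from collections import Counter
from typing import Iterable, Optional


def leading_status(data: bytes) -> Optional[str]:
    """Return the value of a Status header if it is the very first line."""
    first_line = data.split(b"\n", 1)[0].rstrip(b"\r")
    if not first_line.startswith(b"Status:"):
        return None
    return first_line[len(b"Status:"):].strip().decode("ascii", "replace")


def tally_status(messages: Iterable[bytes]) -> Counter:
    """Count each distinct leading Status value across a set of messages."""
    counts: Counter = Counter()
    for data in messages:
        value = leading_status(data)
        if value is not None:
            counts[value] += 1
    return counts


# Made-up messages standing in for files read from a dump.
SAMPLES = [
    b"Status: RO\r\nFrom: a@example.org\r\n\r\nbody",
    b"Status: U\nFrom: b@example.org\n\nbody",
    b"Status: \nFrom: c@example.org\n\nbody",
    b"From: d@example.org\n\nbody",
]

print(tally_status(SAMPLES))
```

An empty value is counted under the key `""`, which matters here because the list above includes a bare `Status:` line.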
[jira] [Commented] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612321#comment-14612321 ]

Jeremy B. Merrill commented on TIKA-1602:
-----------------------------------------

Thank you, [~chrismattmann], [~talli...@mitre.org] et al.!

[~talli...@mitre.org] -- got a bunch of normal headers, but also this `Status:` one. The only possible value in my dataset (a bunch of publicly-released emails from Jeb Bush's tenure as FL Gov) is `RO`, so the first line of each email that Tika treated improperly before this patch was uniformly `Status: RO`. I'm going to check the whole dataset once I manage to download it all again from storage, to make sure there are no values other than `RO`.

My understanding is that some mail servers use this header internally to keep track of read status. When emails are exported, they retain the header, and it sometimes appears first -- even though the server would never send this header over the wire.

> Detecting standards-non-compliant emails as message/rfc822
> ----------------------------------------------------------
>
>                 Key: TIKA-1602
>                 URL: https://issues.apache.org/jira/browse/TIKA-1602
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>            Reporter: Jeremy B. Merrill
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.10
>         Attachments: 036491.txt.zip
>   Original Estimate: 1h
>  Remaining Estimate: 1h
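The detect-versus-parse gap in this issue can be reproduced in miniature. In the sketch below, `LEADING_HEADERS` and `looks_like_email` are hypothetical and much simpler than Tika's magic rules; the parsing side uses Python's standard `email` package as a stand-in for Mime4J. The point is only that a parser can be happy with a message that a leading-header check rejects.

```python
from email import message_from_bytes

# Header names assumed, for illustration, to mark the start of an email.
# This is not Tika's real list.
LEADING_HEADERS = (b"From:", b"Received:", b"Return-Path:", b"Message-ID:", b"Date:")


def looks_like_email(data: bytes, extra_headers: tuple = ()) -> bool:
    """Crude detection: does the data start with a known header name?"""
    return data.startswith(LEADING_HEADERS + tuple(extra_headers))


# Made-up exported message with the server-internal header first.
EXPORTED = (
    b"Status: RO\r\n"
    b"From: Jane Doe <jane.doe@example.org>\r\n"
    b"Subject: hello\r\n"
    b"\r\n"
    b"body\r\n"
)

print(looks_like_email(EXPORTED))                               # detection misses it
print(looks_like_email(EXPORTED, extra_headers=(b"Status: ",)))  # detected once Status is added
print(message_from_bytes(EXPORTED)["Subject"])                  # parsing was never the problem
```

Adding `Status: ` to the accepted leading headers is the moral equivalent of the extra magic line in the pull request; whether other values than `RO` follow it does not matter to a prefix check like this one.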
[jira] [Updated] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy B. Merrill updated TIKA-1608:
------------------------------------
    Attachment: 1534-attachment.doc

document failing under this bug

> RuntimeException on extracting text from Word 97-2004 Document
> --------------------------------------------------------------
>
>                 Key: TIKA-1608
>                 URL: https://issues.apache.org/jira/browse/TIKA-1608
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Jeremy B. Merrill
>         Attachments: 1534-attachment.doc
>
> Extracting text from the Word 97-2004 document located here
> (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails
> with the following stacktrace:
>
> $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at java.lang.System.arraycopy(Native Method)
>     at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
>     at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
>     at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
>     at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
>     at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>     ... 5 more
>
> I'm using trunk from Github, which I think is a flavor of 1.9. The document
> opens properly in Word for Mac '11.
>
> Happy to answer questions; I'm also on the user mailing list. If it's
> relevant, I'm on java 1.7.0_55...
>
> (Also let me know if there's a way to put that document here in Jira rather
> than on my own dropbox.)
[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505102#comment-14505102 ]

Jeremy B. Merrill commented on TIKA-1608:
-----------------------------------------

POI bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=57843

> RuntimeException on extracting text from Word 97-2004 Document
> --------------------------------------------------------------
>
>                 Key: TIKA-1608
>                 URL: https://issues.apache.org/jira/browse/TIKA-1608
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Jeremy B. Merrill
>         Attachments: 1534-attachment.doc
[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505093#comment-14505093 ]

Jeremy B. Merrill commented on TIKA-1608:
-----------------------------------------

Hi Tim,

I added the document. I'm totally cool with the document being viewed by the public. I can't really grant it to the ASF since I didn't create it. It's an attachment from an email in an email dump (http://jebemail.com) posted by former Florida governor Jeb Bush. So whether it's usable is probably a question for the ASF's lawyers. But for the avoidance of doubt, I grant any rights that I might have in the document to the ASF.

I'll open a POI bug.

> RuntimeException on extracting text from Word 97-2004 Document
> --------------------------------------------------------------
>
>                 Key: TIKA-1608
>                 URL: https://issues.apache.org/jira/browse/TIKA-1608
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Jeremy B. Merrill
>         Attachments: 1534-attachment.doc
[jira] [Updated] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy B. Merrill updated TIKA-1608:
------------------------------------
    Description: 
Extracting text from the Word 97-2004 document attached here fails with the following stacktrace:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
    at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    ... 5 more

I'm using trunk from Github, which I think is a flavor of 1.9. The document opens properly in Word for Mac '11.

Happy to answer questions; I'm also on the user mailing list. If it's relevant, I'm on java 1.7.0_55...

(Also let me know if there's a way to put that document here in Jira rather than on my own dropbox.)

  was:
Extracting text from the Word 97-2004 document located here (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails with the following stacktrace:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
    at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    ... 5 more

I'm using trunk from Github, which I think is a flavor of 1.9. The document opens properly in Word for Mac '11.

Happy to answer questions; I'm also on the user mailing list. If it's relevant, I'm on java 1.7.0_55...

(Also let me know if there's a way to put that document here in Jira rather than on my own dropbox.)

> RuntimeException on extracting text from Word 97-2004 Document
> --------------------------------------------------------------
>
>                 Key: TIKA-1608
>                 URL: https://issues.apache.org/jira/browse/TIKA-1608
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Jeremy B. Merrill
>         Attachments: 1534-attachment.doc
[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
[ https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505178#comment-14505178 ]

Jeremy B. Merrill commented on TIKA-1608:
-----------------------------------------

It's the only one I've found so far out of 300,000ish documents (most of which are plain emails, few of which are .docs).

> RuntimeException on extracting text from Word 97-2004 Document
> --------------------------------------------------------------
>
>                 Key: TIKA-1608
>                 URL: https://issues.apache.org/jira/browse/TIKA-1608
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Jeremy B. Merrill
>         Attachments: 1534-attachment.doc
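Finding a single failing file among hundreds of thousands depends on the run not stopping at the first exception. Below is a generic sketch of that pattern; `run_corpus` and `fake_extract` are hypothetical names, and the stand-in extractor only imitates a parser that fails on one document. It is not how tika-batch or the reporter's own tooling works.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple


def run_corpus(
    paths: Iterable[Path], extract: Callable[[Path], str]
) -> Tuple[Dict[Path, str], List[Tuple[Path, str]]]:
    """Extract text from every path, recording failures instead of aborting."""
    ok: Dict[Path, str] = {}
    failed: List[Tuple[Path, str]] = []
    for path in paths:
        try:
            ok[path] = extract(path)
        except Exception as exc:  # keep going; one bad file should not end the run
            failed.append((path, f"{type(exc).__name__}: {exc}"))
    return ok, failed


def fake_extract(path: Path) -> str:
    """Stand-in extractor that fails on one file name, for demonstration only."""
    if path.name == "1534-attachment.doc":
        raise IndexError("array index out of bounds")
    return f"text of {path.name}"


paths = [Path("a.eml"), Path("1534-attachment.doc"), Path("b.eml")]
ok, failed = run_corpus(paths, fake_extract)
print(len(ok), "succeeded")
for path, reason in failed:
    print(path, "failed:", reason)
```

Keeping the exception type alongside the path makes it easy to group failures later and to tell a one-off like this document from a systematic problem.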
[jira] [Created] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document
Jeremy B. Merrill created TIKA-1608:
---------------------------------------

             Summary: RuntimeException on extracting text from Word 97-2004 Document
                 Key: TIKA-1608
                 URL: https://issues.apache.org/jira/browse/TIKA-1608
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.9
            Reporter: Jeremy B. Merrill

Extracting text from the Word 97-2004 document located here (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails with the following stacktrace:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
    at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    ... 5 more

I'm using trunk from Github, which I think is a flavor of 1.9. The document opens properly in Word for Mac '11.

Happy to answer questions; I'm also on the user mailing list. If it's relevant, I'm on java 1.7.0_55...

(Also let me know if there's a way to put that document here in Jira rather than on my own dropbox.)
[jira] [Commented] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492540#comment-14492540 ]

Jeremy B. Merrill commented on TIKA-1602:
-----------------------------------------

Sounds about right, thanks for finding that for me. I'll go ahead and mark the issue a dupe or close it.

Any idea when that patch'll get merged into trunk? (Or -- since I'm an svn n00b -- if there's a way for me to download that patched version.)

> Detecting standards-non-compliant emails as message/rfc822
> ----------------------------------------------------------
>
>                 Key: TIKA-1602
>                 URL: https://issues.apache.org/jira/browse/TIKA-1602
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Jeremy B. Merrill
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
[jira] [Closed] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822
[ https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy B. Merrill closed TIKA-1602.
-----------------------------------
    Resolution: Duplicate

> Detecting standards-non-compliant emails as message/rfc822
> ----------------------------------------------------------
>
>                 Key: TIKA-1602
>                 URL: https://issues.apache.org/jira/browse/TIKA-1602
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Jeremy B. Merrill
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
[jira] [Created] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822
Jeremy B. Merrill created TIKA-1602:
---------------------------------------

             Summary: Detecting standards-non-compliant emails as message/rfc822
                 Key: TIKA-1602
                 URL: https://issues.apache.org/jira/browse/TIKA-1602
             Project: Tika
          Issue Type: New Feature
            Reporter: Jeremy B. Merrill
            Priority: Minor

Tika does not properly detect certain emails as `message/rfc822` if they're slightly standards-non-compliant and begin with `Status: ` as the first header. I've added `Status: ` as a magic detection line in tika-mimetypes.xml. This solves my problem and does not appear to cause unit test failures. I have not yet run the tika-batch tests.

As further information, the emails that are processed incorrectly come from dumps directly from various US public officials' mailservers. The dumps, I believe since they're not intended to be transmitted over the wire, sometimes are slightly non-compliant.

It's important to note that Tika (and the underlying library, James Mime4J) do properly *parse* these emails, despite the non-compliant header. The problem is getting Tika to *detect* the file as an email so that Mime4J gets chosen to parse it.

Pull request on Github at https://github.com/apache/tika/pull/40